PBM restores not working

tl;dr - restoring a sharded MongoDB backup created via Percona Backup for MongoDB does not work as expected.

Description

Attempts to restore a MongoDB backup that was downloaded from S3 to the filesystem are failing.

The error message that is printed is:

..Error: waiting for start: cluster failed: prepare snapshot: failed to ensure snapshot file 2021-08-04T11:50:33Z_shardrs01.dump.s2: no such file

However, the file 2021-08-04T11:50:33Z_shardrs01.dump.s2 exists within the backup path:

$ ls -la
total 14388300
drwxr-xr-x 2 pbm  pbm        4096 Aug  4 18:48 .
drwxr-xr-x 4 pbm  pbm        4096 Aug  4 18:01 ..
-rw-r--r-- 1 pbm  pbm           5 Aug  4 16:47 .pbm.init
-rw-r--r-- 1 root root       3158 Aug  4 18:48 2021-08-04T11:50:33Z.pbm.json
-rw-r--r-- 1 root root       3289 Aug  4 18:47 2021-08-04T11:50:33Z.pbm.json.backup
-rw-r--r-- 1 pbm  pbm     1756198 Aug  4 17:20 2021-08-04T11:50:33Z_configrs.dump.s2
-rw-r--r-- 1 pbm  pbm       43855 Aug  4 17:25 2021-08-04T11:50:33Z_configrs.oplog.s2
-rw-r--r-- 1 pbm  pbm  7300133194 Aug  4 17:20 2021-08-04T11:50:33Z_shardrs01.dump.s2
-rw-r--r-- 1 pbm  pbm      203819 Aug  4 17:25 2021-08-04T11:50:33Z_shardrs01.oplog.s2
-rw-r--r-- 1 pbm  pbm  7431279565 Aug  4 17:20 2021-08-04T11:50:33Z_shardrs02.dump.s2
-rw-r--r-- 1 pbm  pbm      159863 Aug  4 17:25 2021-08-04T11:50:33Z_shardrs02.oplog.s2

Basically, pbm says that a required file doesn’t exist, but the file is clearly present in the backup path.

What was done

  • We had configured Percona Backup for MongoDB to back up our production sharded MongoDB database to S3. The configuration for that is as follows:
pitr:
  enabled: false
storage:
  type: s3
  s3:
    provider: aws
    region: $region
    bucket: $bucket
    prefix: $prefix
    credentials:
      access-key-id: '#SECRET'
      secret-access-key: '#SECRET'
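
For reference, a config like the one above can be applied with pbm config --file (the file path and connection URI below are placeholders, not our actual values):

# point the pbm CLI at the cluster; for a sharded cluster this is the config server replica set
export PBM_MONGODB_URI='mongodb://pbmuser:***@cfg-node:27019/?replicaSet=configrs'
# apply the storage configuration shown above
pbm config --file=/etc/pbm/pbm-config.yaml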

Our end goal was to restore this backup to an identical sharded MongoDB cluster in a non-production environment. For that, we have this configuration:

pitr:
  enabled: false
storage:
  type: filesystem
  filesystem:
    path: /var/lib/pbm/backups/
restore:
  batchSize: 350
  numInsertionWorkers: 1

For the above config, we have ensured that the /var/lib/pbm directory is owned by the pbm user.
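
Concretely, that ownership setup boils down to something like this on each node (a generic shell sketch):

# make sure the pbm user owns the backup directory, then verify
sudo chown -R pbm:pbm /var/lib/pbm
ls -ld /var/lib/pbm/backups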

Once both configurations were set up, we made a backup of the production cluster.

On the staging side, we downloaded those backups from S3 to the pbm backup path on the filesystem, which was /var/lib/pbm/backups.
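
A download like that can be done, for example, with the AWS CLI (bucket/prefix placeholders as in the config above; the exact command here is illustrative):

# copy the backup artifacts from S3 into the local pbm filesystem path
aws s3 sync s3://$bucket/$prefix/ /var/lib/pbm/backups/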

We altered the store key of that backup’s .pbm.json file as follows, to let pbm know that the backup is now stored on the filesystem:

{
  "type": "filesystem",
  "s3": {},
  "azure": {},
  "filesystem": {
    "path": "/var/lib/pbm/backups/"
  }
}
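
One way to make an edit like that is with jq, keeping the original metadata as a .backup copy (an illustrative sketch, not necessarily the exact steps used):

# preserve the original metadata, then rewrite its store key
cp 2021-08-04T11:50:33Z.pbm.json 2021-08-04T11:50:33Z.pbm.json.backup
jq '.store = {"type": "filesystem", "s3": {}, "azure": {}, "filesystem": {"path": "/var/lib/pbm/backups/"}}' \
  2021-08-04T11:50:33Z.pbm.json.backup > 2021-08-04T11:50:33Z.pbm.json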

Once that was done, we issued

pbm config --force-resync

This made the backup show up in pbm list:

$ pbm list
Backup snapshots:
  2021-08-04T11:50:33Z [complete: 2021-08-04T11:55:30]
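
As an additional sanity check, pbm status can also be used to confirm the configured storage and which agents are up:

# shows the cluster's agents, the configured storage, and the snapshots it can see
pbm status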

What happened

We tried to issue

pbm restore '2021-08-04T11:50:33Z'

This resulted in the following error:

$ pbm restore '2021-08-04T11:50:33Z'                                                  
..Error: waiting for start: cluster failed: prepare snapshot: failed to ensure snapshot file 2021-08-04T11:50:33Z_shardrs01.dump.s2: no such file

We’re completely baffled by this issue; any pointers would be of immense help.

Hi @Prashant_Warrier

Do all agents on all shards have access to the very same /var/lib/pbm/backups/ with all files? Is it NFS?

All agents on all shards have access to the /var/lib/pbm/backups directory, in the sense that the directory exists on each node of each replica set in each shard.

It is not NFS.

Oh, I see. All agents have to have access to the very same directory with the same files. That means it should be either S3-like storage or some kind of network file system in the case of storage.type: filesystem. The idea is that you never know which node will make a backup. Even with the backup priority option there is no 100% guarantee, as the node may be down, etc. Moreover, the node that performs the restore will most probably be a different node than the one that made the backup: the restore has to run on the primary node, whereas for backups preference is given to the secondary nodes. So all agents should have access to all of the backup files.
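
For instance (hostnames and export paths below are purely illustrative), with storage.type: filesystem every node running pbm-agent would mount the same shared location at the configured path:

# on every mongod node that runs pbm-agent
sudo mount -t nfs nfs-server:/exports/pbm-backups /var/lib/pbm/backups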
