Hi there!
I am attempting to restore a physical backup and then do a PITR oplog replay on top of it, but the oplog replay keeps failing. Every attempt ends with one (or more) shards reporting that the sharding status of the collection currently being restored is not known, and the replay aborts.
Log:
2023-06-22T15:37:34Z E [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"]} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
2023-06-22T15:37:35Z E [cfgsrv/site1-cfgsrv1:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: waiting for running: cluster failed: waiting for running: cluster failed: waiting for running: cluster failed: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"]} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
pbm logs -sD:
2023-06-22T15:37:33Z I [sh3/site2-db3:27017] got command replay <ts: 1687448252>
2023-06-22T15:37:33Z I [sh1/site1-db1:27017] [replay/2023-06-22T15:37:32.140759414Z] time: 2023-06-22T08:18:33Z-2023-06-22T08:20:00Z
2023-06-22T15:37:33Z I [sh3/site2-db3:27017] got epoch {1687447708 1961}
2023-06-22T15:37:33Z I [sh1/site1-db1:27017] [replay/2023-06-22T15:37:32.140759414Z] node in not suitable for restore
2023-06-22T15:37:33Z I [sh3/site2-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] time: 2023-06-22T08:18:33Z-2023-06-22T08:20:00Z
2023-06-22T15:37:33Z I [sh3/site2-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] node in not suitable for restore
2023-06-22T15:37:33Z I [cfgsrv/site1-cfgsrv1:27017] [replay/2023-06-22T15:37:32.140759414Z] moving to state running
2023-06-22T15:37:33Z I [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] moving to state running
2023-06-22T15:37:33Z I [sh2/site1-db2:27017] [replay/2023-06-22T15:37:32.140759414Z] moving to state running
2023-06-22T15:37:34Z I [sh1/site3-db1:27017] [replay/2023-06-22T15:37:32.140759414Z] moving to state running
2023-06-22T15:37:34Z I [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] starting oplog replay
2023-06-22T15:37:34Z D [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] starting sharded txn sync
2023-06-22T15:37:34Z D [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] + applying {sh3 pbmPitr/sh3/20230622/20230622081547-70.20230622081929-1.oplog.s2 s2 {1687421747 70} {1687421969 1} 7041}
2023-06-22T15:37:34Z D [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] + applying {sh3 pbmPitr/sh3/20230622/20230622081929-1.20230622083200-12.oplog.s2 s2 {1687421969 1} {1687422720 12} 51542}
2023-06-22T15:37:34Z D [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] exit waitingTxnChecker
2023-06-22T15:37:34Z D [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] exit distTxnChecker
2023-06-22T15:37:34Z E [sh3/site1-db3:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"],"Query":[{"Key":"_id","Value":{"Subtype":3,"Data":"jUEOyiIZ5sDkI7sI0DaLvQ=="}}],"UI":{"Subtype":4,"Data":"uQaTmtZCQNKLLG1TfjJxrw=="},"LSID":"SAAAAAVpZAAQAAAABOtjZj+ZZ0VhlqN8jTxlFYwFdWlkACAAAAAAj8EN34MjH7wHaXMcd5BtC6c6MFz56gFtF3PWt9HdCMYA","TxnNumber":18,"PrevOpTime":"HAAAABF0cwAAAAAAAAAAABJ0AP//////////AA=="} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
2023-06-22T15:37:34Z E [sh2/site1-db2:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: waiting for running: cluster failed: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"],"Query":[{"Key":"_id","Value":{"Subtype":3,"Data":"jUEOyiIZ5sDkI7sI0DaLvQ=="}}],"UI":{"Subtype":4,"Data":"uQaTmtZCQNKLLG1TfjJxrw=="},"LSID":"SAAAAAVpZAAQAAAABOtjZj+ZZ0VhlqN8jTxlFYwFdWlkACAAAAAAj8EN34MjH7wHaXMcd5BtC6c6MFz56gFtF3PWt9HdCMYA","TxnNumber":18,"PrevOpTime":"HAAAABF0cwAAAAAAAAAAABJ0AP//////////AA=="} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
2023-06-22T15:37:35Z E [sh1/site3-db1:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: waiting for running: cluster failed: waiting for running: cluster failed: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"],"Query":[{"Key":"_id","Value":{"Subtype":3,"Data":"jUEOyiIZ5sDkI7sI0DaLvQ=="}}],"UI":{"Subtype":4,"Data":"uQaTmtZCQNKLLG1TfjJxrw=="},"LSID":"SAAAAAVpZAAQAAAABOtjZj+ZZ0VhlqN8jTxlFYwFdWlkACAAAAAAj8EN34MjH7wHaXMcd5BtC6c6MFz56gFtF3PWt9HdCMYA","TxnNumber":18,"PrevOpTime":"HAAAABF0cwAAAAAAAAAAABJ0AP//////////AA=="} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
2023-06-22T15:37:35Z E [cfgsrv/site1-cfgsrv1:27017] [replay/2023-06-22T15:37:32.140759414Z] oplog replay: waiting for running: cluster failed: waiting for running: cluster failed: waiting for running: cluster failed: replay chunk 1687421969.1687422720: apply oplog for chunk: applying an entry: op: {"Timestamp":{"T":1687421976,"I":3},"Term":12,"Hash":null,"Version":2,"Operation":"i","Namespace":"some-database.some-collection","Object":["REDACTED"],"Query":[{"Key":"_id","Value":{"Subtype":3,"Data":"jUEOyiIZ5sDkI7sI0DaLvQ=="}}],"UI":{"Subtype":4,"Data":"uQaTmtZCQNKLLG1TfjJxrw=="},"LSID":"SAAAAAVpZAAQAAAABOtjZj+ZZ0VhlqN8jTxlFYwFdWlkACAAAAAAj8EN34MjH7wHaXMcd5BtC6c6MFz56gFtF3PWt9HdCMYA","TxnNumber":18,"PrevOpTime":"HAAAABF0cwAAAAAAAAAAABJ0AP//////////AA=="} | merr <nil>: applyOps: (StaleConfig) sharding status of collection some-database.some-collection is not currently known and needs to be recovered
Because we only take physical backups plus PITR oplog slices, point-in-time recovery is configured in oplog-only mode:
pbm config --set pitr.enabled=true
pbm config --set pitr.oplogOnly=true
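For reference, assuming the usual mapping of dotted keys to the YAML config file, those two settings correspond to this section of the PBM configuration:
pitr:
  enabled: true
  oplogOnly: true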
PBM status:
Cluster:
========
sh3:
- sh3/site1-db3:27017 [P]: pbm-agent v2.1.0 OK
- sh3/site2-db3:27017 [S]: pbm-agent v2.1.0 OK
- sh3/site3-db3:27017 [S]: pbm-agent v2.1.0 OK
sh2:
- sh2/site1-db2:27017 [P]: pbm-agent v2.1.0 OK
- sh2/site2-db2:27017 [S]: pbm-agent v2.1.0 OK
- sh2/site3-db2:27017 [S]: pbm-agent v2.1.0 OK
cfgsrv:
- cfgsrv/site1-cfgsrv1:27017 [P]: pbm-agent v2.1.0 OK
- cfgsrv/site2-cfgsrv1:27017 [S]: pbm-agent v2.1.0 OK
- cfgsrv/site3-cfgsrv1:27017 [S]: pbm-agent v2.1.0 OK
sh1:
- sh1/site1-db1:27017 [S]: pbm-agent v2.1.0 OK
- sh1/site2-db1:27017 [S]: pbm-agent v2.1.0 OK
- sh1/site3-db1:27017 [P]: pbm-agent v2.1.0 OK
PITR incremental backup:
========================
Status [OFF]
Currently running:
==================
(none)
Backups:
========
FS /backup/db
Snapshots:
2023-06-22T08:19:31Z 15.84GB <physical> [restore_to_time: 2023-06-22T08:19:34Z]
2023-06-21T09:44:41Z 15.86GB <physical> [restore_to_time: 2023-06-21T09:44:44Z]
PITR chunks [6.08GB]:
2023-06-21T09:44:45Z - 2023-06-22T08:52:00Z
2023-05-15T07:56:12Z - 2023-06-21T09:44:44Z (no base snapshot)
The procedure I followed is below:
# Begin restore
pbm restore 2023-06-22T08:19:31Z
# Monitor restore
pbm describe-restore 2023-06-22T15:07:49.510696956Z -c pbm_config.yaml
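# (Aside: a sketch of how this wait could be scripted instead of re-running the command by hand;
#  it assumes grep-ing the YAML output of describe-restore for "status: done" is reliable enough)
until pbm describe-restore 2023-06-22T15:07:49.510696956Z -c pbm_config.yaml | grep -q 'status: done'; do sleep 10; done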
# Once describe-restore reports "status: done", restart mongod and pbm-agent on all shard and config server nodes
systemctl restart mongod; systemctl restart pbm-agent
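# (Sketch: fanning the restarts out from one host over SSH, assuming root SSH access;
#  hostnames as listed in the pbm status output above)
for h in site{1,2,3}-db{1,2,3} site{1,2,3}-cfgsrv1; do
  ssh "$h" 'systemctl restart mongod && systemctl restart pbm-agent'
done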
# Force a resync of backup metadata from the storage
pbm config --force-resync
# Replay the oplog, starting 1 second before the "restore_to_time" of the physical backup I just restored
pbm oplog-replay --start="2023-06-22T08:19:33" --end="2023-06-22T08:20:00"
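The start timestamp in that last step is simply the snapshot's restore_to_time minus one second; a minimal sketch of deriving it in a script (assuming GNU date) would be:
# restore_to_time of the snapshot just restored, copied from pbm status
RESTORE_TO="2023-06-22T08:19:34Z"
# start the replay 1 second earlier (epoch arithmetic to avoid date-parsing quirks)
START=$(date -u -d "@$(( $(date -u -d "$RESTORE_TO" +%s) - 1 ))" +"%Y-%m-%dT%H:%M:%S")
pbm oplog-replay --start="$START" --end="2023-06-22T08:20:00"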
OS: Rocky Linux release 9.2 (we’ve also tested it on Ubuntu 22.04)
PBM: percona-backup-mongodb-2.1.0
Percona MongoDB: 6.0.6-5
SELinux off
Please advise.