I performed a PITR restore of a physical backup on three nodes. By all indications the restore completed successfully: the size of /data is equivalent on all three nodes after the restore. The backup was taken with PBM 2.7.0 and restored with PBM 2.11.0 (I don't think this is a problem, as backward compatibility is supported per the PBM docs).
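For context, I checked the data directory sizes with a command along these lines on each node (not the exact session, just the check I ran; the reported sizes matched):

du -sh /data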
However, the pbm-agent on one of the three nodes (host-mongod-rs01-02) never logged that the recovery completed successfully, even though the timestamp of the mongod.lock generated on the problematic node aligns with one of the working nodes. Also, when I checked pbm.restore.log on the problematic node, I do see indications that the recovery completed successfully (see the grep output below).
My concern is why the pbm-agent on the problematic node was not shut down, or why its cleanup was incomplete: when I checked its logs after the restore, it looked as if the restore was still ongoing, which was misleading until I validated against /data and pbm.restore.log. Could this be a bug? I'm not sure why the behaviour wasn't consistent with the other nodes. I had to shut down the pbm-agent on this node manually. I'm sharing the logs below, hope they help. If this is a bug, please guide me on how to create a ticket for it.
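For completeness, this is how I stopped the stuck agent on the problematic node (standard systemd commands, assuming the default pbm-agent.service unit name that appears in the logs below):

systemctl stop pbm-agent
systemctl status pbm-agent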
pbm version (from pbm status):
root@host-rs01-02:~# pbm status
Cluster:
========
rs01:
- host-mongod-rs01-01.com:27018 [S]: pbm-agent [v2.11.0] OK
- host-mongod-rs01-02.com:27018 [P]: pbm-agent [v2.11.0] OK
- host-mongod-rs01-03.com:27018 [S]: pbm-agent [v2.11.0] OK
problematic node (host-mongod-rs01-02): the agent is still uploading heartbeat files at 09:31, hours after the restore finished at ~00:54
root@host-mongod-rs01-02:~# journalctl -u pbm-agent -f
Feb 04 09:25:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:25:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/rs.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/rs.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:31:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:31:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
root@host-mongod-rs01-02:/data# grep -i complete pbm.restore.log | tail
{"t":{"$date":"2026-02-04T00:54:11.561+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166451,"ts_usec":561782,"thread":"1417953:0x7492fd855d40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 354ms, including 293ms for the log replay, 47ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:13.042+00:00"},"s":"I", "c":"STORAGE", "id":7393205, "ctx":"initandlisten","msg":"Collection sampling complete","attr":{"namespace":"local.oplog.rs"}}
{"t":{"$date":"2026-02-04T00:54:13.462+00:00"},"s":"I", "c":"CONTROL", "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2522 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"76 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"1 ms","Rebuild indexes for collections":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"0 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up cluster time keys manager with a local/direct keys client":"0 ms","Start up the replication coordinator":"30 ms","Create an oplog view for tenant migrations":"0 ms","Start transport layer":"0 ms","_initAndListen total elapsed time":"2803 ms"}}}}
{"t":{"$date":"2026-02-04T00:54:13.469+00:00"},"s":"I", "c":"REPL", "id":21331, "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted","attr":{"term":24}}
root@host-mongod-rs01-02:/data# cat pbm.restore.log | grep -i error
{"t":{"$date":"2026-02-04T00:54:10.660+00:00"},"s":"I", "c":"STORAGE", "id":22315, "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=126993M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,remove=true,path=journal,compressor=snappy),builtin_extension_config=(zstd=(compression_level=6)),file_manager=(close_idle_time=600,close_scan_interval=10,close_handle_minimum=2000),statistics_log=(wait=0),json_output=(error,message),verbose=[recovery_progress:1,checkpoint_progress:1,compact_progress:1,backup:0,checkpoint:0,compact:0,evict:0,history_store:0,recovery:0,rts:0,salvage:0,tiered:0,timestamp:0,transaction:0,verify:0,log:0],"}}
{"t":{"$date":"2026-02-04T00:54:13.424+00:00"},"s":"I", "c":"-", "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.","nextWakeupMillis":200}}
{"t":{"$date":"2026-02-04T00:54:13.463+00:00"},"s":"I", "c":"REPL", "id":21392, "ctx":"ReplCoord-0","msg":"New replica set config in use","attr":{"config":{"_id":"rs01","version":1,"members":[{"_id":0,"host":"localhost:27048","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1}],"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0}}}}}
{"t":{"$date":"2026-02-04T10:31:11.430+00:00"},"s":"I", "c":"CONTROL", "id":23377, "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
root@host-mongod-rs01-02:/data# ls -alrth
-rw------- 1 mongod mongod 8 Feb 4 00:54 mongod.lock
-rw------- 1 mongod mongod 341K Feb 4 10:03 pbm.restore.log
working node 1 (host-mongod-rs01-03):
root@host-mongod-rs01-03:~# journalctl -u pbm-agent -f
Feb 04 00:55:41 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:41.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] wait for cluster status
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] no cleanup strategy to apply
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I [restore/2026-02-03T21:15:20.651401726Z] recovery successfully finished
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I change stream was closed
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] heartbeats stopped
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.cleanup.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [agentCheckup] deleting agent status
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I Exit: <nil>
Feb 04 00:55:46 host-mongod-rs01-03.com systemd[1]: pbm-agent.service: Deactivated successfully.
Feb 04 00:55:46 host-mongod-rs01-03.com systemd[1]: pbm-agent.service: Consumed 6h 42min 34.757s CPU time.
root@host-mongod-rs01-03:/data# grep -i recovery pbm.restore.log | grep -i complete
{"t":{"$date":"2026-02-04T00:54:41.217+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166481,"ts_usec":217986,"thread":"1343693:0x71bf9fedbd40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 363ms, including 301ms for the log replay, 47ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:43.162+00:00"},"s":"I", "c":"CONTROL", "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2619 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"77 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"1 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"0 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up the replication coordinator":"2 ms","Start transport layer":"1 ms","_initAndListen total elapsed time":"2857 ms"}}}}
{"t":{"$date":"2026-02-04T00:54:45.655+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":655560,"thread":"1343693:0x71bf8deb7640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 112ms, including 0ms for the rollback to stable, and 112ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:45.983+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":983563,"thread":"1343693:0x71bf8deb7640","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 301ms, including 241ms for the log replay, 45ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:45.995+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":995698,"thread":"1343693:0x71bf8deb7640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 6ms, including 0ms for the rollback to stable, and 6ms for the checkpoint."}}}
root@host-mongod-rs01-03:/data# ls -alrth
-rw------- 1 mongod mongod 0 Feb 4 00:54 mongod.lock
-rw------- 1 mongod mongod 56K Feb 4 00:54 pbm.restore.log
root@host-mongod-rs01-03:/data#
working node 2 (host-mongod-rs01-01):
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] no cleanup strategy to apply
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 I [restore/2026-02-03T21:15:20.651401726Z] recovery successfully finished
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 I change stream was closed
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] heartbeats stopped
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.cleanup.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] cleaning heartbeats stopped
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 I Exit: <nil>
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 D [agentCheckup] deleting agent status
Feb 04 00:55:47 host-mongod-rs01-01.com systemd[1]: pbm-agent.service: Deactivated successfully.
Feb 04 00:55:47 host-mongod-rs01-01.com systemd[1]: pbm-agent.service: Consumed 6h
root@host-mongod-rs01-01:/data# grep -i recovery pbm.restore.log | grep -i complete
{"t":{"$date":"2026-02-04T00:48:37.598+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166117,"ts_usec":598247,"thread":"1434838:0x7c533c0d9d40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 374ms, including 308ms for the log replay, 50ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:39.553+00:00"},"s":"I", "c":"CONTROL", "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2642 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"78 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"0 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"1 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up the replication coordinator":"1 ms","Start transport layer":"0 ms","_initAndListen total elapsed time":"2881 ms"}}}}
{"t":{"$date":"2026-02-04T00:48:42.022+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":22625,"thread":"1434838:0x7c532a1b6640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 114ms, including 0ms for the rollback to stable, and 113ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:42.366+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":366939,"thread":"1434838:0x7c532a1b6640","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 315ms, including 255ms for the log replay, 45ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:42.379+00:00"},"s":"I", "c":"WTRECOV", "id":22430, "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":379445,"thread":"1434838:0x7c532a1b6640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 6ms, including 0ms for the rollback to stable, and 6ms for the checkpoint."}}}
root@host-mongod-rs01-01:/data# ls -alrth
-rw------- 1 mongod mongod 0 Feb 4 00:48 mongod.lock
-rw------- 1 mongod mongod 54K Feb 4 00:48 pbm.restore.log