PBM agent internal cleanup incomplete after restore of physical backup

I performed a PITR restore of a physical backup on three nodes. From all indications it appears the restore completed successfully, as I observed that the size of /data on the three nodes was equivalent after the restore. The backup was taken with PBM 2.7.0 and was restored with PBM 2.11.0 (I don't think this is a problem, as backward compatibility is supported per the PBM docs).

However, what I observed is that the pbm-agent on one of the three nodes (host-mongod-rs01-02) did not log anything to say the recovery completed successfully, even though the timestamp of the mongod.lock generated on the problematic node aligns with one of the other working nodes. Also, when I checked pbm.restore.log on the problematic node, I see indications of the recovery completing successfully.

My concern right now is why the pbm-agent on the problematic node was not shut down, or why the cleanup was incomplete: when I checked the logs after the restore, it appeared the restore was still ongoing, which was a bit misleading until I validated against /data and pbm.restore.log. Could this be a bug? I'm not sure why the behaviour wasn't consistent with the other nodes. I had to shut down the pbm-agent manually on this node. I will share logs of this; hope it helps. If this is a bug, please guide me on how to create a ticket for it.
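For context, my check against pbm.restore.log was essentially a grep for the WiredTiger recovery message. A self-contained sketch of that check, run here against a sample log line instead of the real /data/pbm.restore.log:

```shell
# Sketch of the pbm.restore.log validation; on a real node this would be:
#   grep -i "recovery was completed" /data/pbm.restore.log
log=$(mktemp)
cat > "$log" <<'EOF'
{"s":"I","c":"WTRECOV","msg":"WiredTiger message","attr":{"message":{"msg":"recovery was completed successfully and took 354ms"}}}
EOF
grep -ci "recovery was completed successfully" "$log"   # prints 1 when recovery finished
rm -f "$log"
```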

PBM version:

root@host-rs01-02:~# pbm status
Cluster:
========
rs01:
  - host-mongod-rs01-01.com:27018 [S]: pbm-agent [v2.11.0] OK
  - host-mongod-rs01-02.com:27018 [P]: pbm-agent [v2.11.0] OK
  - host-mongod-rs01-03.com:27018 [S]: pbm-agent [v2.11.0] OK

Problematic node (host-mongod-rs01-02):

root@host-mongod-rs01-02:~# journalctl -u pbm-agent -f
Feb 04 09:25:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:25:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/rs.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:27:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:27:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/rs.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:29:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:29:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 09:31:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:31:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]

root@host-mongod-rs01-02:/data# grep -i complete pbm.restore.log | tail
{"t":{"$date":"2026-02-04T00:54:11.561+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166451,"ts_usec":561782,"thread":"1417953:0x7492fd855d40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 354ms, including 293ms for the log replay, 47ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:13.042+00:00"},"s":"I",  "c":"STORAGE",  "id":7393205, "ctx":"initandlisten","msg":"Collection sampling complete","attr":{"namespace":"local.oplog.rs"}}
{"t":{"$date":"2026-02-04T00:54:13.462+00:00"},"s":"I",  "c":"CONTROL",  "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2522 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"76 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"1 ms","Rebuild indexes for collections":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"0 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up cluster time keys manager with a local/direct keys client":"0 ms","Start up the replication coordinator":"30 ms","Create an oplog view for tenant migrations":"0 ms","Start transport layer":"0 ms","_initAndListen total elapsed time":"2803 ms"}}}}
{"t":{"$date":"2026-02-04T00:54:13.469+00:00"},"s":"I",  "c":"REPL",     "id":21331,   "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted","attr":{"term":24}}
root@host-mongod-rs01-02:/data# cat pbm.restore.log | grep -i error
{"t":{"$date":"2026-02-04T00:54:10.660+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=126993M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,remove=true,path=journal,compressor=snappy),builtin_extension_config=(zstd=(compression_level=6)),file_manager=(close_idle_time=600,close_scan_interval=10,close_handle_minimum=2000),statistics_log=(wait=0),json_output=(error,message),verbose=[recovery_progress:1,checkpoint_progress:1,compact_progress:1,backup:0,checkpoint:0,compact:0,evict:0,history_store:0,recovery:0,rts:0,salvage:0,tiered:0,timestamp:0,transaction:0,verify:0,log:0],"}}
{"t":{"$date":"2026-02-04T00:54:13.424+00:00"},"s":"I",  "c":"-",        "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.","nextWakeupMillis":200}}
{"t":{"$date":"2026-02-04T00:54:13.463+00:00"},"s":"I",  "c":"REPL",     "id":21392,   "ctx":"ReplCoord-0","msg":"New replica set config in use","attr":{"config":{"_id":"rs01","version":1,"members":[{"_id":0,"host":"localhost:27048","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1}],"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0}}}}}
{"t":{"$date":"2026-02-04T10:31:11.430+00:00"},"s":"I",  "c":"CONTROL",  "id":23377,   "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
root@host-mongod-rs01-02:/data# ls -alrth
-rw-------  1 mongod mongod    8 Feb  4 00:54 mongod.lock
-rw-------  1 mongod mongod 341K Feb  4 10:03 pbm.restore.log

Working node 1 (host-mongod-rs01-03):

root@host-mongod-rs01-03:~# journalctl -u pbm-agent -f
Feb 04 00:55:41 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:41.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] wait for cluster status
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] no cleanup strategy to apply
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I [restore/2026-02-03T21:15:20.651401726Z] recovery successfully finished
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I change stream was closed
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] heartbeats stopped
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.cleanup.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 D [agentCheckup] deleting agent status
Feb 04 00:55:46 host-mongod-rs01-03.com pbm-agent[1332588]: 2026-02-04T00:55:46.000+0000 I Exit: <nil>
Feb 04 00:55:46 host-mongod-rs01-03.com systemd[1]: pbm-agent.service: Deactivated successfully.
Feb 04 00:55:46 host-mongod-rs01-03.com systemd[1]: pbm-agent.service: Consumed 6h 42min 34.757s CPU time.
root@host-mongod-rs01-03:/data# grep -i recovery pbm.restore.log | grep -i complete
{"t":{"$date":"2026-02-04T00:54:41.217+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166481,"ts_usec":217986,"thread":"1343693:0x71bf9fedbd40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 363ms, including 301ms for the log replay, 47ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:43.162+00:00"},"s":"I",  "c":"CONTROL",  "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2619 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"77 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"1 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"0 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up the replication coordinator":"2 ms","Start transport layer":"1 ms","_initAndListen total elapsed time":"2857 ms"}}}}
{"t":{"$date":"2026-02-04T00:54:45.655+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":655560,"thread":"1343693:0x71bf8deb7640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 112ms, including 0ms for the rollback to stable, and 112ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:45.983+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":983563,"thread":"1343693:0x71bf8deb7640","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 301ms, including 241ms for the log replay, 45ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:54:45.995+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread40","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166485,"ts_usec":995698,"thread":"1343693:0x71bf8deb7640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 6ms, including 0ms for the rollback to stable, and 6ms for the checkpoint."}}}
root@host-mongod-rs01-03:/data# ls -alrth
-rw-------  1 mongod mongod    0 Feb  4 00:54 mongod.lock
-rw-------  1 mongod mongod  56K Feb  4 00:54 pbm.restore.log
root@host-mongod-rs01-03:/data#

Working node 2 (host-mongod-rs01-01):

Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] no cleanup strategy to apply
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 I [restore/2026-02-03T21:15:20.651401726Z] recovery successfully finished
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 I change stream was closed
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] heartbeats stopped
Feb 04 00:55:46 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:46.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/cluster.cleanup.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] cleaning heartbeats stopped
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 I Exit: <nil>
Feb 04 00:55:47 host-mongod-rs01-01.com pbm-agent[1426289]: 2026-02-04T00:55:47.000+0000 D [agentCheckup] deleting agent status
Feb 04 00:55:47 host-mongod-rs01-01.com systemd[1]: pbm-agent.service: Deactivated successfully.
Feb 04 00:55:47 host-mongod-rs01-01.com systemd[1]: pbm-agent.service: Consumed 6h
root@host-mongod-rs01-01:/data# grep -i recovery pbm.restore.log | grep -i complete
{"t":{"$date":"2026-02-04T00:48:37.598+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166117,"ts_usec":598247,"thread":"1434838:0x7c533c0d9d40","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 374ms, including 308ms for the log replay, 50ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:39.553+00:00"},"s":"I",  "c":"CONTROL",  "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"2642 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"78 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"0 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","Build user and roles graph":"0 ms","Verify indexes for admin.system.users collection":"0 ms","Verify indexes for admin.system.roles collection":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"1 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up the replication coordinator":"1 ms","Start transport layer":"0 ms","_initAndListen total elapsed time":"2881 ms"}}}}
{"t":{"$date":"2026-02-04T00:48:42.022+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":22625,"thread":"1434838:0x7c532a1b6640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 114ms, including 0ms for the rollback to stable, and 113ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:42.366+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":366939,"thread":"1434838:0x7c532a1b6640","session_name":"txn-recover","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"recovery was completed successfully and took 315ms, including 255ms for the log replay, 45ms for the rollback to stable, and 14ms for the checkpoint."}}}
{"t":{"$date":"2026-02-04T00:48:42.379+00:00"},"s":"I",  "c":"WTRECOV",  "id":22430,   "ctx":"thread39","msg":"WiredTiger message","attr":{"message":{"ts_sec":1770166122,"ts_usec":379445,"thread":"1434838:0x7c532a1b6640","session_name":"WT_CONNECTION.close","category":"WT_VERB_RECOVERY_PROGRESS","category_id":30,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"shutdown was completed successfully and took 6ms, including 0ms for the rollback to stable, and 6ms for the checkpoint."}}}
root@host-mongod-rs01-01:/data# ls -alrth
-rw-------  1 mongod mongod    0 Feb  4 00:48 mongod.lock
-rw-------  1 mongod mongod  54K Feb  4 00:48 pbm.restore.log

Hello Daniel, and thanks for describing your case in such detail.

Based on the provided information, it seems you interrupted the physical restore (with PITR), so your restore hadn't finished. From the logs of the "problematic node", it's clear that the agent was still in progress:

Feb 04 09:31:21 host-mongod-rs01-02.com pbm-agent[1405510]: 2026-02-04T09:31:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ".pbm.restore/2026-02-03T21:15:20.651401726Z/rs.rs01/node.host-mongod-rs01-02.com:27018.hb" [size hint: 10 (10.00B); part size: 10000000 (9.54MB)]

That agent (host-mongod-rs01-02.com) is the Primary in that cluster, which also means it was applying the PITR part of the restore, and that can take time.

To be sure that the physical restore is done, you can trace its progress as described here.

In short, there's a special command for that:

pbm describe-restore [restore_name] -c pbm_config.yaml

After the physical restore is complete, all mongod processes, including all agents, should be stopped. The describe-restore command should also report status: done, partly-done, or error.
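For example (a sketch: the restore name is the timestamp inside the [restore/...] tag of the pbm-agent journal lines above, and pbm_config.yaml is assumed to be your remote-storage config):

```shell
# Extract the restore name from a pbm-agent journal line, then describe it.
# The describe-restore call needs the cluster, so it is shown commented out.
line='Feb 04 09:31:21 host pbm-agent[1405510]: 2026-02-04T09:31:21.000+0000 D [restore/2026-02-03T21:15:20.651401726Z] uploading ...'
restore_name=$(printf '%s\n' "$line" | sed -n 's/.*\[restore\/\([^]]*\)\].*/\1/p')
echo "$restore_name"    # prints 2026-02-03T21:15:20.651401726Z
# pbm describe-restore "$restore_name" -c pbm_config.yaml
```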

Please try it again, and if you still have the problem, share the output of the describe-restore command and all the logs from the problematic node.

@Boris_Ilijic thanks for your response. Before terminating the pbm-agent process on the problematic node, I checked the data size under /data on the three nodes and observed they are all the same. I also observed that mongod.lock on the problematic node was generated at the same time as on the host-mongod-rs01-03 node; I guess this made me believe the restore was complete. Running document counts on the collections post-restore also shows equivalent numbers. I guess I can try another restore to validate whether what I observed on the problematic node is indeed normal.
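Side note: one more quick heuristic, based on the assumption that mongod writes its PID into mongod.lock while running and truncates it to zero bytes on a clean shutdown — which would match the 8-byte mongod.lock on the problematic node versus the 0-byte files on the working nodes. A self-contained demo (using a temp file instead of the real /data/mongod.lock):

```shell
# Heuristic (assumption: mongod empties mongod.lock on clean shutdown,
# and keeps its PID there while running):
check_lock() {
  if [ -s "$1" ]; then
    echo "non-empty: mongod likely still running (or stopped uncleanly)"
  else
    echo "empty: clean shutdown"
  fi
}

lock=$(mktemp)
check_lock "$lock"            # prints: empty: clean shutdown
printf '1405510\n' > "$lock"
check_lock "$lock"            # prints: non-empty: mongod likely still running (or stopped uncleanly)
rm -f "$lock"
```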

Alternatively, if it’s inconvenient for you to start a new restore and you are sure all your data has been restored properly, please share the full PBM logs for that restore from all 3 nodes, and we’ll analyze them. Feel free to obfuscate all sensitive names/data from them.

Unfortunately, without additional info, it’s hard to conclude anything more.