Hey, I’ve got 3 hosts, each running a mongo container and a pbm container. Full backups are working perfectly, but PITR gets stuck and stops working “randomly” after a full backup. The logs look like this:
Main host (host2):
2026-01-06T07:31:39.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: ...]
2026-01-06T07:31:52.000+0000 D [pitr] expected agents: [host3]; working agents: []
2026-01-06T07:31:52.000+0000 I [pitr] not all ack agents are working, re-configuring pitr members
2026-01-06T07:31:52.000+0000 D [pitr] stop pitr config monitor
2026-01-06T07:31:52.000+0000 D [pitr] stop pitr agent activity monitor
2026-01-06T07:31:52.000+0000 D [pitr] stop pitr topo monitor
2026-01-06T07:31:52.000+0000 D [pitr] stop pitr hb
2026-01-06T07:31:52.000+0000 D [pitr] stop pitr error monitor
2026-01-06T07:31:54.000+0000 D [pitr] waiting for cluster ready status
2026-01-06T07:31:54.000+0000 D [pitr] checking locks in the whole cluster
2026-01-06T07:31:54.000+0000 D [pitr] start pitr hb
2026-01-06T07:31:54.000+0000 D [pitr] start pitr agent activity monitor
2026-01-06T07:31:54.000+0000 D [pitr] start pitr config monitor
2026-01-06T07:31:54.000+0000 D [pitr] start pitr error monitor
2026-01-06T07:31:54.000+0000 D [pitr] start pitr topo monitor
2026-01-06T07:32:09.000+0000 D [pitr] init pitr meta on the first usage
2026-01-06T07:32:09.000+0000 D [pitr] cluster is ready for nomination
2026-01-06T07:32:09.000+0000 D [pitr] reconciling ready status from all agents
2026-01-06T07:32:10.000+0000 D [pitr] waiting pitr nomination
2026-01-06T07:32:11.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3
2026-01-06T07:32:13.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3
(the same “agents in ready: 2” message repeats for roughly 600 lines)
host3 (blocked one):
2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: ts1>, opid: opid1
2026-01-06T07:30:01.000+0000 I got epoch {1767615173 758}
2026-01-06T07:30:01.000+0000 I [pitr] got wake_up signal
2026-01-06T07:30:01.000+0000 I [pitr] wake_up for bcp opid1
2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: ts2>, opid: opid2
2026-01-06T07:30:02.000+0000 I got epoch {1767615173 758}
2026-01-06T07:30:02.000+0000 I [backup/2026-01-06T07:30:01Z] backup started
2026-01-06T07:30:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: opid1]
2026-01-06T07:30:05.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106070810-1.20260106073002-36.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}
pbm status:
Cluster:
========
projects:
- host1 (works fine) [S]: pbm-agent [v2.11.0] OK
- host2 (main_host) [P]: pbm-agent [v2.11.0] OK
- host3 (malfunctioning host) [S]: pbm-agent [v2.11.0] OK
PITR incremental backup:
========================
Status [ON]
! ERROR while running PITR backup: 2026-01-08T09:37:21.000+0000 E [main_host] [pitr] reconciling ready status: timeout while reconciling ready status
Currently running:
==================
(none)
To me it looks like host3 got stuck: it started PITR, paused it for the full backup, and then after the full backup PITR was never unpaused (or something similar), so the main host tried to recover host3, which didn’t work?
Hope you can help me with this, since restarting all the containers doesn’t help. It works fine for 3–4 days and then gets stuck again.
Thanks @sailleacha_sailleach for posting! One of our engineers will follow up on your question shortly.
Hey, thanks for using PBM and for sharing your issue.
The error you see in the output of the pbm status command means that PBM cannot restart the PITR procedure because one agent didn’t acknowledge within the nomination process. We need to see what’s happening with that agent (host3 in this case), so could you please share its full logs? In particular, I’d like to know whether the backup finished successfully there or whether it’s stuck on the line:
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}
Sorry, my bad. I wanted to post the full logs, but they got cut in half:
2026-01-06T06:08:10.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106050810-1.20260106060810-4.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2026-01-06T06:08:11.000+0000 I [pitr] created chunk 2026-01-06T05:08:10 - 2026-01-06T06:08:10. Next chunk creation scheduled to begin at ~2026-01-06 07:08:11.457304901 +0000 UTC m=+73217.487045165
2026-01-06T07:01:01.000+0000 I got command delete <ts: 1767682861>, opid: op1
2026-01-06T07:01:01.000+0000 I got epoch {1767615173 758}
2026-01-06T07:01:01.000+0000 D [delete] get lock: duplicate operation: op1 [Delete]
2026-01-06T07:01:01.000+0000 D [delete] skip: lock not acquired
2026-01-06T07:08:10.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106060810-4.20260106070810-1.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2026-01-06T07:08:11.000+0000 I [pitr] created chunk 2026-01-06T06:08:10 - 2026-01-06T07:08:10. Next chunk creation scheduled to begin at ~2026-01-06 08:08:11.788892736 +0000 UTC m=+76817.818633000
2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: 1767684601>, opid: op2
2026-01-06T07:30:01.000+0000 I got epoch {1767615173 758}
2026-01-06T07:30:01.000+0000 I [pitr] got wake_up signal
2026-01-06T07:30:01.000+0000 I [pitr] wake_up for bcp op2
2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: 1767684602>, opid: op3
2026-01-06T07:30:02.000+0000 I got epoch {1767615173 758}
2026-01-06T07:30:02.000+0000 I [backup/2026-01-06T07:30:01Z] backup started
2026-01-06T07:30:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:30:05.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106070810-1.20260106073002-36.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] dumping up to 2 collections in parallel
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 32768 (32.00KB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 32768 (32.00KB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:06.000+0000 I [pitr] created chunk 2026-01-06T07:08:10 - 2026-01-06T07:30:02
2026-01-06T07:30:06.000+0000 I [pitr] pausing/stopping with last_ts 2026-01-06 07:30:02 +0000 UTC
2026-01-06T07:30:06.000+0000 I [backup/2026-01-06T07:30:01Z] dump ... done (size: 157)
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 1003520 (980.00KB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:06.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 1562)
2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 2049216512 (1.91GB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:07.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 2767006)
2026-01-06T07:30:07.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 45056 (44.00KB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:08.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 38227)
2026-01-06T07:30:08.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 5345280 (5.10MB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:11.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 15858336)
2026-01-06T07:30:11.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 165048320 (157.40MB); part size: 10485760 (10.00MB)]
2026-01-06T07:30:17.000+0000 D [backup/2026-01-06T07:30:02Z] nomination timeout
2026-01-06T07:30:17.000+0000 D [backup/2026-01-06T07:30:02Z] skip after nomination, probably started by another node
2026-01-06T07:30:20.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:30:21.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 554754544)
... dump + uploading only ...
2026-01-06T07:30:26.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection "admin.pbmPITRChunks" done (size: 144726)
2026-01-06T07:30:35.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:30:50.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:31:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:31:20.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:31:35.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]
2026-01-06T07:31:44.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 7286456354)
2026-01-06T07:31:44.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 0 (0.00B); part size: 10485760 (10.00MB)]
2026-01-06T07:31:45.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 14832 (14.48KB); part size: 10485760 (10.00MB)]
2026-01-06T07:31:45.000+0000 I [backup/2026-01-06T07:30:01Z] dump finished, waiting for the oplog
2026-01-06T07:31:46.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 1073741824 (1.00GB); part size: 10485760 (10.00MB)]
2026-01-06T07:31:46.000+0000 I [backup/2026-01-06T07:30:01Z] created chunk 2026-01-06T07:30:02 - 2026-01-06T07:31:46
2026-01-06T07:31:46.000+0000 I [backup/2026-01-06T07:30:01Z] dropping tmp collections
2026-01-06T07:31:47.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: -1 (unknown); part size: 10485760 (10.00MB)]
2026-01-06T07:31:49.000+0000 I [backup/2026-01-06T07:30:01Z] backup finished
2026-01-06T07:31:49.000+0000 D [backup/2026-01-06T07:30:01Z] releasing lock
2026-01-07T07:01:02.000+0000 I got command delete <ts: 1767769261>, opid: op4
To me it looks like:
- PITR started
- Backup started
- Backup paused PITR
- Backup finished
- PITR got stuck
Thanks for the clarification.
Can you confirm that on the problematic agent (host3) nothing was logged between these two log entries?
2026-01-06T07:31:49.000+0000 D [backup/2026-01-06T07:30:01Z] releasing lock
2026-01-07T07:01:02.000+0000 I got command delete <ts: 1767769261>, opid: op4
Besides that, a suspicious part is the two backup commands generated one right after the other:
2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: 1767684601>, opid: op2
2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: 1767684602>, opid: op3
Can you clarify how you triggered the backup command in that particular case?
Hey again,
We were able to reproduce the issue on PBM’s side:
pbmagent_rs101 | 2026-01-08T17:45:15.000+0000 I [backup/2026-01-08T17:45:08Z] backup finished
pbmagent_rs101 | 2026-01-08T17:45:15.000+0000 D [backup/2026-01-08T17:45:08Z] releasing lock
pbmagent_rs102 | 2026-01-08T17:45:23.000+0000 D [pitr] waiting for cluster ready status
pbmagent_rs100 | 2026-01-08T17:45:24.000+0000 D [pitr] waiting for cluster ready status
pbmagent_rs100 | 2026-01-08T17:45:24.000+0000 D [pitr] checking locks in the whole cluster
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] init pitr meta on the first usage
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] cluster is ready for nomination
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] reconciling ready status from all agents
pbmagent_rs100 | 2026-01-08T17:45:40.000+0000 D [pitr] waiting pitr nomination
pbmagent_rs102 | 2026-01-08T17:45:41.000+0000 D [pitr] waiting pitr nomination
pbmagent_rs100 | 2026-01-08T17:45:41.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3
pbmagent_rs100 | 2026-01-08T17:45:43.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3
…
It’s possible to reproduce it when pbm-cli generates the backup commands almost simultaneously (within about a 1-second difference). After successfully executing the backup, that agent stays locked and doesn’t perform any PITR/backup operations, as you correctly described above. To make it run again, the agent needs to be restarted; PITR should then continue to work.
A possible workaround is to ensure that two backup commands are never generated at the same time (the issue can only be triggered by two pbm-cli clients running in parallel).
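One way to make sure a host never fires a second trigger while one is still running is to wrap the command in flock(1). A minimal sketch, using an echo/sleep placeholder where the real `pbm backup …` invocation would go, and a hypothetical lock file `/tmp/pbm-backup.lock`; note flock only serializes triggers on the same host, so for triggers coming from two different hosts you still need to offset their schedules:

```shell
#!/bin/sh
# Serialize backup triggers on one host via an exclusive file lock.
run_backup() {
    # -n: fail immediately instead of queueing a duplicate backup;
    # replace the echo/sleep placeholder with the real `pbm backup ...`
    flock -n /tmp/pbm-backup.lock -c 'echo "backup trigger fired"; sleep 2' \
        || echo "skipped: another backup trigger is already running"
}

# Simulate the race: two near-simultaneous triggers, only one proceeds.
run_backup &
run_backup
wait
```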
I’ve linked the ticket here so you can track the progress of the resolution:
That’s great, thank you very much
About “how did I trigger the backup commands”: it’s cron jobs on 2 hosts that started the backup at the same time. I’m not sure yet, but I’ll investigate it on my side.
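Since the bug needs the two backup commands to land within about a second of each other, simply offsetting the two cron schedules should avoid the race. A hypothetical sketch (times and compression option are examples only, not the thread’s actual crontab):

```shell
# crontab on host A -- fires the daily backup at 07:30
30 7 * * * pbm backup --compression=s2

# crontab on host B -- offset by 15 minutes; if the first backup is
# still running at 07:45, PBM's own lock should refuse this trigger
# instead of racing it
45 7 * * * pbm backup --compression=s2
```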