PITR stuck after full backup

Hey, I’ve got 3 hosts, each running a mongo container and a pbm container. Full backups are working perfectly, but PITR gets stuck and stops working “randomly” after a full backup. The logs look like this:

Main host (host2):

2026-01-06T07:31:39.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: ...]

2026-01-06T07:31:52.000+0000 D [pitr] expected agents: [host3]; working agents: []

2026-01-06T07:31:52.000+0000 I [pitr] not all ack agents are working, re-configuring pitr members

2026-01-06T07:31:52.000+0000 D [pitr] stop pitr config monitor

2026-01-06T07:31:52.000+0000 D [pitr] stop pitr agent activity monitor

2026-01-06T07:31:52.000+0000 D [pitr] stop pitr topo monitor

2026-01-06T07:31:52.000+0000 D [pitr] stop pitr hb

2026-01-06T07:31:52.000+0000 D [pitr] stop pitr error monitor

2026-01-06T07:31:54.000+0000 D [pitr] waiting for cluster ready status

2026-01-06T07:31:54.000+0000 D [pitr] checking locks in the whole cluster

2026-01-06T07:31:54.000+0000 D [pitr] start pitr hb

2026-01-06T07:31:54.000+0000 D [pitr] start pitr agent activity monitor

2026-01-06T07:31:54.000+0000 D [pitr] start pitr config monitor

2026-01-06T07:31:54.000+0000 D [pitr] start pitr error monitor

2026-01-06T07:31:54.000+0000 D [pitr] start pitr topo monitor

2026-01-06T07:32:09.000+0000 D [pitr] init pitr meta on the first usage

2026-01-06T07:32:09.000+0000 D [pitr] cluster is ready for nomination

2026-01-06T07:32:09.000+0000 D [pitr] reconciling ready status from all agents

2026-01-06T07:32:10.000+0000 D [pitr] waiting pitr nomination

2026-01-06T07:32:11.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3

2026-01-06T07:32:13.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3

the same “agents in ready: 2” line repeats for roughly 600 more lines

host3 (blocked one):

2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: ts1>, opid: opid1

2026-01-06T07:30:01.000+0000 I got epoch {1767615173 758}

2026-01-06T07:30:01.000+0000 I [pitr] got wake_up signal

2026-01-06T07:30:01.000+0000 I [pitr] wake_up for bcp opid1

2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: ts2>, opid: opid2

2026-01-06T07:30:02.000+0000 I got epoch {1767615173 758}

2026-01-06T07:30:02.000+0000 I [backup/2026-01-06T07:30:01Z] backup started

2026-01-06T07:30:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: opid1]

2026-01-06T07:30:05.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106070810-1.20260106073002-36.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}

pbm status output:

Cluster:

========

projects:

  - host1 (works fine) [S]: pbm-agent [v2.11.0] OK

  - host2 (main_host) [P]: pbm-agent [v2.11.0] OK

  - host3 (malfunctioning_host) [S]: pbm-agent [v2.11.0] OK





PITR incremental backup:

========================

Status [ON]

! ERROR while running PITR backup: 2026-01-08T09:37:21.000+0000 E [main_host] [pitr] reconciling ready status: timeout while reconciling ready status



Currently running:

==================

(none)


It looks to me like host3 started PITR, then paused it for the full backup and got stuck. After the full backup, PITR was never unpaused (or something similar), and the main host tried to recover host3, which didn’t work?
I hope you can help me with this, since restarting all the containers doesn’t help. It works fine for 3-4 days and then gets stuck again.

Thanks @sailleacha_sailleach for posting! One of our engineers will follow up on your question shortly :slight_smile:

Hey, thanks for using PBM and for sharing your issue.

The error you can see in the output of the pbm status command means that PBM cannot restart the PITR procedure because one agent didn’t acknowledge within the nomination process. We need to see what’s happening on that agent (host3 in this case), so can you please share its full logs? In particular, I’d like to know whether the backup finished successfully there or whether it’s stuck on this line:

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}
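
As a side note, if collecting container logs is awkward, you can also pull that agent’s entries with the pbm CLI; something like the following, where the node address/port is a placeholder you’d adjust to your setup (the replica set name "projects" is taken from your status output):

# Fetch recent debug-level log entries for the host3 agent via the PBM CLI.
pbm logs --tail=500 --severity=D --node=projects/host3:27017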

Sorry, my bad. I wanted to post the full logs but they got cut in half:

2026-01-06T06:08:10.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106050810-1.20260106060810-4.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]

2026-01-06T06:08:11.000+0000 I [pitr] created chunk 2026-01-06T05:08:10 - 2026-01-06T06:08:10. Next chunk creation scheduled to begin at ~2026-01-06 07:08:11.457304901 +0000 UTC m=+73217.487045165

2026-01-06T07:01:01.000+0000 I got command delete <ts: 1767682861>, opid: op1

2026-01-06T07:01:01.000+0000 I got epoch {1767615173 758}

2026-01-06T07:01:01.000+0000 D [delete] get lock: duplicate operation: op1 [Delete]

2026-01-06T07:01:01.000+0000 D [delete] skip: lock not acquired

2026-01-06T07:08:10.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106060810-4.20260106070810-1.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]

2026-01-06T07:08:11.000+0000 I [pitr] created chunk 2026-01-06T06:08:10 - 2026-01-06T07:08:10. Next chunk creation scheduled to begin at ~2026-01-06 08:08:11.788892736 +0000 UTC m=+76817.818633000

2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: 1767684601>, opid: op2

2026-01-06T07:30:01.000+0000 I got epoch {1767615173 758}

2026-01-06T07:30:01.000+0000 I [pitr] got wake_up signal

2026-01-06T07:30:01.000+0000 I [pitr] wake_up for bcp op2

2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: 1767684602>, opid: op3

2026-01-06T07:30:02.000+0000 I got epoch {1767615173 758}

2026-01-06T07:30:02.000+0000 I [backup/2026-01-06T07:30:01Z] backup started

2026-01-06T07:30:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:30:05.000+0000 D [pitr] uploading "pbmPitr/projects/20260106/20260106070810-1.20260106073002-36.oplog.s2" [size hint: -1 (unknown); part size: 10485760 (10.00MB)]

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] wait for tmp users {1767684606 13}

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] dumping up to 2 collections in parallel

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 32768 (32.00KB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 32768 (32.00KB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:06.000+0000 I [pitr] created chunk 2026-01-06T07:08:10 - 2026-01-06T07:30:02

2026-01-06T07:30:06.000+0000 I [pitr] pausing/stopping with last_ts 2026-01-06 07:30:02 +0000 UTC

2026-01-06T07:30:06.000+0000 I [backup/2026-01-06T07:30:01Z] dump ... done (size: 157)

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 1003520 (980.00KB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:06.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 1562)

2026-01-06T07:30:06.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 2049216512 (1.91GB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:07.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 2767006)

2026-01-06T07:30:07.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 45056 (44.00KB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:08.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 38227)

2026-01-06T07:30:08.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 5345280 (5.10MB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:11.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 15858336)

2026-01-06T07:30:11.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 165048320 (157.40MB); part size: 10485760 (10.00MB)]

2026-01-06T07:30:17.000+0000 D [backup/2026-01-06T07:30:02Z] nomination timeout

2026-01-06T07:30:17.000+0000 D [backup/2026-01-06T07:30:02Z] skip after nomination, probably started by another node

2026-01-06T07:30:20.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:30:21.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 554754544)

... dump + uploading only ...

2026-01-06T07:30:26.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection "admin.pbmPITRChunks" done (size: 144726)

2026-01-06T07:30:35.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:30:50.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:31:05.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:31:20.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:31:35.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: op2]

2026-01-06T07:31:44.000+0000 I [backup/2026-01-06T07:30:01Z] dump collection ... done (size: 7286456354)

2026-01-06T07:31:44.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 0 (0.00B); part size: 10485760 (10.00MB)]

2026-01-06T07:31:45.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 14832 (14.48KB); part size: 10485760 (10.00MB)]

2026-01-06T07:31:45.000+0000 I [backup/2026-01-06T07:30:01Z] dump finished, waiting for the oplog

2026-01-06T07:31:46.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: 1073741824 (1.00GB); part size: 10485760 (10.00MB)]

2026-01-06T07:31:46.000+0000 I [backup/2026-01-06T07:30:01Z] created chunk 2026-01-06T07:30:02 - 2026-01-06T07:31:46

2026-01-06T07:31:46.000+0000 I [backup/2026-01-06T07:30:01Z] dropping tmp collections

2026-01-06T07:31:47.000+0000 D [backup/2026-01-06T07:30:01Z] uploading ... [size hint: -1 (unknown); part size: 10485760 (10.00MB)]

2026-01-06T07:31:49.000+0000 I [backup/2026-01-06T07:30:01Z] backup finished

2026-01-06T07:31:49.000+0000 D [backup/2026-01-06T07:30:01Z] releasing lock

2026-01-07T07:01:02.000+0000 I got command delete <ts: 1767769261>, opid: op4

To me it looks like:

  1. PITR started
  2. Backup started
  3. Backup paused PITR
  4. Backup finished
  5. PITR got stuck

Thanks for the clarification.

Can you confirm that on the problematic agent (host3) nothing was logged between these two entries?
2026-01-06T07:31:49.000+0000 D [backup/2026-01-06T07:30:01Z] releasing lock
2026-01-07T07:01:02.000+0000 I got command delete <ts: 1767769261>, opid: op4

Besides that, another suspicious part is the two backup commands generated one right after the other:

2026-01-06T07:30:01.000+0000 I got command backup [name: 2026-01-06T07:30:01Z, compression: s2 (level: default)] <ts: 1767684601>, opid: op2
2026-01-06T07:30:02.000+0000 I got command backup [name: 2026-01-06T07:30:02Z, compression: s2 (level: default)] <ts: 1767684602>, opid: op3

Can you clarify how you triggered the backup command in that particular case?

Hey again,
We were able to reproduce the issue on PBM’s side:

pbmagent_rs101 | 2026-01-08T17:45:15.000+0000 I [backup/2026-01-08T17:45:08Z] backup finished
pbmagent_rs101 | 2026-01-08T17:45:15.000+0000 D [backup/2026-01-08T17:45:08Z] releasing lock
pbmagent_rs102 | 2026-01-08T17:45:23.000+0000 D [pitr] waiting for cluster ready status
pbmagent_rs100 | 2026-01-08T17:45:24.000+0000 D [pitr] waiting for cluster ready status
pbmagent_rs100 | 2026-01-08T17:45:24.000+0000 D [pitr] checking locks in the whole cluster
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] init pitr meta on the first usage
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] cluster is ready for nomination
pbmagent_rs100 | 2026-01-08T17:45:39.000+0000 D [pitr] reconciling ready status from all agents
pbmagent_rs100 | 2026-01-08T17:45:40.000+0000 D [pitr] waiting pitr nomination
pbmagent_rs102 | 2026-01-08T17:45:41.000+0000 D [pitr] waiting pitr nomination
pbmagent_rs100 | 2026-01-08T17:45:41.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3
pbmagent_rs100 | 2026-01-08T17:45:43.000+0000 D [pitr] agents in ready: 2; waiting for agents: 3

The issue is reproducible when pbm-cli generates two backup commands almost simultaneously (within about one second of each other). After successfully executing the backup, the agent stays locked and doesn’t perform any PITR/backup operations, as you correctly described above. To make it run again, the agent needs to be restarted, after which PITR should continue to work.
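
For example, if the agent runs as a Docker container, restarting just that one container should be enough to unblock it (the container name here is only a placeholder for your setup):

# Restart only the stuck pbm-agent container on host3; once the agent
# re-registers, PITR nomination should be able to complete again.
docker restart pbm-agent-host3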

A possible workaround is to ensure that two backup commands are never generated at the same time (the issue can only be triggered by two pbm-cli clients running in parallel).
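
Until the fix is released, one defensive option is to wrap the scheduled call so it skips when another operation is already in progress. A rough shell sketch (the compression flag mirrors your config; note there is still a small window between the check and the backup call, so staggering the schedules remains the safer option):

# Skip the backup if a PBM operation is already running.
# "pbm status -s running" prints "(none)" when the cluster is idle.
if pbm status -s running | grep -q '(none)'; then
    pbm backup --compression=s2
else
    echo "pbm operation already in progress, skipping" >&2
fi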

I’ve linked the ticket here so you can track the progress of the resolution:

That’s great, thank you very much!
About how I managed to trigger the backup commands: it’s cron jobs on 2 hosts that start the backup at the same time. Not sure yet, but I’ll investigate it on my side.
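
If that’s confirmed, the fix on my side should be as simple as staggering the two crontabs so the pbm backup commands can never fire in the same second; roughly like this (times and flags are illustrative):

# host1 crontab:
30 7 * * * pbm backup --compression=s2
# host2 crontab, offset by 15 minutes so the two commands never coincide:
45 7 * * * pbm backup --compression=s2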