Another convergence timeout

I’m trying to figure out why I’m getting a convergence timeout…

Backing up a sharded MongoDB 4.4 cluster to S3.

# pbm status
Cluster:
========
www-config:
  - www-config/172.26.96.87:27017 [S]: pbm-agent v2.5.0 OK
  - www-config/172.26.96.176:27017 [S]: pbm-agent v2.5.0 OK
  - www-config/172.26.97.76:27017 [P]: pbm-agent v2.5.0 OK
rs-wwww-shard-001:
  - rs-wwww-shard-001/172.26.97.51:27017 [S]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-001/172.26.96.92:27017 [S]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-001/172.26.96.133:27017 [P]: pbm-agent v2.5.0 OK
rs-wwww-shard-002:
  - rs-wwww-shard-002/172.26.96.31:27017 [S]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-002/172.26.97.40:27017 [S]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-002/172.26.96.253:27017 [P]: pbm-agent v2.5.0 OK
rs-wwww-shard-000:
  - rs-wwww-shard-000/172.26.97.42:27017 [P]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-000/172.26.96.190:27017 [S]: pbm-agent v2.5.0 OK
  - rs-wwww-shard-000/172.26.96.44:27017 [S]: pbm-agent v2.5.0 OK
S3 eu-central-1 s3://https://s3.dualstack.eu-central-1.amazonaws.com/removed/mongodb/www-shard
  Snapshots:
    2024-07-24T22:00:00Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout] [2024-07-24T22:05:03Z]
    2024-07-16T22:00:09Z 313.73GB <logical> [restore_to_time: 2024-07-17T01:37:39Z]

(Backups are scheduled to run daily, but they were failing until a PBM agent restart on all nodes yesterday.)

Something else: pbm status still shows this PITR error:

PITR incremental backup:
========================
Status [ON]
! ERROR while running PITR backup: 2024-07-24T23:59:50.000+0200 E [rs-wwww-shard-000/172.26.96.190:27017] [pitr] streaming oplog: oplog has insufficient range, some records since the last saved ts {1721180259 1} are missing. Run `pbm backup` to create a valid starting point for the PITR; 2024-07-24T23:59:59.000+0200 E [rs-wwww-shard-001/172.26.96.133:27017] [pitr] streaming oplog: oplog has insufficient range, some records since the last saved ts {1721180259 1} are missing. Run `pbm backup` to create a valid starting point for the PITR; 2024-07-24T23:59:56.000+0200 E [rs-wwww-shard-002/172.26.96.31:27017] [pitr] streaming oplog: oplog has insufficient range, some records since the last saved ts {1721180259 1} are missing. Run `pbm backup` to create a valid starting point for the PITR

Currently running:
==================
Snapshot backup "2024-07-24T22:00:00Z", started at 2024-07-24T22:00:00Z. Status: error. [op id: 66a1796038f60765b473e6aa]

The lock does not seem to be released when the convergence timeout occurs? (The PITR complaining about the lock every 15s makes the log entries from the start of the backup hard to find in the pbm logs output: -t 10000 is not enough entries, -t 100000 does not return in a reasonable time, and -t 50000 seems to work.)
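(A workaround to cut that noise, just filtering the repeating PITR lines with plain grep; nothing PBM-specific, and the -t value is whatever turns out to be enough:)

pbm logs -t 50000 | grep -v '\[pitr\]'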

pbm logs output around the start of the backup:

2024-07-24T22:00:00Z I [rs-wwww-shard-000/172.26.96.44:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [rs-wwww-shard-001/172.26.96.133:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [rs-wwww-shard-000/172.26.96.44:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [rs-wwww-shard-001/172.26.96.133:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [www-config/172.26.96.87:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [www-config/172.26.96.87:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [www-config/172.26.97.76:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [www-config/172.26.97.76:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [rs-wwww-shard-002/172.26.96.253:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [www-config/172.26.96.176:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [rs-wwww-shard-002/172.26.96.253:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [www-config/172.26.96.176:27017] got epoch {1721734315 330}
2024-07-24T22:00:00Z I [www-config/172.26.96.176:27017] [pitr] got wake_up signal
2024-07-24T22:00:00Z I [www-config/172.26.96.176:27017] [pitr] wake_up for bcp 66a1796038f60765b473e6aa
2024-07-24T22:00:00Z I [rs-wwww-shard-000/172.26.96.190:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:00Z I [rs-wwww-shard-000/172.26.96.190:27017] got epoch {1721734315 330}
2024-07-24T22:00:01Z I [rs-wwww-shard-002/172.26.96.31:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:01Z I [rs-wwww-shard-002/172.26.96.31:27017] got epoch {1721734315 330}
2024-07-24T22:00:01Z I [rs-wwww-shard-002/172.26.97.40:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:01Z I [rs-wwww-shard-002/172.26.97.40:27017] got epoch {1721734315 330}
2024-07-24T22:00:01Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] backup started
2024-07-24T22:00:01Z I [rs-wwww-shard-000/172.26.97.42:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:01Z I [rs-wwww-shard-000/172.26.97.42:27017] got epoch {1721734315 330}
2024-07-24T22:00:01Z I [rs-wwww-shard-001/172.26.97.51:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:01Z I [rs-wwww-shard-001/172.26.97.51:27017] got epoch {1721734315 330}
2024-07-24T22:00:01Z I [rs-wwww-shard-001/172.26.96.92:27017] got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-24T22:00:01Z I [rs-wwww-shard-001/172.26.96.92:27017] got epoch {1721734315 330}
2024-07-24T22:00:03Z I [rs-wwww-shard-000/172.26.97.42:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
2024-07-24T22:00:04Z I [www-config/172.26.96.176:27017] [pitr] created chunk 2024-07-24T21:55:09 - 2024-07-24T22:00:01
2024-07-24T22:00:04Z I [www-config/172.26.96.176:27017] [pitr] pausing/stopping with last_ts 2024-07-24 22:00:01 +0000 UTC
2024-07-24T22:00:05Z I [rs-wwww-shard-000/172.26.96.190:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
<snip several similar messages, for all shards>
2024-07-24T22:04:59Z I [www-config/172.26.97.76:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
2024-07-24T22:05:02Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] mark RS as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-24T22:05:03Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] mark backup as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-24T22:05:03Z E [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] backup: couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout
2024-07-24T22:05:03Z I [rs-wwww-shard-000/172.26.97.42:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
2024-07-24T22:05:05Z I [rs-wwww-shard-000/172.26.96.190:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]

Journal on selected configsvr node:

2024-07-25T00:00:00.000+0200 I got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-25T00:00:00.000+0200 I got epoch {1721734315 330}
2024-07-25T00:00:01.000+0200 I [backup/2024-07-24T22:00:00Z] backup started
2024-07-25T00:00:01.000+0200 D [backup/2024-07-24T22:00:00Z] waiting for balancer off
2024-07-25T00:00:01.000+0200 D [backup/2024-07-24T22:00:00Z] balancer status: off
2024-07-25T00:00:08.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
<snip PITR messages>
2024-07-25T00:04:54.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]
2024-07-25T00:05:02.000+0200 I [backup/2024-07-24T22:00:00Z] mark RS as error `couldn't get response from all shards: convergeClusterW>
2024-07-25T00:05:03.000+0200 I [backup/2024-07-24T22:00:00Z] mark backup as error `couldn't get response from all shards: convergeClus>
2024-07-25T00:05:03.000+0200 D [backup/2024-07-24T22:00:00Z] set balancer on
2024-07-25T00:05:03.000+0200 E [backup/2024-07-24T22:00:00Z] backup: couldn't get response from all shards: convergeClusterWithTimeout>
2024-07-25T00:05:03.000+0200 D [backup/2024-07-24T22:00:00Z] releasing lock
2024-07-25T00:05:09.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]

Journal on the non-selected (PRIMARY) configsvr node (some lines truncated by less):

2024-07-25T00:00:00.000+0200 I got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-25T00:00:00.000+0200 I got epoch {1721734315 330}
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] init backup meta
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-000: [[172.26.96.190:27017 172.26.96.44>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination list for www-config: [[172.26.96.87:27017 172.26.96.176:27017]>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-001: [[172.26.97.51:27017 172.26.96.92:>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-002: [[172.26.96.31:27017 172.26.97.40:>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-000, set candidates [172.26.96.190:27017 172.26.>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-001, set candidates [172.26.97.51:27017 172.26.9>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-002, set candidates [172.26.96.31:27017 172.26.9>
2024-07-25T00:00:00.000+0200 D [backup/2024-07-24T22:00:00Z] nomination www-config, set candidates [172.26.96.87:27017 172.26.96.176:2>
2024-07-25T00:00:05.000+0200 D [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-001 won by 172.26.97.51:27017
2024-07-25T00:00:05.000+0200 D [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-000 won by 172.26.96.44:27017
2024-07-25T00:00:05.000+0200 D [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-002 won by 172.26.96.31:27017
2024-07-25T00:00:05.000+0200 D [backup/2024-07-24T22:00:00Z] bcp nomination: www-config won by 172.26.96.87:27017
2024-07-25T00:00:06.000+0200 D [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-25T00:00:14.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]

The journal for all the shards looks something like this on the nodes that got selected:

2024-07-25T00:00:00.000+0200 I got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-25T00:00:00.000+0200 I got epoch {1721734315 330}
2024-07-25T00:00:09.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]

And like this on the ones that did not get selected:

2024-07-25T00:00:01.000+0200 I got command backup [name: 2024-07-24T22:00:00Z, compression: zstd (level: default)] <ts: 1721858400>
2024-07-25T00:00:01.000+0200 I got epoch {1721734315 330}
2024-07-25T00:00:02.000+0200 D [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-25T00:00:03.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a1796038f60765b473e6aa]

How do I go about resolving this?

Is there a way to get the locks released automatically if the backup fails?

Is there a way to see which shard it is waiting for? (And what is the “convergence” that it is waiting for? I did take a look at the code, but it needs more digging.)

(I did increase the timeout to try to resolve this, which is why it waits 5 minutes.)
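From a first pass over the code, my understanding is that the “convergence” is the leader agent (on the config replica set) waiting for every replica set to report the expected backup status in the backup metadata within the timeout; if a nominated shard agent never reports, the leader gives up with the converge timeout. A hedged way to look at which replica set has not reported, assuming the PBM control collections live in the admin database on the configsvr replica set and that pbmBackups holds the per-backup metadata:

// run against the configsvr replica set (assumption: control collections are in `admin`)
use admin
db.pbmBackups.find(
  { name: "2024-07-24T22:00:00Z" },
  { status: 1, error: 1, "replsets.name": 1, "replsets.status": 1, "replsets.error": 1 }
).pretty()
// a replsets entry that never moves past its initial status would be the one convergence waits on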

Edit:
It does seem like there might have been an old lock in the DB when that backup started (the epoch timestamp in those entries is on the 23rd):

www-config:PRIMARY> db.pbmLock.find()
{ "_id" : ObjectId("66a17961fb6c6fa3e6bff3d5"), "type" : "backup", "replset" : "rs-wwww-shard-000", "node" : "172.26.96.44:27017", "opid" : "66a1796038f60765b473e6aa", "epoch" : Timestamp(1721734315, 330), "hb" : Timestamp(1721912301, 3) }
{ "_id" : ObjectId("66a179614e2d5c602af61e6e"), "type" : "backup", "replset" : "rs-wwww-shard-002", "node" : "172.26.96.31:27017", "opid" : "66a1796038f60765b473e6aa", "epoch" : Timestamp(1721734315, 330), "hb" : Timestamp(1721912286, 518) }
{ "_id" : ObjectId("66a1796299926193d225b081"), "type" : "backup", "replset" : "rs-wwww-shard-001", "node" : "172.26.97.51:27017", "opid" : "66a1796038f60765b473e6aa", "epoch" : Timestamp(1721734315, 330), "hb" : Timestamp(1721912286, 520) }
www-config:PRIMARY>  db.pbmLock.deleteMany({})
{ "acknowledged" : true, "deletedCount" : 3 }

(It had entries for all shards, but not for the configsvr replica set.) I started a backup again, and it still fails with the same error. pbm logs output for this one:

2024-07-25T13:02:53Z I [rs-wwww-shard-000/172.26.96.44:27017] [pitr] streaming started from 2024-07-17 01:37:39 +0000 UTC / 1721180259
2024-07-25T13:02:53Z E [rs-wwww-shard-000/172.26.96.44:27017] [pitr] streaming oplog: oplog has insufficient range, some records since the last saved ts {1721180259 1} are missing. Run `pbm backup` to create a valid starting point for the PITR
2024-07-25T13:03:10Z I [www-config/172.26.96.176:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.96.190:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.96.190:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [www-config/172.26.96.176:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.96.44:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.96.44:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.97.42:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-000/172.26.97.42:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.97.40:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.97.40:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.96.253:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.96.253:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-001/172.26.96.133:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-001/172.26.96.133:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.96.31:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-002/172.26.96.31:27017] got epoch {1721734315 330}
2024-07-25T13:03:10Z I [rs-wwww-shard-001/172.26.97.51:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:10Z I [rs-wwww-shard-001/172.26.97.51:27017] got epoch {1721734315 330}
2024-07-25T13:03:11Z I [rs-wwww-shard-001/172.26.96.92:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:11Z I [rs-wwww-shard-001/172.26.96.92:27017] got epoch {1721734315 330}
2024-07-25T13:03:11Z I [www-config/172.26.96.87:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:11Z I [www-config/172.26.96.87:27017] got epoch {1721734315 330}
2024-07-25T13:03:11Z I [www-config/172.26.97.76:27017] got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
2024-07-25T13:03:11Z I [www-config/172.26.97.76:27017] got epoch {1721734315 330}
2024-07-25T13:03:11Z I [www-config/172.26.97.76:27017] [pitr] got wake_up signal
2024-07-25T13:03:11Z I [www-config/172.26.97.76:27017] [pitr] wake_up for bcp 66a24d0e1e4eba371d786497
2024-07-25T13:03:11Z I [rs-wwww-shard-000/172.26.97.42:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:11Z I [rs-wwww-shard-002/172.26.97.40:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:11Z I [rs-wwww-shard-002/172.26.96.253:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:11Z I [rs-wwww-shard-000/172.26.96.190:27017] [backup/2024-07-25T13:03:10Z] backup started
2024-07-25T13:03:11Z I [www-config/172.26.96.176:27017] [backup/2024-07-25T13:03:10Z] backup started
2024-07-25T13:03:11Z I [rs-wwww-shard-002/172.26.96.31:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:11Z I [rs-wwww-shard-001/172.26.97.51:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:14Z I [www-config/172.26.97.76:27017] [pitr] created chunk 2024-07-24T22:00:01 - 2024-07-25T13:03:11
2024-07-25T13:03:14Z I [www-config/172.26.97.76:27017] [pitr] pausing/stopping with last_ts 2024-07-25 13:03:11 +0000 UTC
2024-07-25T13:03:15Z I [rs-wwww-shard-001/172.26.96.133:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:20Z I [www-config/172.26.97.76:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:23Z I [rs-wwww-shard-000/172.26.96.190:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:23Z I [rs-wwww-shard-000/172.26.96.44:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:23Z I [rs-wwww-shard-001/172.26.96.92:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:23Z I [www-config/172.26.96.87:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:24Z I [www-config/172.26.96.176:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
2024-07-25T13:03:26Z I [rs-wwww-shard-000/172.26.97.42:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
<snip lock messages>
2024-07-25T13:08:13Z I [www-config/172.26.96.176:27017] [backup/2024-07-25T13:03:10Z] mark RS as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-25T13:08:13Z I [www-config/172.26.96.176:27017] [backup/2024-07-25T13:03:10Z] mark backup as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-25T13:08:13Z E [www-config/172.26.96.176:27017] [backup/2024-07-25T13:03:10Z] backup: couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout
2024-07-25T13:08:13Z I [rs-wwww-shard-000/172.26.96.190:27017] [backup/2024-07-25T13:03:10Z] mark RS as error `waiting for running: cluster failed: <nil>`: <nil>
2024-07-25T13:08:13Z E [rs-wwww-shard-000/172.26.96.190:27017] [backup/2024-07-25T13:03:10Z] backup: waiting for running: cluster failed: <nil>
2024-07-25T13:08:15Z I [rs-wwww-shard-001/172.26.96.133:27017] [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]

(I only see “backup started” messages for the configsvr and one shard, which might be the issue, but I have no idea why…)
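(To see which replica sets actually acknowledged this attempt, one check that might help is comparing the lock holders for this opid against the nodes that logged “backup started”; same assumption as above that the control collections are in the admin database on the configsvr replica set:)

// which node in each replset holds the backup lock for this attempt
db.pbmLock.find(
  { opid: "66a24d0e1e4eba371d786497" },
  { replset: 1, node: 1, hb: 1, _id: 0 }
)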

On one of the nodes that was likely chosen, these are the only entries in the pbm-agent journal around that time:

Jul 25 15:03:10 mongo-sh-002b.www.commander.io pbm-agent[340878]: 2024-07-25T15:03:10.000+0200 I got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
Jul 25 15:03:10 mongo-sh-002b.www.commander.io pbm-agent[340878]: 2024-07-25T15:03:10.000+0200 I got epoch {1721734315 330}
Jul 25 15:03:11 mongo-sh-002b.www.commander.io pbm-agent[340878]: 2024-07-25T15:03:11.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
<those keep repeating>

On the shard that did log “backup started” (the chosen node):

Jul 25 15:03:10 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:03:10.000+0200 I got command backup [name: 2024-07-25T13:03:10Z, compression: zstd (level: default)] <ts: 1721912590>
Jul 25 15:03:10 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:03:10.000+0200 I got epoch {1721734315 330}
Jul 25 15:03:11 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:03:11.000+0200 I [backup/2024-07-25T13:03:10Z] backup started
Jul 25 15:03:23 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:03:23.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]
<snip repeating PITR messages>
Jul 25 15:08:13 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:08:13.000+0200 I [backup/2024-07-25T13:03:10Z] mark RS as error `waiting for running: cluster failed: <nil>`: <nil>
Jul 25 15:08:13 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:08:13.000+0200 D [backup/2024-07-25T13:03:10Z] set balancer on
Jul 25 15:08:13 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:08:13.000+0200 E [backup/2024-07-25T13:03:10Z] backup: waiting for running: cluster failed: <nil>
Jul 25 15:08:13 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:08:13.000+0200 D [backup/2024-07-25T13:03:10Z] releasing lock
Jul 25 15:08:23 mongo-sh-000c.www.commander.io pbm-agent[453003]: 2024-07-25T15:08:23.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66a24d0e1e4eba371d786497]

Hi, thank you for all the information that you provided.

For the convergence timeout issue, it would be necessary to have all logs related to the problematic backup before we can comment on anything else. In your case, you can fetch all of them with the following command (all PITR logs will be filtered out):
pbm logs -e backup/2024-07-24T22:00:00Z -s D -t 0

Additionally, it would be great if you could share all log entries from around Tue Jul 23 11:31:55 UTC 2024. That is the date when the problematic locks were created.

Currently, PBM has an issue with stale locks; the open bug can be seen here: [PBM-1326] - Percona JIRA

Thank you for your contribution.

The first set of logs (from pbm logs -e backup/2024-07-24T22:00:00Z -s D -t 0):

root@mongo-sh-config-b ~ # pbm logs -e backup/2024-07-24T22:00:00Z -s D -t 0
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] init backup meta
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination list for www-config: [[172.26.96.87:27017 172.26.96.176:27017] [172.26.97.76:27017]]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-000: [[172.26.96.190:27017 172.26.96.44:27017] [172.26.97.42:27017]]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-001: [[172.26.97.51:27017 172.26.96.92:27017] [172.26.96.133:27017]]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination list for rs-wwww-shard-002: [[172.26.96.31:27017 172.26.97.40:27017] [172.26.96.253:27017]]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-000, set candidates [172.26.96.190:27017 172.26.96.44:27017]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-001, set candidates [172.26.97.51:27017 172.26.96.92:27017]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination rs-wwww-shard-002, set candidates [172.26.96.31:27017 172.26.97.40:27017]
2024-07-24T22:00:00Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] nomination www-config, set candidates [172.26.96.87:27017 172.26.96.176:27017]
2024-07-24T22:00:01Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] backup started
2024-07-24T22:00:01Z D [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] waiting for balancer off
2024-07-24T22:00:01Z D [www-config/172.26.96.176:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:01Z D [rs-wwww-shard-000/172.26.96.190:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:01Z D [rs-wwww-shard-002/172.26.97.40:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:01Z D [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] balancer status: off
2024-07-24T22:00:01Z D [rs-wwww-shard-002/172.26.96.253:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:02Z D [rs-wwww-shard-000/172.26.97.42:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:02Z D [rs-wwww-shard-001/172.26.96.92:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:02Z D [rs-wwww-shard-001/172.26.96.133:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:00:05Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-001 won by 172.26.97.51:27017
2024-07-24T22:00:05Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-000 won by 172.26.96.44:27017
2024-07-24T22:00:05Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] bcp nomination: rs-wwww-shard-002 won by 172.26.96.31:27017
2024-07-24T22:00:05Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] bcp nomination: www-config won by 172.26.96.87:27017
2024-07-24T22:00:06Z D [www-config/172.26.97.76:27017] [backup/2024-07-24T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T22:05:02Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] mark RS as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-24T22:05:03Z I [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] mark backup as error `couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout`: <nil>
2024-07-24T22:05:03Z D [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] set balancer on
2024-07-24T22:05:03Z E [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] backup: couldn't get response from all shards: convergeClusterWithTimeout: 5m0s: reached converge timeout
2024-07-24T22:05:03Z D [www-config/172.26.96.87:27017] [backup/2024-07-24T22:00:00Z] releasing lock

Additionally, it would be great if you could share all log entries from around Tue Jul 23 11:31:55 UTC 2024. That is the date when the problematic locks were created.

It seems like those logs are no longer in the DB. The first entry in the pbmLog collection is from 1721767822 (Tue Jul 23 22:50:22 2024 SAST). (In most cases the locks were released by deleting entries from the pbmLock collection.)
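(The conversion of that value, for reference, done in the mongo shell:)

new Date(1721767822 * 1000)  // ISODate("2024-07-23T20:50:22Z"), i.e. Tue Jul 23 22:50:22 SAST (UTC+2)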

The journalctl log (on one of the configsvr nodes) does show a lock present before and after that timestamp, though:

2024-07-23T00:00:12.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 669e5238024eeba46c897782]
2024-07-23T00:00:27.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 669e5238024eeba46c897782]
<snip>
2024-07-23T23:59:32.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 669e5238024eeba46c897782]
2024-07-23T23:59:47.000+0200 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 669e5238024eeba46c897782]

The backup (and cleanup) at midnight SAST did seem to get that lock released though… (these are the next journal entries after that timestamp):

2024-07-24T00:00:01.000+0200 I got command backup [name: 2024-07-23T22:00:00Z, compression: zstd (level: default)] <ts: 1721772000>
2024-07-24T00:00:01.000+0200 I got epoch {1721734315 330}
2024-07-24T00:00:01.000+0200 I got command cleanup <ts: 1721772000>
2024-07-24T00:00:01.000+0200 I got epoch {1721734315 330}
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] init backup meta
2024-07-24T00:00:01.000+0200 D [cleanup] skip: lock not acquired
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination list for rs-wwww-shard-001: [[172.26.97.51:27017 172.26.96.92:>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination list for rs-wwww-shard-000: [[172.26.96.190:27017 172.26.96.44>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination list for rs-wwww-shard-002: [[172.26.96.31:27017 172.26.97.40:>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination list for www-config: [[172.26.96.87:27017 172.26.96.176:27017]>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-001, set candidates [172.26.97.51:27017 172.26.9>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-000, set candidates [172.26.96.190:27017 172.26.>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-002, set candidates [172.26.96.31:27017 172.26.9>
2024-07-24T00:00:01.000+0200 D [backup/2024-07-23T22:00:00Z] nomination www-config, set candidates [172.26.96.87:27017 172.26.96.176:2>
2024-07-24T00:00:02.000+0200 I [pitr] oplog slicer is paused for lock [Cleanup backups and PITR chunks, opid: 66a027e0ca6ae698210d9eeb]
2024-07-24T00:00:06.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-000, set candidates [172.26.97.42:27017]
2024-07-24T00:00:06.000+0200 D [backup/2024-07-23T22:00:00Z] nomination www-config, set candidates [172.26.97.76:27017]
2024-07-24T00:00:06.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-002, set candidates [172.26.96.253:27017]
2024-07-24T00:00:06.000+0200 D [backup/2024-07-23T22:00:00Z] nomination rs-wwww-shard-001, set candidates [172.26.96.133:27017]
2024-07-24T00:00:17.000+0200 E [pitr] init: get backup metadata: not found
2024-07-24T00:00:26.000+0200 D [backup/2024-07-23T22:00:00Z] nomination timeout
2024-07-24T00:00:26.000+0200 D [backup/2024-07-23T22:00:00Z] skip after nomination, probably started by another node
2024-07-24T00:00:47.000+0200 E [pitr] init: get backup metadata: not found
<snip>
2024-07-24T00:04:47.000+0200 E [pitr] init: get backup metadata: not found