Backup: couldn't get response from all shards: convergeClusterWithTimeout: 33s: reached converge timeout

PBM can't perform a physical backup on a sharded cluster.
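For reference, the failing backup was triggered the usual way. A minimal sketch, assuming the storage config is already applied and `pbm` is pointed at the cluster via `PBM_MONGODB_URI` (the exact invocation on the affected cluster may differ):

```shell
# Start a physical backup on the sharded cluster
# (one agent per replica set is nominated, plus one on the config server replica set)
pbm backup --type=physical

# Watch progress and collect the agent-side log lines shown below
pbm status
pbm logs --tail 50
```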

2024-11-06T08:25:36Z I [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started
2024-11-06T08:25:36Z I [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started
2024-11-06T08:25:36Z D [rs1/ld-mongo-host-rs1-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:25:36Z I [ConfigReplSet/mongo-host-cfg-01.prod.env:27017] got command backup [name: 2024-11-06T08:25:35Z, compression: s2 (level: default)] <ts: 1730881535>
2024-11-06T08:25:36Z I [ConfigReplSet/mongo-host-cfg-01.prod.env:27017] got epoch {1729153882 2506}
2024-11-06T08:25:36Z D [rs1/am-mongo-host-rs1-01.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:25:36Z I [rs2/am-mongo-host-rs2-01.prod.env:27017] got command backup [name: 2024-11-06T08:25:35Z, compression: s2 (level: default)] <ts: 1730881535>
2024-11-06T08:25:36Z I [rs2/am-mongo-host-rs2-01.prod.env:27017] got epoch {1729153882 2506}
2024-11-06T08:25:36Z I [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started
2024-11-06T08:25:36Z D [ConfigReplSet/mongo-host-cfg-01.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:25:37Z D [rs2/ld-mongo-host-rs2-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:25:37Z D [rs2/am-mongo-host-rs2-01.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:25:37Z D [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup cursor id: 396b36bf-e18a-48f9-bfc1-9c1b3b050866
2024-11-06T08:25:40Z D [ConfigReplSet/mongo-host-cfg-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] bcp nomination: rs2 won by am-mongo-host-rs2-02.prod.env:27017
2024-11-06T08:25:40Z D [ConfigReplSet/mongo-host-cfg-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] bcp nomination: rs1 won by am-mongo-host-rs1-02.prod.env:27017
2024-11-06T08:25:41Z D [ConfigReplSet/mongo-host-cfg-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] skip after nomination, probably started by another node
2024-11-06T08:26:10Z D [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] stop cursor polling: , cursor err:
2024-11-06T08:26:10Z I [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] mark RS as error couldn't get response from all shards: convergeClusterWithTimeout: 33s: reached converge timeout:
2024-11-06T08:26:10Z I [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] mark backup as error couldn't get response from all shards: convergeClusterWithTimeout: 33s: reached converge timeout:
2024-11-06T08:26:10Z E [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup: couldn’t get response from all shards: convergeClusterWithTimeout: 33s: reached converge timeout
2024-11-06T08:26:10Z D [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] releasing lock
2024-11-06T08:26:20Z I [ConfigReplSet/mongo-host-cfg-02.prod.env:27017] [pitr] created chunk 2024-11-06T08:16:20 - 2024-11-06T08:26:20. Next chunk creation scheduled to begin at ~2024-11-06 08:36:20.955994635 +0000 GMT m=+1813830.482446846
2024-11-06T08:26:24Z I [rs2/am-mongo-host-rs2-02.prod.env:27017] [pitr] created chunk 2024-11-06T08:16:20 - 2024-11-06T08:26:20. Next chunk creation scheduled to begin at ~2024-11-06 08:36:24.664786273 +0000 UTC m=+1813834.935129087
2024-11-06T08:26:29Z D [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] a checkpoint took place, retrying
2024-11-06T08:26:30Z D [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup cursor id: ea45988e-0b26-4a34-a681-dee3f1420eab
2024-11-06T08:26:31Z D [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] stop cursor polling: , cursor err:
2024-11-06T08:26:31Z I [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] mark RS as error waiting for running: cluster failed: <nil>:
2024-11-06T08:26:31Z E [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup: waiting for running: cluster failed:
2024-11-06T08:26:31Z D [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] releasing lock
2024-11-06T08:26:39Z I [rs1/am-mongo-host-rs1-02.prod.env:27017] [pitr] created chunk 2024-11-06T08:16:35 - 2024-11-06T08:26:35. Next chunk creation scheduled to begin at ~2024-11-06 08:36:39.864908106 +0000 UTC m=+1813850.162375844
2024-11-06T08:26:47Z D [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] a checkpoint took place, retrying
2024-11-06T08:26:48Z D [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup cursor id: af412f2a-1d45-458b-a839-4e3a3baecfb9
2024-11-06T08:26:49Z D [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] stop cursor polling: , cursor err:
2024-11-06T08:26:49Z I [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] mark RS as error waiting for running: backup stuck, last beat ts: 1730881566:
2024-11-06T08:26:49Z E [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup: waiting for running: backup stuck, last beat ts: 1730881566
2024-11-06T08:26:49Z D [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] releasing lock

pbm status

Cluster:

rs2:
  • rs2/am-mongo-host-rs2-01.prod.env:27017 [P]: pbm-agent v2.5.0 OK
  • rs2/ld-mongo-host-rs2-03.prod.env:27017 [S]: pbm-agent v2.5.0 OK
  • rs2/am-mongo-host-rs2-02.prod.env:27017 [S]: pbm-agent v2.5.0 OK
rs1:
  • rs1/am-mongo-host-rs1-01.prod.env:27017 [P]: pbm-agent v2.5.0 OK
  • rs1/am-mongo-host-rs1-02.prod.env:27017 [S]: pbm-agent v2.5.0 OK
  • rs1/ld-mongo-host-rs1-03.prod.env:27017 [S]: pbm-agent v2.5.0 OK
ConfigReplSet:
  • ConfigReplSet/mongo-host-cfg-01.prod.env:27017 [S]: pbm-agent v2.5.0 OK
  • ConfigReplSet/mongo-host-cfg-02.prod.env:27017 [P]: pbm-agent v2.5.0 OK
  • ConfigReplSet/mongo-host-cfg-03.prod.env:27017 [S]: pbm-agent v2.5.0 OK

pbm-agent v2.5.0

/usr/bin/mongod --version
db version v6.0.15-12
Build Info: {
    "version": "6.0.15-12",
    "gitVersion": "2c4ff0c994742506096fae92dc182d61380c2854",
    "openSSLVersion": "OpenSSL 1.0.2k-fips 26 Jan 2017",
    "modules": [],
    "proFeatures": [],
    "allocator": "tcmalloc",
    "environment": {
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}

These are the nodes that got selected to run the backup:

2024-11-06T08:25:36Z I [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started
2024-11-06T08:25:36Z I [rs1/am-mongo-host-rs1-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started
2024-11-06T08:25:36Z I [rs2/am-mongo-host-rs2-02.prod.env:27017] [backup/2024-11-06T08:25:35Z] backup started

however:

2024-11-06T08:26:10Z I [ConfigReplSet/mongo-host-cfg-03.prod.env:27017] [backup/2024-11-06T08:25:35Z] mark backup as error couldn't get response from all shards: convergeClusterWithTimeout: 33s: reached converge timeout:

This means one of the backup agents stopped making progress, so there is likely an error on that node. Check the pbm-agent logs on each of the three selected nodes for further details:

journalctl -u pbm-agent
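To narrow the search to the failed run, journalctl can be scoped to the backup window. A sketch, with the timestamps taken from the log excerpt above (adjust them to the run you are investigating):

```shell
# On each of the three nominated nodes, show only the failed backup's window
journalctl -u pbm-agent --since "2024-11-06 08:25:30" --until "2024-11-06 08:27:00" --no-pager

# Or follow the agent live while retrying the backup
journalctl -u pbm-agent -f
```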