Backup failed with ERROR: couldn’t get response from all shards

An error appeared when creating a backup:

2023-04-24T06:55:31Z 3.53MB [ERROR: check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512] [2023-04-24T20:52:23Z]

All cluster members report pbm-agent v2.1.0 with OK status:


[mongod@ip-10-80-10-113 ~]$ pbm status
Cluster:

shard3ReplSet:

  • shard3ReplSet/pmgo-pl206.int.compumark.com:27018 [P]: pbm-agent v2.1.0 OK
  • shard3ReplSet/pmgo-pl205.int.compumark.com:27018 [S]: pbm-agent v2.1.0 OK
  • shard3ReplSet/pmgo-pl201.int.compumark.com:27028 [S]: pbm-agent v2.1.0 OK
configReplSet:
  • configReplSet/pmgo-pl204.int.compumark.com:27019 [S]: pbm-agent v2.1.0 OK
  • configReplSet/pmgo-pl202.int.compumark.com:27019 [P]: pbm-agent v2.1.0 OK
  • configReplSet/pmgo-pl206.int.compumark.com:27029 [S]: pbm-agent v2.1.0 OK
shard2ReplSet:
  • shard2ReplSet/pmgo-pl204.int.compumark.com:27018 [P]: pbm-agent v2.1.0 OK
  • shard2ReplSet/pmgo-pl202.int.compumark.com:27018 [S]: pbm-agent v2.1.0 OK
  • shard2ReplSet/pmgo-pl205.int.compumark.com:27028 [S]: pbm-agent v2.1.0 OK
shard1ReplSet:
  • shard1ReplSet/pmgo-pl203.int.compumark.com:27018 [P]: pbm-agent v2.1.0 OK
  • shard1ReplSet/pmgo-pl201.int.compumark.com:27018 [S]: pbm-agent v2.1.0 OK
  • shard1ReplSet/pmgo-pl202.int.compumark.com:27028 [S]: pbm-agent v2.1.0 OK

PITR incremental backup:

Status [OFF]

Currently running:

(none)

Backups:

S3 us-east-1 s3://cm-mongo-db-shared-prod-va/pbm/backup/
Snapshots:
2023-04-24T06:55:31Z 3.53MB [ERROR: check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512] [2023-04-24T20:52:23Z]
2023-04-17T10:48:36Z 2.55TB [restore_to_time: 2023-04-18T02:45:06Z]
2023-04-05T14:48:57Z 2.45TB [restore_to_time: 2023-04-06T03:41:38Z]
PITR chunks [0.00B]:
2023-04-06T03:41:39Z - 2023-04-17T10:45:23Z

Logs below:

[mongod@ip-10-80-10-113 ~]$ pbm logs -t 1000 -s D --event=backup/2023-04-24T06:55:31Z
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] init backup meta
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination list for shard3ReplSet: [[pmgo-pl201.int.compumark.com:27028] [pmgo-pl205.int.compumark.com:27018] [pmgo-pl206.int.compumark.com:27018]]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination list for configReplSet: [[pmgo-pl206.int.compumark.com:27029 pmgo-pl204.int.compumark.com:27019] [pmgo-pl202.int.compumark.com:27019]]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination list for shard1ReplSet: [[pmgo-pl202.int.compumark.com:27028] [pmgo-pl201.int.compumark.com:27018] [pmgo-pl203.int.compumark.com:27018]]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination list for shard2ReplSet: [[pmgo-pl205.int.compumark.com:27028] [pmgo-pl202.int.compumark.com:27018] [pmgo-pl204.int.compumark.com:27018]]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination shard3ReplSet, set candidates [pmgo-pl201.int.compumark.com:27028]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination configReplSet, set candidates [pmgo-pl206.int.compumark.com:27029 pmgo-pl204.int.compumark.com:27019]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination shard1ReplSet, set candidates [pmgo-pl202.int.compumark.com:27028]
2023-04-24T06:55:32Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] nomination shard2ReplSet, set candidates [pmgo-pl205.int.compumark.com:27028]
2023-04-24T06:55:32Z I [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup started
2023-04-24T06:55:32Z D [shard3ReplSet/pmgo-pl205.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:32Z I [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] backup started
2023-04-24T06:55:32Z D [shard3ReplSet/pmgo-pl206.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z D [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z I [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup started
2023-04-24T06:55:33Z I [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup started
2023-04-24T06:55:33Z D [shard2ReplSet/pmgo-pl204.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z D [shard1ReplSet/pmgo-pl203.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z D [shard1ReplSet/pmgo-pl201.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:33Z D [shard2ReplSet/pmgo-pl202.int.compumark.com:27018] [backup/2023-04-24T06:55:31Z] skip after nomination, probably started by another node
2023-04-24T06:55:37Z D [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] wait for tmp users {1682319337 8}
2023-04-24T06:55:37Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] bcp nomination: configReplSet won by pmgo-pl204.int.compumark.com:27019
2023-04-24T06:55:37Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] bcp nomination: shard3ReplSet won by pmgo-pl201.int.compumark.com:27028
2023-04-24T06:55:37Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] bcp nomination: shard1ReplSet won by pmgo-pl202.int.compumark.com:27028
2023-04-24T06:55:37Z D [configReplSet/pmgo-pl202.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] bcp nomination: shard2ReplSet won by pmgo-pl205.int.compumark.com:27028
2023-04-24T06:55:37Z D [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] wait for tmp users {1682319337 9}
2023-04-24T06:55:38Z D [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] wait for tmp users {1682319338 3}
2023-04-24T06:55:38Z D [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] wait for tmp users {1682319338 41}
2023-04-24T06:55:40Z I [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] mongodump finished, waiting for the oplog
2023-04-24T20:52:23Z I [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] dropping tmp collections
2023-04-24T20:52:23Z I [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] mark RS as error check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512:
2023-04-24T20:52:23Z I [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] mark backup as error check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512:
2023-04-24T20:52:23Z E [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] backup: check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512
2023-04-24T20:52:23Z D [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] releasing lock
2023-04-25T04:32:30Z I [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mongodump finished, waiting for the oplog
2023-04-25T04:32:31Z I [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] dropping tmp collections
2023-04-25T04:32:31Z I [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1682369541:
2023-04-25T04:32:31Z E [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541
2023-04-25T04:32:31Z D [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] releasing lock
2023-04-25T04:41:04Z I [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mongodump finished, waiting for the oplog
2023-04-25T04:41:05Z I [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] dropping tmp collections
2023-04-25T04:41:05Z I [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1682369541:
2023-04-25T04:41:05Z E [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541
2023-04-25T04:41:05Z D [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] releasing lock
2023-04-25T04:48:29Z I [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mongodump finished, waiting for the oplog
2023-04-25T04:48:30Z I [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] dropping tmp collections
2023-04-25T04:48:30Z I [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1682369541:
2023-04-25T04:48:30Z E [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541
2023-04-25T04:48:30Z D [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] releasing lock
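To make the timeline in these logs easier to read, the lines can be parsed and the time from backup start to each replica set's first error computed. This is only a sketch: the line layout is inferred from the output above rather than taken from PBM documentation, and `LINE_RE`, `parse_line` and `first_error_elapsed` are hypothetical helper names.

```python
import re
from datetime import datetime, timezone

# Assumed layout of a `pbm logs` line, inferred from the output above:
# <timestamp> <severity> [<replset>/<host:port>] [<event>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<sev>[A-Z]) \[(?P<rs>[^/\]]+)/(?P<node>[^\]]+)\] "
    r"\[(?P<event>[^\]]+)\] (?P<msg>.*)$"
)


def parse_time(text):
    """Parse a timestamp such as 2023-04-24T06:55:31Z as an aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = parse_time(rec["ts"])
    return rec


def first_error_elapsed(lines):
    """Map each replica set to the time between backup start and its first E line."""
    elapsed = {}
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["sev"] != "E":
            continue
        # The event looks like "backup/<start time>"; the start time is read from it.
        started = parse_time(rec["event"].split("/", 1)[1])
        elapsed.setdefault(rec["rs"], rec["ts"] - started)
    return elapsed


# The four E lines from the log above, copied verbatim.
SAMPLE = [
    "2023-04-24T20:52:23Z E [configReplSet/pmgo-pl204.int.compumark.com:27019] [backup/2023-04-24T06:55:31Z] backup: check cluster for dump done: convergeCluster: lost shard configReplSet, last beat ts: 1682369512",
    "2023-04-25T04:32:31Z E [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541",
    "2023-04-25T04:41:05Z E [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541",
    "2023-04-25T04:48:30Z E [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-04-24T06:55:31Z] backup: waiting for dump done: backup stuck, last beat ts: 1682369541",
]

for rs, delta in first_error_elapsed(SAMPLE).items():
    print(rs, delta)
```

On these lines the config replica set fails roughly 14 hours after the backup started, while the three data shards report their errors only after their own mongodump finished, more than 21 hours in.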

Hi, there seems to be a problem with the config replica set. Did you check that the pbm-agent can reach it and that it is in a good state?
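To narrow down when to look, it may help to decode the `last beat ts` value from the error. The sketch below assumes the value is Unix epoch seconds; that is not stated in the output, but it is consistent with the `wait for tmp users {1682319337 8}` line, whose first number decodes to the time it was logged. `beat_to_utc` and `parse_log_time` are hypothetical helper names.

```python
from datetime import datetime, timezone


def beat_to_utc(ts):
    """Interpret a 'last beat ts' value as Unix epoch seconds (an assumption)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def parse_log_time(text):
    """Parse a log timestamp such as 2023-04-24T20:52:23Z as an aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Values copied from the error line for configReplSet.
last_beat = beat_to_utc(1682369512)
marked_lost = parse_log_time("2023-04-24T20:52:23Z")

print(last_beat.isoformat())                        # 2023-04-24T20:51:52+00:00
print((marked_lost - last_beat).total_seconds())    # 31.0
```

Under that assumption, the last heartbeat recorded for configReplSet was at 20:51:52Z, about 31 seconds before it was marked as lost. The pbm-agent and mongod logs on pmgo-pl204 around that time would be a reasonable place to start looking.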