Error : "couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout

Hello Team,

I have configured PBM 1.3.4 on my STG (staging) sharded cluster.
My shard setup:
3-node config server replica set
2 shards, each a 3-node PSS replica set
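
For context, the topology and the shard registrations PBM has to work with can be confirmed from any mongos with sh.status() (a minimal check; only the relevant part of the output is described):

mongos> sh.status()   // the "shards" section lists each shard _id and its replica set/host string, i.e. the same entries stored in config.shards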

When I try to run a backup, I get the error below.
I get the same error even when I run the “pbm list” command.

Failed with "couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout

When we checked the PBM logs on the shard nodes, we can see that the backup is running, but on the config replica set we see the error above, and the pbm list command fails with the same error, so we cannot fetch the backup details.

Could you please help me fix this issue?

Thank you

Please see the details below:
Backup command executed:
[root@config-rep1:~$ ] pbm backup --compression=s2
Starting backup ‘2024-05-15T11:55:58Z’…
Error starting backup: no confirmation that backup has successfully started. Replsets status:

  • Backup on replicaset “rsShard2” in state: running
  • Backup on replicaset “configReplSet” in state: running
  • Backup on replicaset “rsShard1” in state: running

==================================
Error in the config server pbm-agent log:

[root@config-rep2:~$ ] tail -f /var/log/pbm-agent.log
GitBranch: release-1.3.4
BuildTime: 2020-11-17_15:43_UTC
GoVersion: go1.14.2
2024-05-15T11:54:58.000+0000 [INFO] node: configReplSet/config-rep2.staging.moblize.com:27019
2024-05-15T11:54:58.000+0000 [INFO] listening for the commands
2024-05-15T11:54:58.000+0000 [INFO] starting PITR routine
2024-05-15T11:55:49.000+0000 [INFO] got command delete <ts: 1715774149>
2024-05-15T11:55:49.000+0000 [INFO] delete/2024-05-15T11:52:48Z: deleting backup
2024-05-15T11:55:54.000+0000 [INFO] delete/2024-05-15T11:52:48Z: done
2024-05-15T11:55:59.000+0000 [INFO] got command backup [name: 2024-05-15T11:55:58Z, compression: s2] <ts: 1715774158>

2024-05-15T11:56:16.000+0000 [INFO] backup/2024-05-15T11:55:58Z: backup started
2024-05-15T11:56:50.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mark backup as error couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout:
2024-05-15T11:56:50.000+0000 [ERROR] backup/2024-05-15T11:55:58Z: backup: couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout

====================
Shard1 backup status from pbm logs:

Shard1:

2024/05/15 12:03:50 [##################…] staging.logData 1771600/2292541 (77.3%)
2024/05/15 12:04:50 [#####################…] staging.logData 2084129/2292541 (90.9%)
2024/05/15 12:05:25 [########################] staging.logData 2292541/2292541 (100.0%)
2024-05-15T12:05:25.080+0000 Mux close namespace staging.logData
2024-05-15T12:05:25.080+0000 done dumping staging.logData (2292541 documents)
2024-05-15T12:05:25.081+0000 dump phase III: the oplog
2024-05-15T12:05:25.081+0000 finishing dump
2024-05-15T12:05:25.081+0000 Mux finish
2024-05-15T12:05:25.081+0000 mux completed successfully
2024-05-15T12:05:25.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mongodump finished, waiting for the oplog

=====================================
Shard2 backup status from pbm logs:

2024/05/15 12:01:50 [###############…] staging.logData 1474070/2224070 (66.3%)
2024/05/15 12:02:50 [##################…] staging.logData 1688545/2224070 (75.9%)
2024/05/15 12:03:50 [#####################…] staging.logData 1993588/2224070 (89.6%)
2024/05/15 12:04:26 [########################] staging.logData 2224070/2224070 (100.0%)
2024-05-15T12:04:26.448+0000 Mux close namespace staging.logData
2024-05-15T12:04:26.448+0000 done dumping staging.logData (2224070 documents)
2024-05-15T12:04:26.449+0000 dump phase III: the oplog
2024-05-15T12:04:26.449+0000 finishing dump
2024-05-15T12:04:26.449+0000 Mux finish
2024-05-15T12:04:26.449+0000 mux completed successfully
2024-05-15T12:04:26.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mongodump finished, waiting for the oplog
2024-05-15T12:04:28.000+0000 [ERROR] backup/2024-05-15T11:55:58Z: backup: waiting for dump done: backup stuck, last beat ts: 1715774205

=================================
Backup list from the config server:

[root@config-rep1:~$ ] pbm list
Backup snapshots:
2024-05-15T11:55:58Z Failed with “couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout”

The shard backups can be seen in the S3 storage:
[root@config-rep1:~$ ] aws s3 ls s3://bucketname/data/pbm/stgbackup/
2024-05-15 11:56:51 19433039870 2024-05-15T11:55:58Z_rsShard1.dump.s2
2024-05-15 11:56:52 19811250732 2024-05-15T11:55:58Z_rsShard2.dump.s2
[root@config-rep1:~$ ]
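
For completeness, everything stored for this snapshot can be listed by filtering the same prefix (a hypothetical follow-up check; it assumes any config replica set dump and backup metadata would sit under the same prefix and naming pattern as the shard dumps above):

[root@config-rep1:~$ ] aws s3 ls s3://bucketname/data/pbm/stgbackup/ | grep 2024-05-15T11:55:58Z

Seeing only the two shard dumps and nothing for configReplSet would be consistent with the config server agent marking the backup as error before finishing its part.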

Hi, this error appears when one of the shards is unresponsive or one of the pbm-agents is stuck. I suggest restarting all agents and trying again.
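
For example (a minimal sketch, assuming pbm-agent was installed from the Percona packages and runs as the systemd service named pbm-agent; repeat on every mongod node of both shards and of the config replica set):

[root@shard01-rep1:~$ ] systemctl restart pbm-agent    # restart the agent on this node
[root@shard01-rep1:~$ ] systemctl status pbm-agent     # confirm it came back up and is listening for commands
[root@config-rep1:~$ ] pbm list                        # then re-check from a node where the pbm CLI is configured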

I tried restarting all the pbm-agents on all the shard nodes and on the config replica set servers, but no luck; I am still getting the same error.

Hello Team,

We noticed that the shard _id value in the config server metadata is different from the actual replica set name of the shard.

See the details below from the config server.

configReplSet:PRIMARY> db.shards.find()
{ "_id" : "shard0000", "host" : "rsShard1/shard01-rep1.staging.example.com:27017,shard01-rep2.staging.example.com:27017,shard01-rep3.staging.example.com:27017", "state" : 1 }
{ "_id" : "shard0001", "host" : "rsShard2/shard02-rep1.staging.example.com:27017,shard02-rep2.staging.example.com:27017,shard02-rep3.staging.example.com:27017", "state" : 1 }

The actual replica set names, as shown by the shell prompts on the shard primaries:

rsShard1:PRIMARY>
rsShard2:PRIMARY>
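
To confirm the replica set names explicitly rather than relying on the shell prompt, you can query each shard primary directly (a minimal check in the mongo shell; both commands return the same thing):

rsShard1:PRIMARY> rs.status().set        // prints the replica set name, here rsShard1
rsShard2:PRIMARY> db.isMaster().setName  // equivalent check on the other shard, here rsShard2

So the _id values registered in config.shards (shard0000, shard0001) differ from the replica set names (rsShard1, rsShard2) that appear in the host strings and in the dump file names in S3.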

Could you please check and let us know if this could be the cause of the issue?

Thank you

Can someone please help us here? Our implementation is stuck due to this issue.

Could someone look into the issue and suggest the next steps to fix it?