Error : "couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout

Hello Team,

I have configured PBM 1.3.4 on my STG (staging) sharded cluster.
My shard setup:
3-node config server replica set
2 shards, each a 3-node PSS replica set
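
For context, the topology and the shard registrations PBM has to work with can be confirmed from any mongos with sh.status() (a minimal check; only the relevant part of the output is described):

mongos> sh.status()   // the "shards" section lists each shard _id and its replica set/host string, i.e. the same entries stored in config.shards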

When I try to run a backup, I get the error below.
I get the same error even when I run the “pbm list” command.

Failed with "couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout

When we checked the PBM logs on the shard nodes, we can see that the backup is running, but on the config replica set we see the error above, and the pbm list command fails with the same error, so we cannot fetch the backup details.

Could you please help me fix this issue?

Thank you

Please see the details below:
Backup command executed:
[root@config-rep1:~$ ] pbm backup --compression=s2
Starting backup ‘2024-05-15T11:55:58Z’…
Error starting backup: no confirmation that backup has successfully started. Replsets status:

  • Backup on replicaset “rsShard2” in state: running
  • Backup on replicaset “configReplSet” in state: running
  • Backup on replicaset “rsShard1” in state: running

==================================
Error in the config server pbm-agent log:

[root@config-rep2:~$ ] tail -f /var/log/pbm-agent.log
GitBranch: release-1.3.4
BuildTime: 2020-11-17_15:43_UTC
GoVersion: go1.14.2
2024-05-15T11:54:58.000+0000 [INFO] node: configReplSet/config-rep2.staging.moblize.com:27019
2024-05-15T11:54:58.000+0000 [INFO] listening for the commands
2024-05-15T11:54:58.000+0000 [INFO] starting PITR routine
2024-05-15T11:55:49.000+0000 [INFO] got command delete <ts: 1715774149>
2024-05-15T11:55:49.000+0000 [INFO] delete/2024-05-15T11:52:48Z: deleting backup
2024-05-15T11:55:54.000+0000 [INFO] delete/2024-05-15T11:52:48Z: done
2024-05-15T11:55:59.000+0000 [INFO] got command backup [name: 2024-05-15T11:55:58Z, compression: s2] <ts: 1715774158>

2024-05-15T11:56:16.000+0000 [INFO] backup/2024-05-15T11:55:58Z: backup started
2024-05-15T11:56:50.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mark backup as error couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout:
2024-05-15T11:56:50.000+0000 [ERROR] backup/2024-05-15T11:55:58Z: backup: couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout

====================
Shard1 backup status from pbm logs:

Shard1:

2024/05/15 12:03:50 [##################…] staging.logData 1771600/2292541 (77.3%)
2024/05/15 12:04:50 [#####################…] staging.logData 2084129/2292541 (90.9%)
2024/05/15 12:05:25 [########################] staging.logData 2292541/2292541 (100.0%)
2024-05-15T12:05:25.080+0000 Mux close namespace staging.logData
2024-05-15T12:05:25.080+0000 done dumping staging.logData (2292541 documents)
2024-05-15T12:05:25.081+0000 dump phase III: the oplog
2024-05-15T12:05:25.081+0000 finishing dump
2024-05-15T12:05:25.081+0000 Mux finish
2024-05-15T12:05:25.081+0000 mux completed successfully
2024-05-15T12:05:25.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mongodump finished, waiting for the oplog

=====================================
Shard2 backup status from pbm logs:

2024/05/15 12:01:50 [###############…] staging.logData 1474070/2224070 (66.3%)
2024/05/15 12:02:50 [##################…] staging.logData 1688545/2224070 (75.9%)
2024/05/15 12:03:50 [#####################…] staging.logData 1993588/2224070 (89.6%)
2024/05/15 12:04:26 [########################] staging.logData 2224070/2224070 (100.0%)
2024-05-15T12:04:26.448+0000 Mux close namespace staging.logData
2024-05-15T12:04:26.448+0000 done dumping staging.logData (2224070 documents)
2024-05-15T12:04:26.449+0000 dump phase III: the oplog
2024-05-15T12:04:26.449+0000 finishing dump
2024-05-15T12:04:26.449+0000 Mux finish
2024-05-15T12:04:26.449+0000 mux completed successfully
2024-05-15T12:04:26.000+0000 [INFO] backup/2024-05-15T11:55:58Z: mongodump finished, waiting for the oplog
2024-05-15T12:04:28.000+0000 [ERROR] backup/2024-05-15T11:55:58Z: backup: waiting for dump done: backup stuck, last beat ts: 1715774205

=================================
Backup list from the config server:

[root@config-rep1:~$ ] pbm list
Backup snapshots:
2024-05-15T11:55:58Z Failed with “couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout”

The shard backups can be seen in the S3 storage:
[root@config-rep1:~$ ] aws s3 ls s3://bucketname/data/pbm/stgbackup/
2024-05-15 11:56:51 19433039870 2024-05-15T11:55:58Z_rsShard1.dump.s2
2024-05-15 11:56:52 19811250732 2024-05-15T11:55:58Z_rsShard2.dump.s2
[root@config-rep1:~$ ]
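
For completeness, everything stored for this snapshot can be listed by filtering the same prefix (a hypothetical follow-up check; it assumes any config replica set dump and backup metadata would sit under the same prefix and naming pattern as the shard dumps above):

[root@config-rep1:~$ ] aws s3 ls s3://bucketname/data/pbm/stgbackup/ | grep 2024-05-15T11:55:58Z

Seeing only the two shard dumps and nothing for configReplSet would be consistent with the config server agent marking the backup as error before finishing its part.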

Hi, this error appears when one of the shards is unresponsive or one of the pbm-agents is stuck. I suggest restarting all agents and trying again.
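
For example (a minimal sketch, assuming pbm-agent was installed from the Percona packages and runs as the systemd service named pbm-agent; repeat on every mongod node of both shards and of the config replica set):

[root@shard01-rep1:~$ ] systemctl restart pbm-agent    # restart the agent on this node
[root@shard01-rep1:~$ ] systemctl status pbm-agent     # confirm it came back up and is listening for commands
[root@config-rep1:~$ ] pbm list                        # then re-check from a node where the pbm CLI is configured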

I tried restarting all the pbm-agents on all the shard nodes and on the config replica set servers, but no luck; I am still getting the same error.

Hello Team,

We noticed that the shard _id value in the config server metadata is different from the actual replica set name of the shard.

See the details below from the config server.

configReplSet:PRIMARY> db.shards.find()
{ "_id" : "shard0000", "host" : "rsShard1/shard01-rep1.staging.example.com:27017,shard01-rep2.staging.example.com:27017,shard01-rep3.staging.example.com:27017", "state" : 1 }
{ "_id" : "shard0001", "host" : "rsShard2/shard02-rep1.staging.example.com:27017,shard02-rep2.staging.example.com:27017,shard02-rep3.staging.example.com:27017", "state" : 1 }

The actual replica set names, as shown by the shell prompts on the shard primaries:

rsShard1:PRIMARY>
rsShard2:PRIMARY>
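
To confirm the replica set names explicitly rather than relying on the shell prompt, you can query each shard primary directly (a minimal check in the mongo shell; both commands return the same thing):

rsShard1:PRIMARY> rs.status().set        // prints the replica set name, here rsShard1
rsShard2:PRIMARY> db.isMaster().setName  // equivalent check on the other shard, here rsShard2

So the _id values registered in config.shards (shard0000, shard0001) differ from the replica set names (rsShard1, rsShard2) that appear in the host strings and in the dump file names in S3.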

Could you please check and let us know if this could be the cause of the issue?

Thank you

Can someone please help us here? Our implementation is stuck due to this issue.

Could someone look into the issue and suggest the next steps to fix it?