ERROR: couldn’t get response from all shards: convergeClusterWithTimeout:

aranjith0 · October 28, 2023, 1:35pm

An error appeared when creating a backup:

2023-10-28T02:00:02Z 0.00B [ERROR: couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout] [2023-10-28T02:00:37Z]

pbm-agent v2.0.5

What can cause this error, and how can I fix it so that it does not appear in the future?
The full backup runs every week and always fails on the scheduled run, but when I rerun it, the backup starts without any issues.

aranjith0 · October 28, 2023, 1:36pm

I checked that pbm-agent is running properly on all cluster members. All agents show “OK” status.

This troubles us a lot, as I scheduled the backups in a cron job. The scheduled run always fails and I have to rerun it manually, so I end up working every weekend just to monitor the backup job. Does this mean scheduling a backup doesn’t work?

[mongod@ip-10-80-10-113 ~]$ pbm status
Cluster:
========
shard2ReplSet:
  - shard2ReplSet/pmgo-pl204.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard2ReplSet/pmgo-pl202.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard2ReplSet/pmgo-pl205.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK
shard1ReplSet:
  - shard1ReplSet/pmgo-pl203.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard1ReplSet/pmgo-pl201.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard1ReplSet/pmgo-pl202.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK
configReplSet:
  - configReplSet/pmgo-pl204.int.compumark.com:27019 [S]: pbm-agent v2.0.5 OK
  - configReplSet/pmgo-pl202.int.compumark.com:27019 [P]: pbm-agent v2.0.5 OK
  - configReplSet/pmgo-pl206.int.compumark.com:27029 [S]: pbm-agent v2.0.5 OK
shard3ReplSet:
  - shard3ReplSet/pmgo-pl206.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard3ReplSet/pmgo-pl205.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard3ReplSet/pmgo-pl201.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://cm-mongo-db-shared-prod-va/pbm/backup/
  Snapshots:
    2023-10-28T02:00:02Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout] [2023-10-28T02:00:37Z]
    2023-10-14T15:33:02Z 2.20TB <logical> [restore_to_time: 2023-10-15T03:24:09Z]

Mukesh_Kumar · October 29, 2023, 12:23pm

Hi,
First of all, I would suggest upgrading PBM to the latest version, 2.3.0.
Also, when creating a physical backup, opening the backup cursor takes longer than it does for a usual logical backup. Use the backup.timeouts.startingStatus parameter to raise the default timeout of 33 seconds. However, this parameter was added in 2.2.1, so you need to upgrade PBM before you can use it.

pbm config --set backup.timeouts.startingStatus=120
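
Since a manual rerun reportedly succeeds, a retry wrapper around the cron command could serve as a stopgap until the upgrade is done. The following is only a sketch: the retry function name, the attempt count, and the delay are illustrative, and it assumes that pbm backup exits with a non-zero status when the converge timeout is hit, which should be verified on your version.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a cron script (not part of PBM).
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    # Success: stop retrying and report success to the caller.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report the failure so cron can alert on it.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example call from the cron script (assumption: a failed start
# makes `pbm backup` return a non-zero exit code):
# retry 3 60 pbm backup
```

The call is left commented out so that the helper can be sourced and tried with a harmless command first.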

Thanks,
Mukesh

aranjith0 · November 10, 2023, 3:40pm

Thanks.
I tried to make the change on version 2.2.1, but I don’t see the change reflected in the config file.

[mongod@tora-pl211 scripts]$ pbm config --set backup.timeouts.startingStatus=120
[backup.timeouts.startingStatus=120]
[mongod@tora-pl211 scripts]$ cat /etc/pbm_config.yaml
pitr:
  enabled: false
  oplogSpanMin: 0
  compression: s2
storage:
  type: s3
  s3:
    region: us-east-1
    bucket: *****************
    prefix: percona/backup/
    credentials:
      access-key-id: **************
      secret-access-key: *******************
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
    retryer:
      numMaxRetries: 10
      minRetryDelay: 60
      maxRetryDelay: 60

Mukesh_Kumar · November 12, 2023, 6:30am

Hi,
Yes, pbm config --set won’t edit the pbm_config.yaml file; it only updates the configuration that PBM stores in the database.
You can review your PBM config changes as shown below:

pbm config | grep backup -A6
backup:
  priority:
    mk-rs1-db1:27017: 0.5
    mk-rs1-db2:27017: 1
  timeouts:
    startingStatus: 120
  compression: s2

If you also want the change to appear in the pbm_config.yaml file, add an entry there as shown below.

cat /etc/pbm_config.yaml | grep backup -A7
backup:
  priority:
    node-db1:27017: 0.5
    node-db2:27017: 1
    node-db3:27017: 1
  compression: s2
  timeouts:
    startingStatus: 120

aranjith0 · November 14, 2023, 5:35am

OK, thanks @Mukesh_Kumar.
I am also seeing the messages below in the logs while restoring the backups to a new cluster.

2023-11-13T14:57:27Z W [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] retryChunk got copy: context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 0s
2023-11-13T14:57:27Z I [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] session recreated, resuming download

Mukesh_Kumar · November 14, 2023, 6:04am

Hi,
As this is a different issue from the one discussed in this thread, please open a separate, new topic for it.

Thanks.
