ERROR: couldn’t get response from all shards: convergeClusterWithTimeout:

An error appeared when creating a backup

2023-10-28T02:00:02Z 0.00B [ERROR: couldn’t get response from all shards: convergeClusterWithTimeout: reached converge timeout] [2023-10-28T02:00:37Z]

pbm-agent v2.0.5

What can cause this error, and how can we fix it so that it does not recur?
A full backup runs every week, and the scheduled run always fails. When I rerun it, the backup starts without any issues.

I checked that pbm-agent is running properly on all cluster members: every agent shows an “OK” status.

This causes us a lot of trouble because I scheduled the backups as a cron job. The scheduled run always fails and I have to rerun it manually, which means working every weekend just to monitor the backup job. Does this mean scheduled backups don’t work?

[mongod@ip-10-80-10-113 ~]$ pbm status
Cluster:
========
shard2ReplSet:
  - shard2ReplSet/pmgo-pl204.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard2ReplSet/pmgo-pl202.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard2ReplSet/pmgo-pl205.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK
shard1ReplSet:
  - shard1ReplSet/pmgo-pl203.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard1ReplSet/pmgo-pl201.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard1ReplSet/pmgo-pl202.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK
configReplSet:
  - configReplSet/pmgo-pl204.int.compumark.com:27019 [S]: pbm-agent v2.0.5 OK
  - configReplSet/pmgo-pl202.int.compumark.com:27019 [P]: pbm-agent v2.0.5 OK
  - configReplSet/pmgo-pl206.int.compumark.com:27029 [S]: pbm-agent v2.0.5 OK
shard3ReplSet:
  - shard3ReplSet/pmgo-pl206.int.compumark.com:27018 [S]: pbm-agent v2.0.5 OK
  - shard3ReplSet/pmgo-pl205.int.compumark.com:27018 [P]: pbm-agent v2.0.5 OK
  - shard3ReplSet/pmgo-pl201.int.compumark.com:27028 [S]: pbm-agent v2.0.5 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://cm-mongo-db-shared-prod-va/pbm/backup/
  Snapshots:
    2023-10-28T02:00:02Z 0.00B <logical> [ERROR: couldn't get response from all shards: convergeClusterWithTimeout: reached converge timeout] [2023-10-28T02:00:37Z]
    2023-10-14T15:33:02Z 2.20TB <logical> [restore_to_time: 2023-10-15T03:24:09Z]
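
To avoid watching the job by hand, the failed snapshot in the status output above could be detected by a script. This is a minimal sketch, assuming the snapshot lines keep the format shown here (a timestamp followed by an `[ERROR: ...]` marker); `failed_snapshots` is a hypothetical helper name, not part of PBM.

```shell
#!/bin/sh
# failed_snapshots: read `pbm status` text on stdin and print only the
# snapshot lines that carry an [ERROR: ...] marker.
# Assumption: snapshot lines start with an ISO-8601 UTC timestamp, as in
# the output pasted in this thread.
failed_snapshots() {
  # `|| true` keeps the exit status at 0 when there is nothing to report.
  grep -E '^[[:space:]]*[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:]+Z .*\[ERROR:' || true
}
```

It would be used as `pbm status | failed_snapshots`, for example from a monitoring cron entry that sends mail when the output is non-empty.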


Hi,
First of all, I would suggest upgrading PBM to the latest version, 2.3.0.
Also, when creating a physical backup, opening the backup cursor takes longer than starting a usual logical backup. Use the backup.timeouts.startingStatus parameter to raise the default timeout of 33 seconds. However, this parameter was added in 2.2.1, so you need to upgrade PBM first to use it.

pbm config --set backup.timeouts.startingStatus=120

Thanks,
Mukesh
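
Since the thread reports that a manual rerun succeeds, the cron job could retry a failed start on its own until the upgrade and the longer timeout are in place. The following is only a sketch: `retry_cmd` is a hypothetical helper, and it assumes that `pbm backup` exits with a non-zero status when the backup fails to start, which should be verified on the installed version.

```shell
#!/bin/sh
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry_cmd() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

A cron script could then end with `retry_cmd 3 120 pbm backup`, where the attempt count and delay are arbitrary example values.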


Thanks.
I made the change on version 2.2.1, but I don’t see it reflected in the config file.

[mongod@tora-pl211 scripts]$ pbm config --set backup.timeouts.startingStatus=120
[backup.timeouts.startingStatus=120]
[mongod@tora-pl211 scripts]$ cat /etc/pbm_config.yaml
pitr:
  enabled: false
  oplogSpanMin: 0
  compression: s2
storage:
  type: s3
  s3:
    region: us-east-1
    bucket: *****************
    prefix: percona/backup/
    credentials:
      access-key-id: **************
      secret-access-key: *******************
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
    retryer:
      numMaxRetries: 10
      minRetryDelay: 60
      maxRetryDelay: 60

Hi,
Yes, pbm config --set does not edit the pbm_config.yaml file.
You can review the change in the output of pbm config, as below:

pbm config | grep backup -A6
backup:
  priority:
    mk-rs1-db1:27017: 0.5
    mk-rs1-db2:27017: 1
  timeouts:
    startingStatus: 120
  compression: s2

If you also want the change in the pbm_config.yaml file, add an entry there as below.

cat /etc/pbm_config.yaml | grep backup -A7
backup:
  priority:
    node-db1:27017: 0.5
    node-db2:27017: 1
    node-db3:27017: 1
  compression: s2
  timeouts:
    startingStatus: 120
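
To confirm from a script that the timeout is actually in effect, the value can be read back from the `pbm config` output shown above. This is a rough sketch; `starting_status` is a hypothetical helper, and it assumes the key appears on a line of its own as `startingStatus: <seconds>`.

```shell
#!/bin/sh
# starting_status: read `pbm config` YAML on stdin and print the value of
# backup.timeouts.startingStatus. Exits non-zero if the key is absent.
# Assumption: the key is unique in the output, so no YAML parser is needed.
starting_status() {
  awk '$1 == "startingStatus:" { print $2; found = 1; exit }
       END { if (!found) exit 1 }'
}
```

Usage would be `pbm config | starting_status`, which should print `120` after the change. Note that a hand-edited file probably only takes effect once it is uploaded with `pbm config --file /etc/pbm_config.yaml`; check the documentation for your version before relying on that.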

OK, thanks @Mukesh_Kumar.
I am also seeing the messages below in the logs while restoring the backups to a new cluster.

2023-11-13T14:57:27Z W [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] retryChunk got copy: context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 0s
2023-11-13T14:57:27Z I [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] session recreated, resuming download

Hi,
As this is a different issue from the one discussed in this thread, please raise it in a separate, new forum topic.

Thanks.