Backup failing after update to 2.0.3

Backups are failing after updating to 2.0.3. They were succeeding on the earlier version, 1.8.1.

[mongod@ip-10-80-10-113 ~]$ pbm status
Cluster:

shard3ReplSet:

  • shard3ReplSet/pmgo-pl206.int.compumark.com:27018 [P]: pbm-agent v2.0.3 OK
  • shard3ReplSet/pmgo-pl205.int.compumark.com:27018 [S]: pbm-agent v2.0.3 OK
  • shard3ReplSet/pmgo-pl201.int.compumark.com:27028 [S]: pbm-agent v2.0.3 OK

configReplSet:

  • configReplSet/pmgo-pl204.int.compumark.com:27019 [P]: pbm-agent v2.0.3 OK
  • configReplSet/pmgo-pl202.int.compumark.com:27019 [S]: pbm-agent v2.0.3 OK
  • configReplSet/pmgo-pl206.int.compumark.com:27029 [S]: pbm-agent v2.0.3 OK

shard1ReplSet:

  • shard1ReplSet/pmgo-pl203.int.compumark.com:27018 [P]: pbm-agent v2.0.3 OK
  • shard1ReplSet/pmgo-pl201.int.compumark.com:27018 [S]: pbm-agent v2.0.3 OK
  • shard1ReplSet/pmgo-pl202.int.compumark.com:27028 [S]: pbm-agent v2.0.3 OK

shard2ReplSet:

  • shard2ReplSet/pmgo-pl204.int.compumark.com:27018 [P]: pbm-agent v2.0.3 OK
  • shard2ReplSet/pmgo-pl202.int.compumark.com:27018 [S]: pbm-agent v2.0.3 OK
  • shard2ReplSet/pmgo-pl205.int.compumark.com:27028 [S]: pbm-agent v2.0.3 OK

PITR incremental backup:

Status [OFF]

Currently running:

Snapshot backup "2023-01-20T01:02:03Z", started at 2023-01-20T01:02:03Z. Status: error. [op id: 63c9e80b530819b66cc42659]

Backups:

S3 us-east-1 s3://cm-mongo-db-shared-prod-va/pbm/backup/
Snapshots:
2023-01-20T01:02:03Z 15.16KB [ERROR: check cluster for dump done: convergeCluster: lost shard shard2ReplSet, last beat ts: 1674179663] [2023-01-20T01:54:57Z]
2023-01-19T02:18:10Z 3.06MB [ERROR: check cluster for dump done: convergeCluster: lost shard shard1ReplSet, last beat ts: 1674105416] [2023-01-19T05:17:27Z]
2023-01-10T05:13:31Z 2.22TB [restore_to_time: 2023-01-10T16:53:31Z]
2023-01-09T06:34:38Z 62.12KB <logical, selective> [restore_to_time: 2023-01-09T06:34:44Z]
2023-01-06T13:04:46Z 14.88GB <logical, selective> [restore_to_time: 2023-01-06T13:23:57Z]
PITR chunks [40.22GB]:
2023-01-10T16:53:32Z - 2023-01-15T06:03:03Z


pbm logs:

2023-01-20T01:02:04Z I [shard2ReplSet/pmgo-pl205.int.compumark.com:27028] [backup/2023-01-20T01:02:03Z] backup started
2023-01-20T01:02:04Z I [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] backup started
2023-01-20T01:02:04Z I [shard1ReplSet/pmgo-pl202.int.compumark.com:27028] [backup/2023-01-20T01:02:03Z] backup started
2023-01-20T01:02:04Z I [shard3ReplSet/pmgo-pl201.int.compumark.com:27028] [backup/2023-01-20T01:02:03Z] backup started
2023-01-20T01:02:12Z I [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] mongodump finished, waiting for the oplog
2023-01-20T01:54:54Z I [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] dropping tmp collections
2023-01-20T01:54:57Z I [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] mark RS as error check cluster for dump done: convergeCluster: lost shard shard2ReplSet, last beat ts: 1674179663:
2023-01-20T01:54:57Z I [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] mark backup as error check cluster for dump done: convergeCluster: lost shard shard2ReplSet, last beat ts: 1674179663:
2023-01-20T01:54:57Z E [configReplSet/pmgo-pl206.int.compumark.com:27029] [backup/2023-01-20T01:02:03Z] backup: check cluster for dump done: convergeCluster: lost shard shard2ReplSet, last beat ts: 1674179663


Hi team, any updates? This is a showstopper for us. I would like to revert to the older version if there is no workaround.


Hi @aranjith0,

Sorry for the delay. It would be better to open a JIRA ticket next time.

There could be several different reasons. Most often it is network lag between the pbm-agent and its local mongod (or the config server primary mongod), or between the pbm-agent and the storage. A logic bug in the backup flow is also possible.

As a result, shard2ReplSet didn't send a heartbeat within the 30-second window. I also don't see a completed snapshot dump for shard1ReplSet in the logs.
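You can see the timing yourself: the `last beat ts` in the error is a Unix epoch. A quick conversion (values taken from your log lines above) shows the last heartbeat landed just past the 30-second window:

```python
from datetime import datetime, timezone

# "last beat ts: 1674179663" from the shard2ReplSet error
last_beat = datetime.fromtimestamp(1674179663, tz=timezone.utc)

# moment the backup was marked as error, from the log line
failed_at = datetime(2023, 1, 20, 1, 54, 57, tzinfo=timezone.utc)

print(last_beat.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2023-01-20T01:54:23Z
print((failed_at - last_beat).total_seconds())   # 34.0 -- just over the 30 s timeout
```

So the backup was failed 34 seconds after shard2ReplSet's last heartbeat, which is consistent with the heartbeat-timeout explanation rather than a hard agent crash.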

In v2.0 we changed the backup format. In v1.x a snapshot was stored as a single file; since v2.0 each collection is split into a separate file. This allows snapshots to be backed up (uploaded) and restored (downloaded) concurrently, with a separate connection opened for each collection. It also makes selective backup faster, since only the needed collections are downloaded instead of the whole snapshot file.
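Conceptually (this is a minimal sketch, not PBM's actual code; the collection names and `upload` helper are made up), the v2.0 approach looks like a pool of per-collection uploads rather than one long single-file stream, which is why there are more connections in flight and more sensitivity to per-connection network lag:

```python
from concurrent.futures import ThreadPoolExecutor

def upload(collection: str) -> str:
    # placeholder for "dump this collection to its own file and push it
    # to storage over its own connection"
    return f"{collection}.snapshot"

collections = ["db.users", "db.orders", "db.events"]

# each collection is uploaded concurrently, one file per collection
with ThreadPoolExecutor(max_workers=3) as pool:
    uploaded = list(pool.map(upload, collections))

print(uploaded)
```

Because each collection ends up in its own file, a selective restore can fetch just the files it needs instead of downloading the entire snapshot.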

I suggest reviewing the logs and the backup metadata (which records heartbeats for each agent) for the failed backup. It may be that the snapshot dump completed only after the heartbeat timeout, or you will see the actual reason it failed.
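Both can be pulled with the pbm CLI itself; something along these lines (exact flags may vary by build, so check `pbm help` on your 2.0.3 installation):

```shell
# Debug-level log records for the failed backup, across all agents
# (--tail 0 asks for all matching records rather than the last N)
pbm logs --tail 0 --severity D --event backup/2023-01-20T01:02:03Z

# Metadata for the failed backup, including per-replica-set status
pbm describe-backup 2023-01-20T01:02:03Z
```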