Unexpected backup failure - convergeCluster lost shard?

We’ve got several other clusters (all simple 3-data-node PSS replica sets, no sharding) that back up fine, but we recently started adding backups for a couple of new clusters and are seeing errors like the following when the mongodump portion finishes:

I [backup/2021-01-05T21:27:41Z] mongodump finished, waiting for the oplog
I [backup/2021-01-05T21:27:41Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard repl-c-guild-c04, last beat ts: 1609882109`: <nil>
E [backup/2021-01-05T21:27:41Z] backup: check cluster for dump done: convergeCluster: lost shard repl-c-guild-c04, last beat ts: 1609882109
D [backup/2021-01-05T21:27:41Z] releasing lock

Any idea what could be going on here?

There were no mongod issues during the backup, and the oplog stayed caught up on all nodes during the dump (never more than a few seconds of delay at most).
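
(For reference, a rough sketch of how that lag can be eyeballed from the mongo shell; the exact helper name depends on the shell version:)

// On the primary: print how far behind each secondary is.
// Older shells name this rs.printSlaveReplicationInfo().
rs.printSecondaryReplicationInfo()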

This is with 1.4.0.

Interesting: on at least one of these, I had started testing for a different issue (a desire for throttling), and after applying that throttling it was able to complete the initial backups. I wonder if there is some sort of race condition?

Hi @Nathan_Neulinger

Can you share the pbm status output of the cluster(s) in question, and pbm logs -e backup/2021-01-05T21:27:41Z (where 2021-01-05T21:27:41Z is the name of the failed backup)?

I have since gotten the backup to complete so can’t reproduce the failure now, but here’s the info:

Cluster:
========
repl-c-guild-c04:
  - repl-c-guild-c04/c-guild-c04-db01-s01.example.com:27017: pbm-agent v1.4.0 OK
  - repl-c-guild-c04/c-guild-c04-db02-s01.example.com:27017: pbm-agent v1.4.0 OK
  - repl-c-guild-c04/c-guild-c04-db03-s01.example.com:27017: pbm-agent v1.4.0 OK

PITR incremental backup:
========================
Status [ON]

Currently running:
==================
(none)

Backups:
========
S3 bhs https://s3.bhs.cloud.ovh.net/pbm-repl-c-guild-c04/pbm
  Snapshots:
    2021-01-07T13:00:01Z 21.66GB [complete: 2021-01-07T13:09:51]
    2021-01-07T02:17:33Z 23.29GB [complete: 2021-01-07T02:28:29]
  PITR chunks:
    2021-01-07T13:09:51 - 2021-01-07T20:14:50 578.89MB
    2021-01-07T02:28:29 - 2021-01-07T13:00:21 1.28GB

pbm logs -e produced no output, but I can get the pbm-agent logs from syslogs if that would work?

https://pastebin.com/haFXpb8R

Yes, pbm-agent logs from syslogs would work as well.
But it’s strange that “pbm logs -e …” shows nothing. Could you also try “pbm logs -e backup -t 0”?

Syslogs in pastebin link above.

The only thing I’m seeing in the pbm logs -e backup -t 0 output is a recent successful backup:

root@c-guild-c05-db01-s01:~# pbm logs -e backup -t 0
2021-01-07T02:35:44Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] backup started
2021-01-07T02:35:47Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:43:31Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] mongodump finished, waiting for the oplog
2021-01-07T02:43:34Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:45:04Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:45:05Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] backup finished

I see no issues in the recent logs, so I suppose you don’t have that problem anymore. In case you bump into it again, can you post the pbm status and pbm logs -e backup -t 0 output?
The logs in your initial message mean that the agent for repl-c-guild-c04 hadn’t been sending heartbeats for a while, so pbm assumed it was lost. But since this is a non-sharded replica set, the node that produced the log and the node in question are the same…
It looks like the node was frozen for a while between “mongodump finished, waiting for the oplog” and the next check, so it wasn’t able to send heartbeats, and after the node unfroze the routine that performs the check happened to run before the heartbeat routine did… I would need to see the whole logs for the failed backup, with dates and times from all agents, plus pbm status, to understand how this could happen.
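
To illustrate what the check is comparing, here is a rough, read-only sketch from the mongo shell; the status/error field names are illustrative assumptions, only the admin.pbmBackups collection is referenced elsewhere in this thread:

// Read-only sketch: look at the failed backup’s metadata and compare the
// reported "last beat ts" against the current time. Field names (status,
// error) are assumptions and may not match the exact metadata layout.
var bcp = db.getSiblingDB("admin").pbmBackups.findOne({ name: "2021-01-05T21:27:41Z" });
printjson(bcp ? { status: bcp.status, error: bcp.error } : null);
// 1609882109 is the "last beat ts" from the error above; a gap well beyond the
// agent heartbeat interval is what makes convergeCluster treat the shard as lost.
print("seconds since last beat: " + (Math.floor(Date.now() / 1000) - 1609882109));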

I’m getting a similar error since upgrading to 1.4.0. Backups were fine before the upgrade but have all failed since.

@taisph Since I can’t reproduce it any more, can you possibly get @Andrew_Pogrebnoi the equivalent of the logs he requested from me?

Seeing the same issue. Backups worked before on 1.3.2 (just this morning). I upgraded today to 1.4.1 and now they are failing.

The last successful backup using 1.3.2 is 2021-03-25T00:22:00Z.

Any backup after upgrading to 1.4.1 fails with the same error:

ERROR: check cluster for dump done: convergeCluster: lost shard configrs

I didn’t change anything on the cluster configuration in the meantime. The only difference is the pbm version.

Command: pbm backup

Starting backup '2021-03-25T21:18:02Z'.....................
Backup '2021-03-25T21:18:02Z' to remote store 's3://app-mongodb' has started

Command: pbm status (Right afterwards)

Cluster:
========
configrs:
  - configrs/database_config-01:27017: pbm-agent v1.4.1 OK
  - configrs/database_config-02:27017: pbm-agent v1.4.1 OK
  - configrs/database_config-03:27017: pbm-agent v1.4.1 OK
shard1rs:
  - shard1rs/database_shard-01-01:27017: pbm-agent v1.4.1 OK
  - shard1rs/database_shard-01-02:27017: pbm-agent v1.4.1 OK
  - shard1rs/database_shard-01-03:27017: pbm-agent v1.4.1 OK

PITR incremental backup:
========================
Status [ON]

Currently running:
==================
(none)

Backups:
========
S3 eu-west-1 app-mongodb
  Snapshots:
    2021-03-25T21:18:02Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099] [2021-03-25T21:18:50]
    2021-03-25T21:07:39Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616706477] [2021-03-25T21:08:28]
    2021-03-25T20:56:04Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616705781] [2021-03-25T20:56:52]
    2021-03-25T20:50:59Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616705476] [2021-03-25T20:51:47]
    2021-03-25T00:22:00Z 4.85GB [complete: 2021-03-25T01:04:34]
    2021-03-24T00:54:28Z 4.85GB [complete: 2021-03-24T02:54:16]
    2021-03-23T00:53:44Z 4.85GB [complete: 2021-03-23T02:58:23]
    [... truncated for brevity – daily backups going back more than 120 days]
  PITR chunks:
    2021-03-25T01:04:34 - 2021-03-25T21:07:59 2.70MB
    2021-03-24T02:54:16 - 2021-03-25T00:22:11 2.87MB
    2021-03-23T02:58:23 - 2021-03-24T00:54:44 2.85MB
    [... truncated for brevity – PITR chunks going back more than 120 days]

Command pbm logs

2021-03-25T21:18:19Z I [shard1rs/database_shard-01-02:27017] [backup/2021-03-25T21:18:02Z] backup started
2021-03-25T21:18:19Z I [configrs/database_config-02:27017] [pitr] got wake_up signal
2021-03-25T21:18:20Z I [shard1rs/database_shard-01-03:27017] [pitr] got wake_up signal
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-02:27017] [backup/2021-03-25T21:18:02Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [configrs/database_config-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-02:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:24Z I [configrs/database_config-02:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [pitr] created chunk 2021-03-25T21:08:01 - 2021-03-25T21:18:23
2021-03-25T21:18:24Z E [configrs/database_config-02:27017] [pitr] streaming oplog: unable to save chunk meta {configrs pbmPitr/configrs/20210325/20210325210801-2.20210325211823-6.oplog.snappy s2 {1616706481 2} {1616707103 6} 0}: multiple write errors: [{write errors: [{E11000 duplicate key error collection: admin.pbmPITRChunks index: rs_1_start_ts_1_end_ts_1 dup key: { rs: "configrs", start_ts: Timestamp(1616706481, 2), end_ts: Timestamp(1616707103, 6) }}]}, {<nil>}]
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [pitr] pausing/stopping with last_ts 2021-03-25 21:18:23 +0000 UTC
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] mongodump finished, waiting for the oplog
2021-03-25T21:18:26Z I [shard1rs/database_shard-01-02:27017] [pitr] created chunk 2021-03-25T21:07:59 - 2021-03-25T21:18:14
2021-03-25T21:18:26Z I [shard1rs/database_shard-01-02:27017] [pitr] pausing/stopping with last_ts 2021-03-25 21:18:14 +0000 UTC
2021-03-25T21:18:26Z E [shard1rs/database_shard-01-03:27017] [pitr] streaming oplog: unable to save chunk meta {shard1rs pbmPitr/shard1rs/20210325/20210325210759-2.20210325211814-51.oplog.snappy s2 {1616706479 2} {1616707094 51} 0}: multiple write errors: [{write errors: [{E11000 duplicate key error collection: admin.pbmPITRChunks index: rs_1_start_ts_1_end_ts_1 dup key: { rs: "shard1rs", start_ts: Timestamp(1616706479, 2), end_ts: Timestamp(1616707094, 51) }}]}, {<nil>}]
2021-03-25T21:18:49Z I [configrs/database_config-01:27017] [pitr] got wake_up signal
2021-03-25T21:18:50Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099`: <nil>
2021-03-25T21:18:50Z E [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] backup: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099
2021-03-25T21:18:53Z I [configrs/database_config-03:27017] [pitr] streaming started from 2021-03-25 21:18:23 +0000 UTC / 1616707103

This was the fourth time I tried. All attempts failed with the same error, but the duplicate key error only showed up in the last one.

Tried again after some time, getting the same error on a different shard now:

2021-03-25T22:12:46Z I [shard1rs/database_shard-01-03:27017] [pitr] got wake_up signal
2021-03-25T22:12:47Z I [shard1rs/database_shard-01-02:27017] [pitr] got wake_up signal
2021-03-25T22:12:47Z I [shard1rs/database_shard-01-03:27017] [backup/2021-03-25T22:12:29Z] backup started
2021-03-25T22:12:47Z I [configrs/database_config-03:27017] [pitr] got wake_up signal
2021-03-25T22:12:47Z I [configrs/database_config-03:27017] [backup/2021-03-25T22:12:29Z] backup started
2021-03-25T22:12:50Z I [shard1rs/database_shard-01-03:27017] [backup/2021-03-25T22:12:29Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T22:12:50Z I [shard1rs/database_shard-01-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T22:12:51Z I [shard1rs/database_shard-01-02:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T22:12:51Z I [configrs/database_config-03:27017] [backup/2021-03-25T22:12:29Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T22:12:51Z I [configrs/database_config-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T22:12:52Z I [configrs/database_config-03:27017] [pitr] created chunk 2021-03-25T21:18:23 - 2021-03-25T22:12:51
2021-03-25T22:12:52Z I [configrs/database_config-03:27017] [pitr] pausing/stopping with last_ts 2021-03-25 22:12:51 +0000 UTC
2021-03-25T22:12:52Z I [configrs/database_config-03:27017] [backup/2021-03-25T22:12:29Z] mongodump finished, waiting for the oplog
2021-03-25T22:12:57Z I [shard1rs/database_shard-01-02:27017] [pitr] created chunk 2021-03-25T21:18:14 - 2021-03-25T22:12:46
2021-03-25T22:12:57Z E [shard1rs/database_shard-01-03:27017] [pitr] streaming oplog: unable to save chunk meta {shard1rs pbmPitr/shard1rs/20210325/20210325211814-51.20210325221246-4.oplog.snappy s2 {1616707094 51} {1616710366 4} 0}: multiple write errors: [{write errors: [{E11000 duplicate key error collection: admin.pbmPITRChunks index: rs_1_start_ts_1_end_ts_1 dup key: { rs: "shard1rs", start_ts: Timestamp(1616707094, 51), end_ts: Timestamp(1616710366, 4) }}]}, {<nil>}]
2021-03-25T22:12:57Z I [shard1rs/database_shard-01-02:27017] [pitr] pausing/stopping with last_ts 2021-03-25 22:12:46 +0000 UTC
2021-03-25T22:13:16Z I [configrs/database_config-01:27017] [pitr] got wake_up signal
2021-03-25T22:13:16Z I [shard1rs/database_shard-01-01:27017] [pitr] got wake_up signal
2021-03-25T22:13:17Z I [configrs/database_config-03:27017] [backup/2021-03-25T22:12:29Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard shard1rs, last beat ts: 1616710366`: <nil>
2021-03-25T22:13:18Z E [configrs/database_config-03:27017] [backup/2021-03-25T22:12:29Z] backup: check cluster for dump done: convergeCluster: lost shard shard1rs, last beat ts: 1616710366

I wonder if the duplicate key error is the actual culprit. But I don’t want to touch the admin.pbmPITRChunks collection manually without getting some indication from the Percona team on this issue.
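
In the meantime, here is a read-only way to look at what is already stored around the conflicting range, without modifying anything (the collection and index names come straight from the error above):

// Read-only: list the indexes on the PITR chunk collection and the chunk
// document(s) matching the start_ts from the duplicate key error for shard1rs.
db.getSiblingDB("admin").pbmPITRChunks.getIndexes()
db.getSiblingDB("admin").pbmPITRChunks.find(
  { rs: "shard1rs", start_ts: Timestamp(1616707094, 51) }
).pretty()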

Ok, for anybody with a similar issue.

I stopped PITR (Point in time recovery) via pbm config --set pitr.enabled=false

Then I tried another backup and it worked.

After enabling PITR again with pbm config --set pitr.enabled=true and running yet another backup, it worked as well, this time with PITR enabled.

Maybe there is some bug that creates a conflict when upgrading from previous pbm versions with PITR enabled. Anyway, this issue is now resolved for me.

Hi @jaschaio, sorry for the late answer.

These two errors shouldn’t be directly related. [pitr] streaming oplog: unable to save chunk meta looks like you have two pitr slicers running on two nodes of the same replica set at the same time, so when they try to save the same chunk, one of them fails because of the unique index conflict. That is very odd and shouldn’t happen (I mean two slicers for the same replica set). The backup: check cluster for dump done: convergeCluster: error means that the backup routine on one of the replica sets failed to make progress and hadn’t been sending heartbeats for a while (which could be due to a node restart, a network split, etc.).
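
If it shows up again, one quick read-only check is to see which node currently holds the pitr lock for each replica set. This is only a rough sketch; the admin.pbmLock collection name and its fields are assumptions here and may differ between pbm versions:

// Read-only sketch (collection/field names may vary by pbm version):
// list current pbm operation locks and which node/replica set holds them.
db.getSiblingDB("admin").pbmLock.find().pretty()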

It’s been a long time. Is it still happening? Have you tried 1.5.0, and if so, is the issue still there? If it is, can you please send more info:

  1. Metadata of any failed backup. It is in the admin.pbmBackups collection on the config server. db.getSiblingDB("admin").pbmBackups.findOne({name: "2021-03-25T21:18:02Z"})
  2. More logs for the backup in question: pbm logs -t 0 -s D -e 2021-03-25T21:18:02Z
  3. And more logs in general (a reasonable chunk of them), for example pbm logs -t 1000 -s D

Cheers!
