Seeing the same issue. Backups worked on 1.3.2 (as recently as this morning). I upgraded to 1.4.1 today, and now they are failing.
The last successful backup using 1.3.2 is 2021-03-25T00:22:00Z.
Any backup after upgrading to 1.4.1 fails with the same error:
ERROR: check cluster for dump done: convergeCluster: lost shard configrs
I didn’t change anything in the cluster configuration in the meantime. The only difference is the pbm version.
Command: pbm backup
Starting backup '2021-03-25T21:18:02Z'.....................
Backup '2021-03-25T21:18:02Z' to remote store 's3://app-mongodb' has started
Command: pbm status
(Right afterwards)
Cluster:
========
configrs:
- configrs/database_config-01:27017: pbm-agent v1.4.1 OK
- configrs/database_config-02:27017: pbm-agent v1.4.1 OK
- configrs/database_config-03:27017: pbm-agent v1.4.1 OK
shard1rs:
- shard1rs/database_shard-01-01:27017: pbm-agent v1.4.1 OK
- shard1rs/database_shard-01-02:27017: pbm-agent v1.4.1 OK
- shard1rs/database_shard-01-03:27017: pbm-agent v1.4.1 OK
PITR incremental backup:
========================
Status [ON]
Currently running:
==================
(none)
Backups:
========
S3 eu-west-1 app-mongodb
Snapshots:
2021-03-25T21:18:02Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099] [2021-03-25T21:18:50]
2021-03-25T21:07:39Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616706477] [2021-03-25T21:08:28]
2021-03-25T20:56:04Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616705781] [2021-03-25T20:56:52]
2021-03-25T20:50:59Z 0.00B [ERROR: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616705476] [2021-03-25T20:51:47]
2021-03-25T00:22:00Z 4.85GB [complete: 2021-03-25T01:04:34]
2021-03-24T00:54:28Z 4.85GB [complete: 2021-03-24T02:54:16]
2021-03-23T00:53:44Z 4.85GB [complete: 2021-03-23T02:58:23]
[... truncated for brevity; daily backups go back more than 120 days]
PITR chunks:
2021-03-25T01:04:34 - 2021-03-25T21:07:59 2.70MB
2021-03-24T02:54:16 - 2021-03-25T00:22:11 2.87MB
2021-03-23T02:58:23 - 2021-03-24T00:54:44 2.85MB
[... truncated for brevity; PITR chunks go back more than 120 days]
Command: pbm logs
2021-03-25T21:18:19Z I [shard1rs/database_shard-01-02:27017] [backup/2021-03-25T21:18:02Z] backup started
2021-03-25T21:18:19Z I [configrs/database_config-02:27017] [pitr] got wake_up signal
2021-03-25T21:18:20Z I [shard1rs/database_shard-01-03:27017] [pitr] got wake_up signal
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-02:27017] [backup/2021-03-25T21:18:02Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [configrs/database_config-03:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:23Z I [shard1rs/database_shard-01-02:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:24Z I [configrs/database_config-02:27017] [pitr] s3.uploadPartSize is set to 10485760 (~10Mb)
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [pitr] created chunk 2021-03-25T21:08:01 - 2021-03-25T21:18:23
2021-03-25T21:18:24Z E [configrs/database_config-02:27017] [pitr] streaming oplog: unable to save chunk meta {configrs pbmPitr/configrs/20210325/20210325210801-2.20210325211823-6.oplog.snappy s2 {1616706481 2} {1616707103 6} 0}: multiple write errors: [{write errors: [{E11000 duplicate key error collection: admin.pbmPITRChunks index: rs_1_start_ts_1_end_ts_1 dup key: { rs: "configrs", start_ts: Timestamp(1616706481, 2), end_ts: Timestamp(1616707103, 6) }}]}, {<nil>}]
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [pitr] pausing/stopping with last_ts 2021-03-25 21:18:23 +0000 UTC
2021-03-25T21:18:24Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] mongodump finished, waiting for the oplog
2021-03-25T21:18:26Z I [shard1rs/database_shard-01-02:27017] [pitr] created chunk 2021-03-25T21:07:59 - 2021-03-25T21:18:14
2021-03-25T21:18:26Z I [shard1rs/database_shard-01-02:27017] [pitr] pausing/stopping with last_ts 2021-03-25 21:18:14 +0000 UTC
2021-03-25T21:18:26Z E [shard1rs/database_shard-01-03:27017] [pitr] streaming oplog: unable to save chunk meta {shard1rs pbmPitr/shard1rs/20210325/20210325210759-2.20210325211814-51.oplog.snappy s2 {1616706479 2} {1616707094 51} 0}: multiple write errors: [{write errors: [{E11000 duplicate key error collection: admin.pbmPITRChunks index: rs_1_start_ts_1_end_ts_1 dup key: { rs: "shard1rs", start_ts: Timestamp(1616706479, 2), end_ts: Timestamp(1616707094, 51) }}]}, {<nil>}]
2021-03-25T21:18:49Z I [configrs/database_config-01:27017] [pitr] got wake_up signal
2021-03-25T21:18:50Z I [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099`: <nil>
2021-03-25T21:18:50Z E [configrs/database_config-03:27017] [backup/2021-03-25T21:18:02Z] backup: check cluster for dump done: convergeCluster: lost shard configrs, last beat ts: 1616707099
2021-03-25T21:18:53Z I [configrs/database_config-03:27017] [pitr] streaming started from 2021-03-25 21:18:23 +0000 UTC / 1616707103
This was my fourth attempt. All four failed with the same error, but the duplicate key error appeared only in the last one.
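In case it helps with debugging: the "last beat ts" values in the errors look like Unix epoch seconds (the log line "streaming started from 2021-03-25 21:18:23 +0000 UTC / 1616707103" pairs an epoch value with its UTC time). Below is a minimal Python sketch that converts them and measures how long after the last beat each backup was marked as failed. The helper names are made up for illustration, and it assumes the bracketed failure times in the pbm status output are UTC, which the output itself does not state.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> str:
    """Format Unix epoch seconds as an ISO 8601 UTC timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def seconds_after_beat(beat_ts: int, failed_at: str) -> int:
    """Seconds between the last beat and the time the backup was marked failed.

    Assumption: failed_at (copied from pbm status) is in UTC.
    """
    failed = datetime.strptime(failed_at, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return int(failed.timestamp()) - beat_ts


# (last beat ts, failure time) pairs copied from the pbm status output above.
failures = [
    (1616707099, "2021-03-25T21:18:50"),
    (1616706477, "2021-03-25T21:08:28"),
    (1616705781, "2021-03-25T20:56:52"),
    (1616705476, "2021-03-25T20:51:47"),
]

for beat_ts, failed_at in failures:
    gap = seconds_after_beat(beat_ts, failed_at)
    print(f"last beat {epoch_to_utc(beat_ts)} -> failed {failed_at} (+{gap}s)")
```

Under that assumption, the gap between the last beat and the failure is the same for all four attempts, so the failure looks timing-related rather than random. What that interval corresponds to inside pbm is not something I can tell from the output alone.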