Unexpected backup failure - convergeCluster lost shard?

We’ve got several other clusters (all simple 3-data-node PSS replica sets, no sharding) backing up fine, but we recently started adding backups for a couple of new clusters and are seeing errors like the following when the mongodump portion finishes:

I [backup/2021-01-05T21:27:41Z] mongodump finished, waiting for the oplog
I [backup/2021-01-05T21:27:41Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard repl-c-guild-c04, last beat ts: 1609882109`: <nil>
E [backup/2021-01-05T21:27:41Z] backup: check cluster for dump done: convergeCluster: lost shard repl-c-guild-c04, last beat ts: 1609882109
D [backup/2021-01-05T21:27:41Z] releasing lock

Any idea what could be going on here?

There were no mongod issues during the backup, and the oplog stayed caught up on all nodes throughout the dump (never more than a few seconds of delay at most).

This is with 1.4.0.
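
For what it’s worth, decoding the last beat ts as seconds since the Unix epoch (an assumption on my part - it may be the seconds component of a MongoDB cluster timestamp, which is also epoch-based) puts the last heartbeat roughly 48 seconds after the backup’s 21:27:41Z start time. A minimal sketch:

package main

import (
	"fmt"
	"time"
)

func main() {
	// "last beat ts" from the convergeCluster error above,
	// read as seconds since the Unix epoch (assumption).
	const lastBeat = 1609882109
	fmt.Println(time.Unix(lastBeat, 0).UTC().Format(time.RFC3339))
	// Prints: 2021-01-05T21:28:29Z - about 48s after the backup's start time.
}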

Interesting - on at least one of these, when I started testing for a different issue (a desire for throttling), applying that throttling allowed the initial backups to complete. I wonder if there is some sort of race condition?

Hi @nneul

Can you share the pbm status output for the cluster(s) in question and the output of pbm logs -e backup/2021-01-05T21:27:41Z (where 2021-01-05T21:27:41Z is the name of the failed backup)?

I have since gotten the backup to complete, so I can’t reproduce the failure now, but here’s the info:

Cluster:
========
repl-c-guild-c04:
  - repl-c-guild-c04/c-guild-c04-db01-s01.example.com:27017: pbm-agent v1.4.0 OK
  - repl-c-guild-c04/c-guild-c04-db02-s01.example.com:27017: pbm-agent v1.4.0 OK
  - repl-c-guild-c04/c-guild-c04-db03-s01.example.com:27017: pbm-agent v1.4.0 OK

PITR incremental backup:
========================
Status [ON]

Currently running:
==================
(none)

Backups:
========
S3 bhs https://s3.bhs.cloud.ovh.net/pbm-repl-c-guild-c04/pbm
  Snapshots:
    2021-01-07T13:00:01Z 21.66GB [complete: 2021-01-07T13:09:51]
    2021-01-07T02:17:33Z 23.29GB [complete: 2021-01-07T02:28:29]
  PITR chunks:
    2021-01-07T13:09:51 - 2021-01-07T20:14:50 578.89MB
    2021-01-07T02:28:29 - 2021-01-07T13:00:21 1.28GB

pbm logs -e produced no output, but I can get the pbm-agent logs from syslogs if that would work?

Yes, pbm-agent logs from syslogs would work as well.
But it’s strange that “pbm logs -e …” shows nothing. Could you also try “pbm logs -e backup -t 0”?

Syslogs in pastebin link above.

The only thing I’m seeing in the pbm logs -e backup -t 0 output is a recent successful backup:

root@c-guild-c05-db01-s01:~# pbm logs -e backup -t 0
2021-01-07T02:35:44Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] backup started
2021-01-07T02:35:47Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:43:31Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] mongodump finished, waiting for the oplog
2021-01-07T02:43:34Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:45:04Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] s3.uploadPartSize is set to 33554432 (~32Mb)
2021-01-07T02:45:05Z I [repl-c-guild-c05/c-guild-c05-db03-s01.example.com:27017] [backup/2021-01-07T02:35:27Z] backup finished

I see no issues in the recent logs, so I suppose you don’t have that issue anymore. In case you bump into it again, can you post the pbm status and pbm logs -e backup -t 0 output?
The logs in your initial message mean that the agent on repl-c-guild-c04 hadn’t been sending heartbeats for a while, so pbm assumed it was lost. But since it’s a non-sharded replica set, the node that produced the log and the node in question are one and the same…
It looks like the node was frozen for a while between “mongodump finished, waiting for the oplog” and the next check, so it wasn’t able to send heartbeats. And after the node unfroze, the routine that makes the check happened to run before the heartbeat routine did… I’d need to see the whole logs for the failed backup (with dates and times) from all agents, plus pbm status, to understand how it could happen.
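
To illustrate the suspected ordering (a minimal sketch, not pbm’s actual code - the staleness window, names, and scheduling here are all hypothetical): if the whole process stalls past the staleness window and the check routine happens to run before the heartbeat routine once things resume, the check reports a lost shard even though the agent is alive:

package main

import (
	"fmt"
	"time"
)

// Hypothetical staleness window; pbm's real threshold may differ.
const staleAfter = 5 * time.Second

func check(lastBeat time.Time) {
	if time.Since(lastBeat) > staleAfter {
		fmt.Printf("convergeCluster: lost shard, last beat ts: %d\n", lastBeat.Unix())
	} else {
		fmt.Println("heartbeat fresh, cluster converged")
	}
}

func main() {
	// Normal operation: the heartbeat is refreshed before each check.
	lastBeat := time.Now()
	check(lastBeat)

	// The whole node freezes (I/O stall, VM pause, ...): neither the
	// heartbeat routine nor the check routine gets to run.
	time.Sleep(staleAfter + 2*time.Second)

	// After unfreezing, suppose the checker is scheduled before the
	// heartbeat routine: it sees a stale beat and fails the backup...
	check(lastBeat)

	// ...even though a moment later the heartbeat catches up again.
	lastBeat = time.Now()
	check(lastBeat)
}

That would also fit the earlier observation that throttling helped: a slower dump plausibly shortens whatever stall was starving the heartbeat routine.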

I’m getting a similar error since upgrading to 1.4.0. Backups were fine before the upgrade but have all failed since.

@taisph Since I can’t reproduce it any more - can you possibly get @AndrewPogrebnoi the equivalent of the logs he requested from me?
