MongoDB backup failing

Hello Team,

My MongoDB backups have been failing for the last 10 days, and only one succeeded in that period. Can you help us understand what the issue could be, or suggest a workaround to fix it?

Details:

  1. MongoDB operator and MongoDB cluster installed on a Kubernetes cluster.
  2. Regular backups are enabled and stored in S3.

Versions: (both operator and cluster)
Helm: 1.16.2
App version: 1.16.1

Error in the backup agent:

2024-08-26T00:00:38.680+0000 mux completed successfully

2024-08-26T00:00:39.000+0000 I [backup/2024-08-26T00:00:21Z] mongodump finished, waiting for the oplog

2024-08-26T00:01:33.000+0000 I [backup/2024-08-26T00:00:21Z] dropping tmp collections

2024-08-26T00:01:33.000+0000 I [backup/2024-08-26T00:00:21Z] created chunk 2024-08-26T00:00:23 - 2024-08-26T00:01:33

2024-08-26T00:01:33.000+0000 I [backup/2024-08-26T00:00:21Z] mark RS as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1724630462`: <nil>

2024-08-26T00:01:33.000+0000 I [backup/2024-08-26T00:00:21Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1724630462`: <nil>

2024-08-26T00:01:33.000+0000 D [backup/2024-08-26T00:00:21Z] set balancer on

2024-08-26T00:01:33.000+0000 E [backup/2024-08-26T00:00:21Z] backup: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1724630462

Logs when the backup is successful:

2024-08-25T00:00:38.760+0000 mux completed successfully

2024-08-25T00:00:38.000+0000 I [backup/2024-08-25T00:00:21Z] mongodump finished, waiting for the oplog

2024-08-25T00:01:26.000+0000 I [backup/2024-08-25T00:00:21Z] created chunk 2024-08-25T00:00:23 - 2024-08-25T00:01:25

2024-08-25T00:01:32.000+0000 I [backup/2024-08-25T00:00:21Z] dropping tmp collections

2024-08-25T00:01:34.000+0000 D [backup/2024-08-25T00:00:21Z] set balancer on

2024-08-25T00:01:34.000+0000 I [backup/2024-08-25T00:00:21Z] backup finished

2024-08-25T00:01:34.000+0000 D [backup/2024-08-25T00:00:21Z] releasing lock

Just to add more info: in my S3 buckets I can still see backups being created, yet the backup is still marked as an error. Also, I cannot confirm whether those backups are complete.

It seems your cluster is unhealthy, or at least the backup agent cannot get a response from rs0. I suggest doing a rolling restart of the pods and then retrying the backup.
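
One quick check before restarting anything: the `last beat ts` value in the error looks like a Unix timestamp in seconds (that is an assumption, the log does not say so explicitly). If it is, you can compare it with the time the error was logged to see how stale the last heartbeat from rs0 was. A minimal sketch, using only the two values from the failing log above:

```python
from datetime import datetime, timezone

# Values copied from the failing backup log in the question.
# Assumption: "last beat ts" is a Unix timestamp in seconds (UTC).
last_beat_ts = 1724630462
error_logged = "2024-08-26T00:01:33.000+0000"  # timestamp on the error line

# Convert both values to timezone-aware datetimes so they can be subtracted.
last_beat = datetime.fromtimestamp(last_beat_ts, tz=timezone.utc)
logged_at = datetime.strptime(error_logged, "%Y-%m-%dT%H:%M:%S.%f%z")

# How old was the last heartbeat when the backup was marked as failed?
gap_seconds = (logged_at - last_beat).total_seconds()

print(last_beat.isoformat())  # 2024-08-26T00:01:02+00:00
print(gap_seconds)            # 31.0
```

So the last heartbeat recorded for rs0 was about 31 seconds old when the backup was marked as an error, even though the dump and the oplog chunk had already been written. That would fit with the agent on rs0 being slow or briefly unreachable rather than the data dump itself failing, but treat that as a hint to investigate, not a confirmed diagnosis.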