Error while trying backup: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts:

Description:

I have raised a mongodb cluster using the operator.
backup setup looks like this:

   backup:
     enabled: true
     image: perconalab/percona-server-mongodb-operator:main-backup
     pitr:
       enabled: false
     tasks:
     - name: "daily-night-backup"
       enabled: true
       schedule: "0 16 * * *"
       keep: 14
       type: logical
       storageName: minio
       compressionType: none

Whether the backup starts on schedule or I run it manually, it fails at the moment the backup is created with the following error:

check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133
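
I trigger the manual runs with a PerconaServerMongoDBBackup resource, roughly like the one below (the cluster and storage names are from my setup; depending on the operator version the cluster is referenced as psmdbCluster or clusterName):

    apiVersion: psmdb.percona.com/v1
    kind: PerconaServerMongoDBBackup
    metadata:
      name: manual-backup-1
    spec:
      psmdbCluster: mongodb
      storageName: minio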

Version:

percona-server-mongodb-operator: 1.12.0
percona-server-mongodb: 5.0.7-6
backup-agent: 2.2.1

Logs:

pbm status:

Cluster:
========
rs0:
  - rs0/mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017 [P]: pbm-agent v2.2.1 OK
  - rs0/mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
  - rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
cfg:
  - cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 [P]: pbm-agent v2.2.1 OK
  - cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
  - cfg/mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://https://s3.us-west-004.backblazeb2.com/mongo-data-test
  Snapshots:
    2023-08-10T08:28:47Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133] [2023-08-10T08:29:24Z]
    2023-08-09T16:00:33Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691596839] [2023-08-09T16:01:10Z]
    2023-08-08T16:00:24Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691510430] [2023-08-08T16:01:01Z]

But every time, at the moment of the backup, the node performing it crashes with the error above.

pbm logs -t 1000 -s D --event=backup/2023-08-10T08:28:47Z

2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] init backup meta
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination list for rs0: [[mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017]]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination list for cfg: [[mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017]]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination cfg, set candidates [mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination rs0, set candidates [mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017]
2023-08-10T08:28:48Z I [rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup started
2023-08-10T08:28:48Z D [rs0/mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup started
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] waiting for balancer off
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [rs0/mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] balancer status: off
2023-08-10T08:28:51Z D [rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] wait for tmp users {1691656131 10}
2023-08-10T08:28:52Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] wait for tmp users {1691656132 8}
2023-08-10T08:28:52Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] bcp nomination: rs0 won by mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017
2023-08-10T08:28:57Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mongodump finished, waiting for the oplog
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] dropping tmp collections
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mark RS as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133`: <nil>
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133`: <nil>
2023-08-10T08:29:24Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] set balancer on
2023-08-10T08:29:24Z E [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133
2023-08-10T08:29:24Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] releasing lock

Hello, we are having exactly the same problem! Did you manage to solve it?

Hey @yevhenii.huzii, @fabien.hannecart,

I have a feeling this could be caused by memory resources. Could you check the limits set on the backup-agent sidecar? And could you check memory consumption during the backup?
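
Something like this should show both (the pod name is just a placeholder; kubectl top needs metrics-server in the cluster):

    # resources declared on the backup-agent sidecar
    kubectl get pod <your-rs0-pod> -o jsonpath='{.spec.containers[?(@.name=="backup-agent")].resources}'

    # live per-container usage while the backup is running
    kubectl top pod <your-rs0-pod> --containers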

Hello @Ege_Gunes,

Thanks for your reply.

For pod rs0-2, which is the pod that crashes when the backup runs, I can see the limits for the entire pod, but the backup-agent container itself reports no resources:

    kubectl get pod psmdb-db-staging-prep-rs0-2 -o jsonpath='{.spec.containers[?(@.name=="backup-agent")].resources}'
    {}

For the pod as a whole:

    Limits:
      cpu:     3500m
      memory:  8Gi
    Requests:
      cpu:      2
      memory:   8Gi

But I do indeed see the pod getting OOMKilled, on a node with 4 CPU and 12 GB of RAM!
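
I spotted it in the container's last termination state, with something like:

    kubectl describe pod psmdb-db-staging-prep-rs0-2 | grep -A3 'Last State'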

I changed the RAM limit for the rs0 pods from 8 GB to 9 GB, but I have to keep some headroom for the system pods and for the cfg-0 pod, which has a 2 GB RAM limit on the same node. I will keep you informed after the next logical backup! (Note that physical backups complete without this error.)

OK! I understood my mistake :sweat_smile: , I had not set any resource limits under .Values.backup.resources in the psmdb-db chart.
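
For anyone else hitting this, the fix was roughly the following in the psmdb-db chart values (the numbers here are only an example, size them for your own data set):

    backup:
      enabled: true
      resources:
        requests:
          cpu: 300m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi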

Thanks very much for putting me on the right track @Ege_Gunes :pray:
