Error while trying backup: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts:

Description:

I have raised a mongodb cluster using the operator.
backup setup looks like this:

   backup:
     enabled: true
     image: perconalab/percona-server-mongodb-operator:main-backup
     pitr:
       enabled: false
     tasks:
     - name: "daily-night-backup"
       enabled: true
       schedule: "0 16 * * *"
       keep: 14
       type: logical
       storageName: minio
       compressionType: none

Whether the backup starts on schedule or I run it manually, it fails at the moment the backup is created with the following error:

check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133
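
I trigger the manual runs with a PerconaServerMongoDBBackup resource, roughly like the one below (the cluster and storage names are from my setup; depending on the operator version the cluster is referenced as psmdbCluster or clusterName):

    apiVersion: psmdb.percona.com/v1
    kind: PerconaServerMongoDBBackup
    metadata:
      name: manual-backup-1
    spec:
      psmdbCluster: mongodb
      storageName: minio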

Version:

percona-server-mongodb-operator: 1.12.0
percona-server-mongodb: 5.0.7-6
backup-agent: 2.2.1

Logs:

pbm status:

Cluster:
========
rs0:
  - rs0/mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017 [P]: pbm-agent v2.2.1 OK
  - rs0/mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
  - rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
cfg:
  - cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 [P]: pbm-agent v2.2.1 OK
  - cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK
  - cfg/mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017 [S]: pbm-agent v2.2.1 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://https://s3.us-west-004.backblazeb2.com/mongo-data-test
  Snapshots:
    2023-08-10T08:28:47Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133] [2023-08-10T08:29:24Z]
    2023-08-09T16:00:33Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691596839] [2023-08-09T16:01:10Z]
    2023-08-08T16:00:24Z 10.93KB <logical> [ERROR: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691510430] [2023-08-08T16:01:01Z]

But every time, at the moment of the backup, the node performing it crashes with the error above.

pbm logs -t 1000 -s D --event=backup/2023-08-10T08:28:47Z

2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] init backup meta
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination list for rs0: [[mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017]]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination list for cfg: [[mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017]]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination cfg, set candidates [mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017 mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017]
2023-08-10T08:28:47Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] nomination rs0, set candidates [mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017 mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017]
2023-08-10T08:28:48Z I [rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup started
2023-08-10T08:28:48Z D [rs0/mongodb-rs0-1.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup started
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] waiting for balancer off
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-2.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [rs0/mongodb-rs0-2.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] skip after nomination, probably started by another node
2023-08-10T08:28:48Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] balancer status: off
2023-08-10T08:28:51Z D [rs0/mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] wait for tmp users {1691656131 10}
2023-08-10T08:28:52Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] wait for tmp users {1691656132 8}
2023-08-10T08:28:52Z D [cfg/mongodb-cfg-1.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] bcp nomination: rs0 won by mongodb-rs0-0.mongodb-rs0.infra.svc.k8s.us:27017
2023-08-10T08:28:57Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mongodump finished, waiting for the oplog
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] dropping tmp collections
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mark RS as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133`: <nil>
2023-08-10T08:29:24Z I [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] mark backup as error `check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133`: <nil>
2023-08-10T08:29:24Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] set balancer on
2023-08-10T08:29:24Z E [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] backup: check cluster for dump done: convergeCluster: lost shard rs0, last beat ts: 1691656133
2023-08-10T08:29:24Z D [cfg/mongodb-cfg-0.mongodb-cfg.infra.svc.k8s.us:27017] [backup/2023-08-10T08:28:47Z] releasing lock

Hello, we are having exactly the same problem! Did you manage to solve it?

Hey @yevhenii.huzii, @fabien.hannecart,

I have a feeling this could be caused by memory resources. Could you check the limits set on the backup-agent sidecar? And could you check memory consumption during the backup?
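
Something like this should show both (the pod name is just a placeholder; kubectl top needs metrics-server in the cluster):

    # resources declared on the backup-agent sidecar
    kubectl get pod <your-rs0-pod> -o jsonpath='{.spec.containers[?(@.name=="backup-agent")].resources}'

    # live per-container usage while the backup is running
    kubectl top pod <your-rs0-pod> --containers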

Hello @Ege_Gunes,

Thanks for your reply.

For pod rs0-2, which is the pod that crashes when the backup runs, I can see the limits for the entire pod, but the backup-agent container itself reports no resources:

    kubectl get pod psmdb-db-staging-prep-rs0-2 -o jsonpath='{.spec.containers[?(@.name=="backup-agent")].resources}'
    {}

For the pod as a whole:

    Limits:
      cpu:     3500m
      memory:  8Gi
    Requests:
      cpu:      2
      memory:   8Gi

But I do indeed see the pod getting OOMKilled, on a node with 4 CPU and 12 GB of RAM!
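
I spotted it in the container's last termination state, with something like:

    kubectl describe pod psmdb-db-staging-prep-rs0-2 | grep -A3 'Last State'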

I changed the RAM limit for the rs0 pods from 8 GB to 9 GB, but I have to keep some headroom for the system pods and for the cfg-0 pod, which has a 2 GB RAM limit on the same node. I will keep you informed after the next logical backup! (Note that physical backups complete without this error.)

OK! I understood my mistake :sweat_smile: , I had not set any resource limits under .Values.backup.resources in the psmdb-db chart.
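
For anyone else hitting this, the fix was roughly the following in the psmdb-db chart values (the numbers here are only an example, size them for your own data set):

    backup:
      enabled: true
      resources:
        requests:
          cpu: 300m
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 2Gi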

Thanks very much for putting me on the right track @Ege_Gunes :pray:
