Intermittent backup failures

Hi,

I have deployed the MongoDB operator, version 1.13.0, and we are seeing intermittent failures in the backups that are stored in S3.


Some of the failures were caused by a problem with the S3 certificate, which I have since solved. Most of them, however, have no obvious cause: sometimes the backup works fine and other times it fails.

The backup agent logs do not show any other errors, and all the files are stored correctly in S3, even though the backup custom resource is marked as failed.

The only error that appears is the following:

2023-05-28T00:15:00.746+0000 done dumping open010.Request (2644976 documents)
2023-05-28T00:15:00.746+0000 dump phase III: the oplog
2023-05-28T00:15:00.746+0000 finishing dump
2023-05-28T00:15:00.746+0000 Mux close namespace open010.Request
2023-05-28T00:15:00.746+0000 Mux finish
2023-05-28T00:15:00.746+0000 mux completed successfully
2023-05-28T00:15:02.000+0000 I [backup/2023-05-28T00:00:21Z] mongodump finished, waiting for the oplog
2023-05-28T00:15:05.000+0000 I [backup/2023-05-28T00:00:21Z] dropping tmp collections
2023-05-28T00:15:08.000+0000 I [backup/2023-05-28T00:00:21Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1685232241:
2023-05-28T00:15:11.000+0000 D [backup/2023-05-28T00:00:21Z] set balancer on
2023-05-28T00:15:11.000+0000 E [backup/2023-05-28T00:00:21Z] backup: waiting for dump done: backup stuck, last beat ts: 1685232241
2023-05-28T00:15:11.000+0000 D [backup/2023-05-28T00:00:21Z] releasing lock
2023-05-28T00:15:14.000+0000 D [pitr] start_catchup
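
For anyone reading the log above: the "last beat ts" value looks like a Unix timestamp in seconds (that is an assumption on my part, not something the log states). Under that assumption, a small Python sketch like the one below converts it and shows how old the last heartbeat was when the error was logged. The helper name `heartbeat_gap` is made up for illustration.

```python
import re
from datetime import datetime, timezone

# Lines copied from the backup agent log above.
LOG_LINES = [
    "2023-05-28T00:15:08.000+0000 I [backup/2023-05-28T00:00:21Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1685232241:",
    "2023-05-28T00:15:11.000+0000 E [backup/2023-05-28T00:00:21Z] backup: waiting for dump done: backup stuck, last beat ts: 1685232241",
]

# Capture the log timestamp at the start of the line and the heartbeat value.
BEAT_RE = re.compile(r"^(\S+) .*last beat ts: (\d+)")


def heartbeat_gap(line):
    """Return (logged_at, last_beat, gap) for a 'backup stuck' line, else None.

    Assumes 'last beat ts' is Unix epoch seconds in UTC.
    """
    match = BEAT_RE.search(line)
    if match is None:
        return None
    logged_at = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
    last_beat = datetime.fromtimestamp(int(match.group(2)), tz=timezone.utc)
    return logged_at, last_beat, logged_at - last_beat


if __name__ == "__main__":
    for line in LOG_LINES:
        result = heartbeat_gap(line)
        if result is not None:
            logged_at, last_beat, gap = result
            print(f"logged at {logged_at.isoformat()}, "
                  f"last beat {last_beat.isoformat()}, gap {gap}")
```

With the lines above, the heartbeat converts to 2023-05-28T00:04:01Z, which is roughly 11 minutes before the "backup stuck" message was logged at 00:15:08.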

Any idea?

Thanks and regards.

Hello @rfaraj,

can you please show the full yaml of the object?

kubectl get psmdb-backup NAME -o yaml

Hi @Sergey_Pronin ,

Here is the output:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  creationTimestamp: "2023-05-28T00:00:00Z"
  finalizers:
  - delete-backup
  generateName: cron-siciu-bdsiciu-mo-20230528000000-
  generation: 1
  labels:
    ancestor: daily-s3-us-west
    cluster: siciu-bdsiciu-mongodb-pro
    type: cron
  name: cron-siciu-bdsiciu-mo-20230528000000-jckxs
  namespace: siciu-bdsiciu-pro-ns
  resourceVersion: "104811486"
  uid: fadfce7d-6520-424c-856f-f0be13b598a2
spec:
  clusterName: siciu-bdsiciu-mongodb-pro
  compressionLevel: 6
  compressionType: gzip
  storageName: minio
status:
  destination: "2023-05-28T00:00:21Z"
  error: starting deadline exceeded
  lastTransition: "2023-05-28T00:00:24Z"
  pbmName: "2023-05-28T00:00:21Z"
  replsetNames:
  - cfg
  - rs0
  s3:
    bucket: siciu-bdsiciu-pro
    credentialsSecret: siciu-bdsiciu-backup-s3
    endpointUrl: https://xxxxxxxx
    insecureSkipTLSVerify: true
    region: us-east-1
  start: "2023-05-28T00:00:24Z"
  state: error
  storageName: minio

Thanks and regards.

Hi @rfaraj !
I believe this is the following issue, which is still open: [K8SPSMDB-638] backup can fail with "starting deadline exceeded" even if it finishes in PBM (Percona JIRA).
We have already tried to fix this issue once, but it is still not resolved.
I'll leave a comment on the ticket with a link to this discussion, and if you wish you can "follow" the ticket in Jira to get updates.

Kind regards,
Tomislav

Percona, software that doesn’t work some of the time™

I am really starting to feel regret that I decided on using Percona. First, there were too many logs generated in v1.13, making the entire thing unusable. Now backups don’t work! PMM Server has to become root, so it can’t run on openshift…

What is this man…

Please test your software, I don’t want to be a beta tester… This is wasting my time

@Wreng991 sad to hear that!

We are continuously improving the Operator. As an open source project, it evolves in the directions the community drives it.

If you have more issues on your list, please feel free to share them. It would be great to see what else can be improved.

@rfaraj ,

As for the backup case: there seems to be a race condition that leads to a misleading error status. The backups are safe, but the status is wrong. We will address this in the next release.