Intermittent backup failures

rfaraj · May 29, 2023, 10:27am

Hi,

I have deployed mongo operator version 1.13.0, and we are having intermittent failures in the backups stored in S3.

In some cases the backup failed because of a problem with the S3 certificate, which I have since solved. In most cases, though, the same backup works fine some of the time and fails at other times.

The backup agent logs do not show any errors and all the files are stored correctly in S3 even though the CRD marks it as failed.

The only error that appears is the following:

2023-05-28T00:15:00.746+0000 done dumping open010.Request (2644976 documents)
2023-05-28T00:15:00.746+0000 dump phase III: the oplog
2023-05-28T00:15:00.746+0000 finishing dump
2023-05-28T00:15:00.746+0000 Mux close namespace open010.Request
2023-05-28T00:15:00.746+0000 Mux finish
2023-05-28T00:15:00.746+0000 mux completed successfully
2023-05-28T00:15:02.000+0000 I [backup/2023-05-28T00:00:21Z] mongodump finished, waiting for the oplog
2023-05-28T00:15:05.000+0000 I [backup/2023-05-28T00:00:21Z] dropping tmp collections
2023-05-28T00:15:08.000+0000 I [backup/2023-05-28T00:00:21Z] mark RS as error waiting for dump done: backup stuck, last beat ts: 1685232241:
2023-05-28T00:15:11.000+0000 D [backup/2023-05-28T00:00:21Z] set balancer on
2023-05-28T00:15:11.000+0000 E [backup/2023-05-28T00:00:21Z] backup: waiting for dump done: backup stuck, last beat ts: 1685232241
2023-05-28T00:15:11.000+0000 D [backup/2023-05-28T00:00:21Z] releasing lock
2023-05-28T00:15:14.000+0000 D [pitr] start_catchup

Any idea?

Thanks and regards.
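For readers following the log above: the `last beat ts` value is easier to reason about as a wall-clock time. The sketch below assumes it is a Unix timestamp in seconds (the thread does not state this explicitly) and compares it with the time at which the "mark RS as error" line was logged.

```python
from datetime import datetime, timezone

# Values copied from the log lines above.
# Assumption: "last beat ts" is Unix epoch seconds.
last_beat_ts = 1685232241
error_logged_at = datetime(2023, 5, 28, 0, 15, 8, tzinfo=timezone.utc)

last_beat = datetime.fromtimestamp(last_beat_ts, tz=timezone.utc)
gap = error_logged_at - last_beat

print(last_beat.isoformat())  # 2023-05-28T00:04:01+00:00
print(gap.total_seconds())    # 667.0
```

Under that assumption, the last recorded heartbeat is about 11 minutes older than the error line, even though the log shows mongodump still making progress until 00:15:00.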

Sergey_Pronin · May 29, 2023, 12:57pm

Hello @rfaraj,

can you please show the full yaml of the object?

kubectl get psmdb-backup NAME -o yaml

rfaraj · May 29, 2023, 2:24pm

Hi @Sergey_Pronin ,

Here is the output:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  creationTimestamp: "2023-05-28T00:00:00Z"
  finalizers:
  - delete-backup
  generateName: cron-siciu-bdsiciu-mo-20230528000000-
  generation: 1
  labels:
    ancestor: daily-s3-us-west
    cluster: siciu-bdsiciu-mongodb-pro
    type: cron
  name: cron-siciu-bdsiciu-mo-20230528000000-jckxs
  namespace: siciu-bdsiciu-pro-ns
  resourceVersion: "104811486"
  uid: fadfce7d-6520-424c-856f-f0be13b598a2
spec:
  clusterName: siciu-bdsiciu-mongodb-pro
  compressionLevel: 6
  compressionType: gzip
  storageName: minio
status:
  destination: "2023-05-28T00:00:21Z"
  error: starting deadline exceeded
  lastTransition: "2023-05-28T00:00:24Z"
  pbmName: "2023-05-28T00:00:21Z"
  replsetNames:
  - cfg
  - rs0
  s3:
    bucket: siciu-bdsiciu-pro
    credentialsSecret: siciu-bdsiciu-backup-s3
    endpointUrl: https://xxxxxxxx
    insecureSkipTLSVerify: true
    region: us-east-1
  start: "2023-05-28T00:00:24Z"
  state: error
  storageName: minio

Thanks and regards.

Tomislav_Plavcic · May 31, 2023, 6:38am

Hi @rfaraj !
I believe it is this issue (still open): [K8SPSMDB-638] backup can fail with "starting deadline exceeded" even if it finishes in PBM - Percona JIRA
We already made one attempt at fixing this issue, but it is still not fixed.
I’ll leave a comment with a link to this discussion and if you wish you can “follow” the ticket in Jira to get updates.

Kind regards,
Tomislav

Wreng991 · June 1, 2023, 1:52pm

Percona, software that doesn’t work some of the time™

I am really starting to feel regret that I decided on using Percona. First, there were too many logs generated in v1.13, making the entire thing unusable. Now backups don’t work! PMM Server has to become root, so it can’t run on openshift…

What is this man…

Please test your software, I don’t want to be a beta tester… This is wasting my time

Sergey_Pronin · June 5, 2023, 4:24am

@Wreng991 sad to hear that!

We are continuously improving the Operator. As an open source project, it evolves in the directions the community drives it.

If you have more issues on your list, please feel free to share them; it would be great to see what else can be improved.

@rfaraj ,

as for the backup case, there seems to be a race condition that leads to a misleading error status. The backups are safe, but the status is wrong. We will address this in the next release.
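Until a fix lands, one way to flag backup objects that may be affected is to look for the combination seen in the status posted earlier in this thread: `state: error` with `error: starting deadline exceeded`, while `pbmName` is already populated. The function below is only a heuristic sketch; `looks_like_false_failure` is a hypothetical helper, not operator or PBM logic, and the field names are simply those from the YAML above.

```python
def looks_like_false_failure(status: dict) -> bool:
    """Heuristic only (not operator logic): True when the backup object
    reports 'starting deadline exceeded' even though a PBM backup name
    was already recorded for it."""
    return (
        status.get("state") == "error"
        and status.get("error") == "starting deadline exceeded"
        and bool(status.get("pbmName"))
    )


# Status fields copied from the object posted earlier in the thread.
status = {
    "state": "error",
    "error": "starting deadline exceeded",
    "pbmName": "2023-05-28T00:00:21Z",
    "start": "2023-05-28T00:00:24Z",
}

print(looks_like_false_failure(status))  # True
```

A `True` result only marks the backup as worth checking directly in PBM and in the bucket; it does not prove the backup is usable.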

qonalex · June 11, 2024, 7:21pm

@Sergey_Pronin Hello!
We just got the same issue.
How can we safely delete old backups from S3 manually (without breaking the most recent backups)?
Which folders/files are safe to delete?
