Percona-xtradb-cluster-operator cycles on deleting backup from s3

Hello,

I am very pleased with how the operator has worked so far and how backups are handled; everything went fine until a second cluster was added.

Current setup:
OpenShift 3.11
image: percona/percona-xtradb-cluster-operator:1.8.0
pxc image: percona/percona-xtradb-cluster:8.0.22-13.1
CRD (PerconaXtraDBCluster) - admin-db and customer-db

What happens is that the operator is unable to update the backup CRD, so it performs the S3 file deletion again, and after that it is still unable to update the backup CRD.
That loop never stops, until all 10 workers are busy and the operator starts skipping backup deletion.

Logs from operator:

{"level":"info","ts":1637762709.6179812,"caller":"zapr/zapr.go:69","msg":"Created a new backup job","Namespace":"dev","Name":"xb-cron-dev-admin-db--s3-ionos-20211124140509-13hs9"}
{"level":"info","ts":1637762758.1214314,"caller":"zapr/zapr.go:69","msg":"deleting backup from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}
{"level":"info","ts":1637762783.195445,"caller":"zapr/zapr.go:69","msg":"backup was removed from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}
{"level":"error","ts":1637762783.2017963,"caller":"zapr/zapr.go:128","msg":"failed to update finalizers for backup","backup":"cron-dev-admin-db--s3-ionos-20211121135515-364e4","error":"Operation cannot be fulfilled on perconaxtradbclusterbackups.pxc.percona.com \"cron-dev-admin-db--s3-ionos-20211121135515-364e4\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcbackup.(*ReconcilePerconaXtraDBClusterBackup).runS3BackupFinalizer\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcbackup/controller.go:361"}
{"level":"info","ts":1637762784.1155515,"caller":"zapr/zapr.go:69","msg":"deleting backup from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}

Before adding the second cluster in that namespace I sometimes saw such a log error about updating the backup, but it never got stuck in a loop.
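
The object the operator keeps retrying can be inspected directly; a minimal sketch, using the resource name from the error above and assuming the dev namespace:

# Show the finalizers and deletion timestamp of the backup object the
# operator keeps retrying (name taken from the error message above).
oc -n dev get perconaxtradbclusterbackups.pxc.percona.com \
  cron-dev-admin-db--s3-ionos-20211121135515-364e4 \
  -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'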

Hello @iadv ,

thank you for submitting this.
We dropped support for OpenShift 3.11 this year, as it is quite legacy. Could you please share the full steps to reproduce this issue, and we will try to do it on another Kubernetes platform?

Thanks @Sergey_Pronin
I am not sure exactly what happened, since the first cluster had been running fine for quite a while and the second cluster was added only recently.

As I said, I had previously noticed such errors, where the operator is unable to update the CRD (I presume to remove the S3 finalizer) because it does not have the latest version of the object.
But it was never an issue until the second cluster was added.

I have tried disabling the first or the second cluster, and tried removing one or the other backup, but the issue was never resolved.

The current plan is to try to upgrade (a rough helm sketch follows the list):

  1. the operator to percona/pxc-operator 1.9.1 (currently percona/pxc-operator v0.1.12, which is 1.8)
  2. pxc to percona/pxc-db 1.9.1 - 8.0.23-14.1 (currently percona/pxc-db v0.1.17 - 8.0.22-13.1)
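
A minimal sketch of that upgrade with helm, assuming the charts were installed from the Percona helm repo; the release names, namespace, and reuse of existing values are placeholders:

# Hypothetical helm upgrade sketch; release names and namespace are placeholders.
helm repo update
# Upgrade the operator chart to 1.9.1
helm upgrade pxc-operator percona/pxc-operator --version 1.9.1 -n dev --reuse-values
# Upgrade the database chart to 1.9.1 (PXC 8.0.23-14.1)
helm upgrade pxc-db percona/pxc-db --version 1.9.1 -n dev --reuse-values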

What I was looking for is some way to troubleshoot the operator and see what triggers that unnecessary repeated delete between the initial command and the successful deletion from S3.

Also, there is no way to increase the worker count, although I am not sure that it would change anything.
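
In the meantime, the loop can at least be watched by following the operator logs filtered by the backup name; a rough sketch, assuming the default operator deployment name:

# Follow the operator logs and keep only lines about the stuck backup
# (deployment name assumes a default install; adjust namespace/name as needed).
oc -n dev logs deployment/percona-xtradb-cluster-operator -f \
  | grep cron-dev-admin-db--s3-ionos-20211121135515-364e4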

Hi!

I’m having exactly the same problem with 1.12.0.

Did you solve that?

Thanks

@Thiago_Rodines is it on OpenShift 3.11 too?

@Sergey_Pronin, it is on Kubernetes.

@Thiago_Rodines got it. Do you also have two clusters?
Is there any easy way to reproduce the problem?

Hello

We did not solve it with the operator; we just created a workaround and implemented the procedures we needed.

Since the main problem was that the operator could not update the CRD, I created a cron job that just patches the CRD directly and removes that finalizer.
After that the operator is able to do the rest with that CRD and proceed with further operations.
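
Roughly, the cron job boils down to something like this sketch (the namespace, the use of jq, and clearing all finalizers at once are assumptions; the resource name comes from the error above):

#!/bin/bash
# Workaround sketch: for every backup object already marked for deletion,
# clear its finalizers so the deletion can complete. Requires jq.
NS=dev
for b in $(oc -n "$NS" get perconaxtradbclusterbackups.pxc.percona.com -o json \
    | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'); do
  oc -n "$NS" patch perconaxtradbclusterbackups.pxc.percona.com "$b" \
    --type=merge -p '{"metadata":{"finalizers":[]}}'
done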

And since I did not have a process that deleted files from S3, a lifecycle policy was set up for that bucket.
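
For the S3 side, the lifecycle rule was roughly along these lines (bucket name, retention period, and the endpoint URL are placeholders; the endpoint flag is only needed for non-AWS S3 providers such as IONOS):

# Expire backup objects automatically after 30 days (values are placeholders).
aws s3api put-bucket-lifecycle-configuration \
  --endpoint-url https://s3.example.com \
  --bucket my-pxc-backups \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-backups",
        "Status": "Enabled",
        "Filter": { "Prefix": "" },
        "Expiration": { "Days": 30 }
      }
    ]
  }'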

Yes! It worked perfectly when I had just one cluster, and now that I have 3 clusters I have problems with the backup jobs.

To reproduce it, you just need to run 2 clusters and configure backups on both.

What I did to “fix” it was to deploy a separate operator for each cluster, and now everything is OK.