I am very pleased with how the operator has worked so far, especially how backups are handled, and everything went fine until a second cluster was added.
Current setup:
OpenShift 3.11
image: 'percona/percona-xtradb-cluster-operator:1.8.0'
pxc: image: 'percona/percona-xtradb-cluster:8.0.22-13.1'
Two PerconaXtraDBCluster custom resources: admin-db and customer-db
What happens is that the operator is unable to update the backup custom resource, so it performs the S3 file deletion again, and after that it is still unable to update the resource.
That loop never stops, until all 10 workers are busy and the operator starts skipping backup deletion.
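For context, this is roughly how the stuck backup object can be inspected (the namespace and object name are the ones from the logs below; on OpenShift, oc works the same as kubectl):

# list the backup objects in the namespace (resource name taken from the operator logs)
kubectl get perconaxtradbclusterbackups.pxc.percona.com -n dev

# show which finalizers are still set on the stuck backup object
kubectl get perconaxtradbclusterbackups.pxc.percona.com \
  cron-dev-admin-db--s3-ionos-20211121135515-364e4 \
  -n dev -o jsonpath='{.metadata.finalizers}'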
Logs from operator:
{"level":"info","ts":1637762709.6179812,"caller":"zapr/zapr.go:69","msg":"Created a new backup job","Namespace":"dev","Name":"xb-cron-dev-admin-db--s3-ionos-20211124140509-13hs9"}
{"level":"info","ts":1637762758.1214314,"caller":"zapr/zapr.go:69","msg":"deleting backup from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}
{"level":"info","ts":1637762783.195445,"caller":"zapr/zapr.go:69","msg":"backup was removed from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}
{"level":"error","ts":1637762783.2017963,"caller":"zapr/zapr.go:128","msg":"failed to update finalizers for backup","backup":"cron-dev-admin-db--s3-ionos-20211121135515-364e4","error":"Operation cannot be fulfilled on perconaxtradbclusterbackups.pxc.percona.com \"cron-dev-admin-db--s3-ionos-20211121135515-364e4\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcbackup.(*ReconcilePerconaXtraDBClusterBackup).runS3BackupFinalizer\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcbackup/controller.go:361"}
{"level":"info","ts":1637762784.1155515,"caller":"zapr/zapr.go:69","msg":"deleting backup from s3","name":"cron-dev-admin-db--s3-ionos-20211121135515-364e4"}
Before the second cluster was added to that namespace, I sometimes saw such an error about updating the backup object in the logs, but it never got stuck in a loop.
Thank you for submitting this.
We dropped support for OpenShift 3.11 this year, as it is quite legacy. Could you please share the full steps to reproduce this issue, and we will try to reproduce it on another Kubernetes platform?
Thanks @Sergey_Pronin
I do not know exactly what happened, since the first cluster had been running fine for quite a while and the second cluster was added only recently.
As I said, I had previously noticed such errors, where the operator is unable to update the backup object (I presume to remove the S3 finalizer) because it does not have the latest version, but it was never an issue until the second cluster was added.
I have tried disabling the first or the second cluster, and tried removing one backup or the other, but the issue was never resolved.
The current plan is to try to upgrade (a rough helm sketch is below):
the operator to the percona/pxc-operator chart 1.9.1 (currently percona/pxc-operator v0.1.12, which corresponds to 1.8)
pxc to the percona/pxc-db chart v1.9.1 - 8.0.23-14.1 (currently percona/pxc-db v0.1.17 - 8.0.22-13.1)
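Something along these lines is what I have in mind; the release names and namespace are placeholders and the chart values are not shown, so this is only a sketch of the helm commands, not the exact procedure:

# upgrade the operator chart first (release name and namespace are placeholders)
helm repo update
helm upgrade pxc-operator percona/pxc-operator --version 1.9.1 -n dev

# then upgrade the database chart for each cluster, keeping the existing values
helm upgrade admin-db percona/pxc-db --version 1.9.1 -n dev --reuse-values
helm upgrade customer-db percona/pxc-db --version 1.9.1 -n dev --reuse-values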
What I was looking for is some way to troubleshoot the operator and find out what triggers that unnecessary second delete between the initial command and the successful deletion from S3.
There also seems to be no way to increase the worker count, although I am not sure that would change anything.
We did not solve it within the operator, but created a workaround and implemented the needed procedures.
Since the main problem was that the operator could not update the backup object, I created a cron job that patches the object directly and removes the finalizer (a rough sketch of the patch is below).
After that the operator is able to do the rest with that object and proceed with further operations.
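Roughly, the cron just runs a patch like the one below; the object name here is a placeholder, and clearing all finalizers is a blunt instrument, so this is only a sketch of the idea, not the exact script we run:

# remove the finalizers from a stuck backup object so the operator can finish its work
# (BACKUP_NAME is a placeholder; in our cron it is filled in from a kubectl get listing)
BACKUP_NAME=cron-dev-admin-db--s3-ionos-20211121135515-364e4
kubectl patch perconaxtradbclusterbackups.pxc.percona.com "$BACKUP_NAME" \
  -n dev --type merge -p '{"metadata":{"finalizers":[]}}'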
And since I no longer had a process that deleted the files from S3, a lifecycle policy was set up for that bucket instead.
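For reference, the lifecycle policy can be set with something along these lines; the bucket name, endpoint, and 30-day retention are placeholders, and whether the put-bucket-lifecycle-configuration call is supported depends on the S3-compatible storage in use:

# expire backup objects automatically instead of relying on the operator's S3 cleanup
# (bucket, endpoint, and retention period are placeholders)
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-pxc-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-backup-bucket \
  --endpoint-url https://s3-endpoint.example.com \
  --lifecycle-configuration file://lifecycle.json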