Backup pods intermittently fail. In the logs, all of the failed pods show the message: INFO: Donor no longer in donor state, interrupting script
Then, a couple of lines further down, we see: INFO: [SST script] ++ handle_sigterm
And, finally:
Terminating process
Process completed with error: /usr/bin/run_backup.sh: 4 (Interrupted system call)
2023-08-04 11:14:14 [ERROR] Backup was finished unsuccessfull
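For anyone who wants to check whether their failed backup pods hit the same sequence, here is a minimal sketch in Python. It only does plain substring matching on the three messages quoted above; the function name `matches_donor_interrupt` is made up for illustration, and the log text is assumed to have been saved beforehand (for example with `kubectl logs`).

```python
# Markers copied verbatim from the log excerpts quoted above
# (including the spelling "unsuccessfull" used in the log itself).
FAILURE_MARKERS = (
    "Donor no longer in donor state, interrupting script",
    "handle_sigterm",
    "Backup was finished unsuccessfull",
)


def matches_donor_interrupt(log_text: str) -> bool:
    """Return True if all three markers appear in the log, in this order.

    Hypothetical helper: it is not part of the operator or of any
    Percona tooling, just a substring scan over saved log output.
    """
    position = 0
    for marker in FAILURE_MARKERS:
        index = log_text.find(marker, position)
        if index == -1:
            return False
        # Continue searching after the marker that was just found.
        position = index + len(marker)
    return True
```

A log that contains only one or two of the messages, or contains them in a different order, is reported as not matching.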
This happens sporadically, roughly once in every 30 runs. Pods that terminated with errors linger in etcd, as do their corresponding jobs.
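To see which pods are left behind, one option is to filter the pod list by phase. The sketch below assumes the output of `kubectl get pods -o json` for the relevant namespace is passed in as a string; the function name `failed_pod_names` is hypothetical, and it lists every pod in the `Failed` phase, not only backup pods.

```python
import json


def failed_pod_names(pod_list_json: str) -> list[str]:
    """Return the names of pods whose status.phase is "Failed".

    Assumption: pod_list_json is the JSON document printed by
    `kubectl get pods -o json` (a list object with an "items" array).
    """
    pod_list = json.loads(pod_list_json)
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Failed"
    ]
```

The returned names can then be reviewed before deleting the leftover pods and jobs by hand.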
Not sure whether it is related, but the PXC resource was deployed with Argo CD. No changes have been made to the manifest, so there is no reason to believe this has anything to do with the error. Just mentioning it.
We are facing the same problem, and we also deployed PXC through Argo CD. The backup itself succeeds after a few tries, as the backups are retried by the CronJob operator of Kubernetes. We see a failure in about every 4th or 5th backup.
Thanks for sharing.
Can you please share the versions and the steps to reproduce this? I have experience with Argo CD, so there is no need to go into great detail, but at least your setup details would help: k8s version, operator version, backup storage type (S3, GCS, other).
Is the backup that you are taking scheduled? Or is it on demand?
K8S-Version: v1.24.13
PXC-Operator: 1.13.0 (sha256:c674d63242f1af521edfbaffae2ae02fb8d010c0557a67a9c42d2b4a50db5243)
[Installed through your helm chart version 1.13.1]
PXC-Version: percona/percona-xtradb-cluster:8.0.33-25.1
PXC-Backup-Version: percona/percona-xtradb-cluster-operator:1.13.0-pxc8.0-backup-pxb8.0.32
The backups are scheduled and uploaded to S3 (AWS). On-demand backups have always succeeded so far, but we do not run many of them, so this may not be representative.
If you need any further details, feel free to reach out.
Thank you all for sharing this issue.
I confirmed with our team and it seems that there is such a problem. We understand that it leaves Pods in Error state, but backups are still created.
It does not qualify as a critical issue, but we will definitely look into it (trying to locate the JIRA issue about it).
Please let me know if you do not agree on the criticality here.