Backup pods intermittently fail. When looking at the logs, all of the failed ones have the message:
INFO: Donor no longer in donor state, interrupting script
Then, a couple of lines later, we see:
INFO: [SST script] ++ handle_sigterm
Terminating process
Process completed with error: /usr/bin/run_backup.sh: 4 (Interrupted system call)
2023-08-04 11:14:14 [ERROR] Backup was finished unsuccessfull
This happens sporadically, roughly 1 in every 30 runs. Pods that terminate with errors linger in etcd (along with their corresponding jobs).
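As a temporary workaround for the lingering failed pods and jobs, something along these lines should clean them up (a sketch only; it assumes the cluster runs in a namespace named `pxc` and that the only failed jobs there are backup jobs, so adjust the namespace and filtering to your setup):

```shell
# Hypothetical namespace; replace with wherever the PXC cluster lives.
NS=pxc

# Delete pods that ended in the Failed phase (the lingering backup pods).
kubectl delete pods -n "$NS" --field-selector=status.phase=Failed

# Jobs do not support a status field selector, so filter failed ones with jq
# and delete them (their remaining pods are garbage-collected along with them).
kubectl get jobs -n "$NS" -o json \
  | jq -r '.items[] | select((.status.failed // 0) > 0) | .metadata.name' \
  | xargs -r kubectl delete job -n "$NS"
```

This only clears the symptoms, of course; the failed objects will accumulate again until the underlying interruption is fixed.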
Not sure whether it is related, but the PXC resource was deployed with Argo CD. No changes have been made to the manifest, so there is no particular reason to believe this is connected to the error; just mentioning it.
We are facing the same problem, and we also deployed PXC through Argo CD. The backup itself succeeds after a few tries, since the backups are retried by the Kubernetes CronJob controller. We see a failure in roughly every 4th or 5th backup.
Thanks for sharing.
Can you please share versions and steps to reproduce? I have experience with Argo CD, so there is no need for super fine detail, but your basic setup would help: k8s version, operator version, and backup storage type (S3, GCS, other).
Is the backup that you are taking scheduled? Or is it on demand?
Hi! We are using the following versions:
PXC-Operator: 1.13.0 (sha256:c674d63242f1af521edfbaffae2ae02fb8d010c0557a67a9c42d2b4a50db5243)
[Installed through your helm chart version 1.13.1]
The backups are scheduled and uploaded to S3 (AWS). On-demand backups have succeeded every time so far, but we are not running many of them, so that may not be representative.
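For context, the scheduled S3 backups are defined in the `PerconaXtraDBCluster` CR roughly along these lines (a sketch only; the schedule, bucket, region, and secret names here are placeholders, not our real values):

```yaml
spec:
  backup:
    schedule:
      - name: daily-backup          # placeholder schedule name
        schedule: "0 3 * * *"       # standard cron expression
        keep: 5                      # number of backups to retain
        storageName: s3-backups
    storages:
      s3-backups:
        type: s3
        s3:
          bucket: my-backup-bucket           # placeholder bucket
          region: eu-central-1               # placeholder region
          credentialsSecret: my-cluster-s3   # secret with AWS credentials
```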
If you need any further details feel free to reach out.
I’m running Kubernetes v1.23.6+k3s1 on Civo’s managed k8s service.
The problems occur during the scheduled backups and we’re using AWS S3 as the backend.