Backup pods fail with "Donor no longer in donor state"

Backup pods intermittently fail. Looking at the logs, all of the failed ones contain the message:

```
INFO: Donor no longer in donor state, interrupting script
```

Then, a couple of lines later, we see:

```
INFO: [SST script] ++ handle_sigterm
```

And, finally:

```
Terminating process
Process completed with error: /usr/bin/run_backup.sh: 4 (Interrupted system call)
2023-08-04 11:14:14 [ERROR] Backup was finished unsuccessfull
```
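For what it's worth, the message suggests the backup script is watching the Galera donor state. When I want to see what state a node is actually in, I check manually with something like the following (namespace, pod name, and password handling are placeholders from my setup):

```bash
# Show the Galera node state on the suspected donor pod.
# "Donor/Desynced" is the state a donor should be in while the
# backup streams; anything else presumably trips the check above.
kubectl exec -n pxc cluster1-pxc-0 -c pxc -- \
  mysql -uroot -p"$ROOT_PASSWORD" \
  -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
```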

This happens sporadically, roughly once in every 30 runs. Pods that terminated with errors linger in etcd (and their corresponding Jobs as well).
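To clean up the leftovers in the meantime, I run roughly the following (namespace and backup name are placeholders; if I understand the ownership correctly, deleting the pxc-backup object should also remove the Job and pod it owns):

```bash
# List backup pods that ended in the Failed phase
kubectl get pods -n pxc --field-selector=status.phase=Failed

# Delete the corresponding backup object; the Job and pod it owns
# should be garbage-collected along with it
kubectl delete pxc-backup <backup-name> -n pxc
```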

Not sure if it is related, but the PXC resource was deployed with Argo CD. No changes have been made to the manifest, so there is no obvious reason to believe Argo CD is involved. Just mentioning it.

We are facing the same problem, and we also deployed PXC through Argo CD. The backup itself succeeds after a few tries, since Kubernetes retries the failed backup Jobs. Roughly every fourth or fifth backup fails for us.
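You can watch the retries happening; each run shows up as its own backup object (namespace below is a placeholder):

```bash
# Each scheduled run creates its own PerconaXtraDBClusterBackup
# object; failed attempts are visible in the state column
kubectl get pxc-backup -n pxc

# The log of a failed attempt contains the donor-state message quoted above
kubectl logs -n pxc <failed-backup-pod>
```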

Hey folks.

Thanks for sharing.
Can you please share versions and steps to reproduce this? I have experience with Argo CD, so no need to go into great detail there, but your setup basics would help: Kubernetes version, operator version, and backup storage type (S3, GCS, other).

Is the backup that you are taking scheduled? Or is it on demand?

Hi! We are using the following versions:

K8S-Version: v1.24.13

PXC-Operator: 1.13.0 (sha256:c674d63242f1af521edfbaffae2ae02fb8d010c0557a67a9c42d2b4a50db5243)
[Installed through your helm chart version 1.13.1]

PXC-Version: percona/percona-xtradb-cluster:8.0.33-25.1
PXC-Backup-Version: percona/percona-xtradb-cluster-operator:1.13.0-pxc8.0-backup-pxb8.0.32

The backups are scheduled and uploaded to S3 (AWS). On-demand backups have succeeded every time so far, but we do not run many of them, so that may not be representative.
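For completeness, this is roughly how we trigger the on-demand backups that have been succeeding (cluster and storage names are from our manifests and will differ in yours):

```bash
kubectl apply -f - <<EOF
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: on-demand-backup-1
spec:
  pxcCluster: cluster1
  storageName: s3-aws
EOF
```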

If you need any further details feel free to reach out.

I’m running Kubernetes version v1.23.6+k3s1 using Civo’s k8s service.

The problems occur during the scheduled backups and we’re using AWS S3 as the backend.

Got the same issue :frowning:
K8s-version: v1.27.1
PXC-operator: 1.12.0
PXC-version: 8.0.29-21.1
PXC-backup: 1.12.0-pxc8.0-backup

Scheduled backup and uploaded to S3 minio.

Hello all.

Thank you all for sharing this issue.
I confirmed with our team, and it seems there is such a problem. We understand that it leaves Pods in the Error state, but the backups themselves are still created.
It does not qualify as a critical issue, but we will definitely look into it (I am trying to locate the JIRA issue about it).

Please let me know if you do not agree on the criticality here.
