Automated backups randomly failing, repeating, until they finally succeed

Description:

Backup seem to be randomly failing

Steps to Reproduce:

Install operator with S3 backup storage, example backup section from the Helm chart.

    # https://docs.percona.com/percona-operator-for-mysql/pxc/backups-storage.html
    backup:
      enabled: true
      # image:
      #   tag: 1.15.0-pxc8.0-backup-pxb8.0.35-debug
      pitr:
        # PITR backups don't work yet, incorrect behaviour, so disabled for now
        enabled: false
        storageName: binlogs
        # time in seconds between uploads
        timeBetweenUploads: 300
      storages:
        # the normal backups
        sql:
          type: s3
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/redacted-s3-backup-role
          s3:
            bucket: redacted/sql/voicecore/
            credentialsSecret: redacted
            region: redacted
        # the pitr binlogs for quick restore/replay
        binlogs:
          type: s3
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/redacted-s3-backup-role
          s3:
            bucket: redacted/sql/voicecore-binlogs/
            credentialsSecret: redacted
            region: redacted
      schedule:
        - name: every-six-hours
          schedule: "0 */6 * * *"
          keep: 360
          storageName: sql

Version:

Chart: pxc-operator-1.15.0
Chart: pxc-db-1.15.0

Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.28.13-eks-a737599
WARNING: version difference between client (1.31) and server (1.28) exceeds the supported minor version skew of +/-1

Logs:

[If applicable, include any relevant log files or error messages]

Finishing logs

2024-10-09 12:01:43.769  INFO: PC protocol downgrade 1 -> 0
2024-10-09 12:01:43.769  INFO: Current view of cluster as seen by this node
view ((empty))
2024-10-09 12:01:43.769  INFO: gcomm: closed
2024-10-09 12:01:43.769  INFO: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2024-10-09 12:01:43.769  INFO: Flow-control interval: [9999999, 9999999]
2024-10-09 12:01:43.769  INFO: Received NON-PRIMARY.
2024-10-09 12:01:43.769  INFO: Shifting JOINER -> OPEN (TO: 2847)
2024-10-09 12:01:43.769  INFO: New SELF-LEAVE.
2024-10-09 12:01:43.769  INFO: Flow-control interval: [9999999, 9999999]
2024-10-09 12:01:43.769  INFO: Received SELF-LEAVE. Closing connection.
2024-10-09 12:01:43.769  INFO: Shifting OPEN -> CLOSED (TO: 2847)
2024-10-09 12:01:43.769  INFO: RECV thread exiting 0: Success
2024-10-09 12:01:43.769  INFO: SST script already terminated
2024-10-09 12:01:43.769  INFO: recv_thread() joined.
2024-10-09 12:01:43.769  INFO: Closing replication queue.
2024-10-09 12:01:43.769  INFO: Closing slave action queue.
2024-10-09 12:01:43.769  INFO: Exiting main loop
+ EXID_CODE=4
+ '[' -f /tmp/backup-is-completed ']'
+ log ERROR 'Backup was finished unsuccessfull'
Terminating processProcess completed with error: /usr/bin/run_backup.sh: 4 (Interrupted system call)2024-10-09 12:01:43 [ERROR] Backup was finished unsuccessfull
+ exit 4

Expected Result:

Every backup job is completed on first run.

Actual Result:

The outputs of the jobs

xb-cron-mysql-pxc-db-sql-20241076034-2jo18-bxjrb     0/1     Completed   0             14d
xb-cron-mysql-pxc-db-sql-20241080034-2jo18-7xnbv     0/1     Completed   0             13d
xb-cron-mysql-pxc-db-sql-202410812034-2jo18-gdx4b    0/1     Error       0             13d
xb-cron-mysql-pxc-db-sql-202410812034-2jo18-hcnpx    0/1     Error       0             13d
xb-cron-mysql-pxc-db-sql-202410812034-2jo18-k96zb    0/1     Error       0             13d
xb-cron-mysql-pxc-db-sql-202410812034-2jo18-w5q87    0/1     Error       0             13d
xb-cron-mysql-pxc-db-sql-202410812034-2jo18-zlnnr    0/1     Completed   0             13d
xb-cron-mysql-pxc-db-sql-202410818034-2jo18-c26p6    0/1     Completed   0             13d
xb-cron-mysql-pxc-db-sql-20241086034-2jo18-wkm9t     0/1     Completed   0             13d
xb-cron-mysql-pxc-db-sql-20241090034-2jo18-rjhrl     0/1     Completed   0             12d
xb-cron-mysql-pxc-db-sql-202410912034-2jo18-fkxzx    0/1     Error       0             12d
xb-cron-mysql-pxc-db-sql-202410912034-2jo18-rl8b7    0/1     Error       0             12d
xb-cron-mysql-pxc-db-sql-202410912034-2jo18-t29dn    0/1     Completed   0             12d
xb-cron-mysql-pxc-db-sql-202410918034-2jo18-m8csz    0/1     Completed   0             12d

Additional Information:

What additional information would be needed to figure out what is going on here.

Hi @Gerwin_van_de_Steeg , Could you please provide full logs from the backup pod?

Can do, here’s the redacted log
percona-failed-backup.log (169.6 KB)