Folks,
I’m seeing an error with the backup of one of our Percona XtraDB Cluster databases:
2025-06-30 00:03:48.591 INFO: RECV thread exiting 0: Success
2025-06-30 00:03:48.591 INFO: Terminating SST script
2025-06-30 00:03:49.244 INFO: [SST script] + md5_size=5094
2025-06-30 00:03:49.244 INFO: [SST script] + [[ 5094 =~ Object does not exist ]]
2025-06-30 00:03:49.244 INFO: [SST script] + (( md5_size < 23000 ))
2025-06-30 00:03:49.244 INFO: [SST script] + log ERROR 'Backup is empty'
2025-06-30 00:03:49.246 INFO: [SST script] 2025-06-30 00:03:49 [ERROR] Backup is empty
2025-06-30 00:03:49.246 INFO: [SST script] + log ERROR 'Backup was finished unsuccessfull'
2025-06-30 00:03:49.249 INFO: [SST script] 2025-06-30 00:03:49 [ERROR] Backup was finished unsuccessfull
2025-06-30 00:03:49.249 INFO: [SST script] + exit 1
2025-06-30 00:03:49.249 INFO: SST script ended by termination
2025-06-30 00:03:49.249 INFO: Closing send monitor...
2025-06-30 00:03:49.249 INFO: Closed send monitor.
2025-06-30 00:03:49.249 INFO: recv_thread() joined.
2025-06-30 00:03:49.249 INFO: Closing send queue.
2025-06-30 00:03:49.249 INFO: Donor no longer in donor state, sst-script finished.
2025-06-30 00:03:49.249 INFO: Closing receive queue.
2025-06-30 00:03:49.249 INFO: Exiting main loop
2025-06-30 00:03:49.249 INFO: Garbd returns 1
+ grep 'Will never receive state. Need to abort' /tmp/garbd.log
Terminating process2025-06-30 00:03:47.589 FATAL: ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1334: Will never receive state. Need to abort.
+ exit 1
I’ve posted additional information in Jira.
Does anyone have suggestions on what the next step should be to debug this problem?
Cheers,
Based on the logs you shared in the bug tracker, it looks like there was a network issue while xbcloud was uploading chunks:
2025-06-30 00:03:47.319 INFO: [SST script] 2025/06/30 00:03:47 socat[392] E write(1, 0x555d84a44000, 8192): Broken pipe
2025-06-30 00:03:47.319 INFO: [SST script] 250630 00:03:47 xbcloud: out-of-order chunk: real offset = 0x1e00000, expected offset = 0x15a5a34
Are you seeing the same errors each time you take a backup?
Yes, that happens for every single backup of that particular database.
One thing you can try is to set --curl-retriable-errors=1 under backup.storages.s3.containerOptions.args.xbcloud in your cr.yaml file.
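For reference, a minimal sketch of where that option lands in the CR, assuming the relevant storage is the one named sql (as in the CR dumped later in this thread), so the path becomes backup.storages.sql.containerOptions.args.xbcloud:

spec:
  backup:
    storages:
      sql:
        containerOptions:
          args:
            xbcloud:
            - --curl-retriable-errors=1

The operator appears to append these args to the generated xbcloud command line, which would explain why the log excerpt below shows both the pre-existing --curl-retriable-errors=7 and the added --curl-retriable-errors=1.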
Tried it, but it didn’t seem to fix the problem; here’s a log excerpt to help identify what’s going on.
2025-07-16 00:03:33.359 INFO: [SST script] + xbcloud put --parallel=8 --storage=s3 --md5 --curl-retriable-errors=7 --curl-retriable-errors=1 --s3-bucket=redacted sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.sst_info
2025-07-16 00:03:33.844 INFO: [SST script] 250716 00:03:33 xbcloud: Successfully connected.
2025-07-16 00:03:33.844 INFO: [SST script] 250716 00:03:33 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.sst_info/sst_info.00000000000000000000, size: 151
2025-07-16 00:03:33.844 INFO: [SST script] 250716 00:03:33 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.sst_info/sst_info.00000000000000000001, size: 22
2025-07-16 00:03:33.844 INFO: [SST script] 250716 00:03:33 xbcloud: Uploading md5
2025-07-16 00:03:33.844 INFO: [SST script] 250716 00:03:33 xbcloud: Upload failed: backup is incomplete.
2025-07-16 00:03:33.844 INFO: [SST script] Backup doesn't contain last file with prefix xtrabackup_tablespaces in the cloud storage
2025-07-16 00:03:33.845 INFO: [SST script] + (( 0 == 0 ))
2025-07-16 00:03:33.846 INFO: [SST script] + socat -u openssl-listen:4444,reuseaddr,cert=/etc/mysql/ssl-internal/tls.crt,key=/etc/mysql/ssl-internal/tls.key,cafile=/etc/mysql/ssl-internal/ca.crt,verify=1,retry=30 stdio
2025-07-16 00:03:33.846 INFO: [SST script] + grep -v 'error: http request failed: Couldn'\''t resolve host name'
2025-07-16 00:03:33.846 INFO: [SST script] ++ grep -c processor /proc/cpuinfo
2025-07-16 00:03:33.847 INFO: [SST script] + xbcloud put --storage=s3 --parallel=8 --md5 --curl-retriable-errors=7 --curl-retriable-errors=1 --s3-bucket=redacted sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full
2025-07-16 00:03:45.166 INFO: [SST script] 250716 00:03:33 xbcloud: Successfully connected.
2025-07-16 00:03:45.166 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/voipmonitor/cdr_country_code#p#p250717.ibd.00000000000000000000, size: 51750
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/voipmonitor/cdr#p#p250707.ibd.00000000000000000000, size: 6688324
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: out-of-order chunk: real offset = 0xa00000, expected offset = 0x660079
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/voipmonitor/cdr_audio_transcribe#p#p250707.ibd.00000000000000000001, size: 60
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/ibdata1.00000000000000000002, size: 21
2025-07-16 00:03:45.655 INFO: [SST script] 2025/07/16 00:03:45 socat[392] E write(1, 0x564ec4a0b000, 8192): Broken pipe
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/ibdata1.00000000000000000001, size: 2097193
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: [0] successfully uploaded chunk: sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full/voipmonitor/register_time_info#p#p250312.ibd.00000000000000000000, size: 8613762
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: Uploading md5
2025-07-16 00:03:45.655 INFO: [SST script] 250716 00:03:45 xbcloud: Upload failed: backup is incomplete.
2025-07-16 00:03:45.655 INFO: [SST script] Backup doesn't contain last file with prefix xtrabackup_tablespaces in the cloud storage
2025-07-16 00:03:45.656 INFO: [SST script] + trap '' 15
2025-07-16 00:03:45.656 INFO: [SST script] + aws s3 ls s3://redacted/sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.md5
2025-07-16 00:03:45.976 WARN: 3.0 (voipmonitor-db-pxc-db-pxc-2): State transfer to 0.0 (garb) failed: Invalid argument
2025-07-16 00:03:45.976 FATAL: ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1334: Will never receive state. Need to abort.
2025-07-16 00:03:45.976 INFO: gcomm: terminating thread
2025-07-16 00:03:45.976 INFO: gcomm: joining thread
2025-07-16 00:03:45.976 INFO: gcomm: closing backend
2025-07-16 00:03:46.385 INFO: [SST script] 2025-07-16 00:03:46 4582 voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.md5
2025-07-16 00:03:46.525 INFO: [SST script] ++ aws --output json s3api list-objects --bucket redacted --prefix sql/voipmonitor/voipmonitor-db-pxc-db-2025-07-16-00:00:52-full.md5 --query 'Contents[0].Size'
2025-07-16 00:03:46.525 INFO: [SST script] ++ sed -e 's/.*"size":\([0-9]*\).*/\1/'
Where would I need to look next?
Cheers,
Can you share your cr.yaml file? Make sure to anonymize data prior to attaching here.
The CR is generated from the Helm release by Flux, so I’ll dump the generated CR from Kubernetes.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  annotations:
    meta.helm.sh/release-name: voipmonitor-db
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/instance: voipmonitor-db
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: pxc-db
    app.kubernetes.io/version: 1.17.0
    helm.sh/chart: pxc-db-1.17.0
    helm.toolkit.fluxcd.io/name: voipmonitor-db
    helm.toolkit.fluxcd.io/namespace: default
  name: voipmonitor-db-pxc-db
  namespace: default
spec:
  backup:
    image: percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35
    pitr:
      enabled: true
      resources:
        limits: {}
        requests: {}
      storageName: binlogs
      timeBetweenUploads: 300
      timeoutSeconds: 60
    schedule:
    - keep: 360
      name: every-six-hours
      schedule: 0 */6 * * *
      storageName: sql
    storages:
      binlogs:
        s3:
          bucket: redacted/sql/voipmonitor-binlogs/
          credentialsSecret: voipmonitor-db-backup-user-creds
          endpointUrl: https://s3.us-west-2.amazonaws.com
          region: us-west-2
        type: s3
      sql:
        containerOptions:
          args:
            xbcloud:
            - --curl-retriable-errors=1
        s3:
          bucket: redacted/sql/voipmonitor/
          credentialsSecret: voipmonitor-db-backup-user-creds
          endpointUrl: https://s3.us-west-2.amazonaws.com
          region: us-west-2
        type: s3
  crVersion: 1.17.0
  enableCRValidationWebhook: false
  haproxy:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    annotations:
      prometheus.io/port: "8404"
      prometheus.io/scrape: "true"
    enabled: true
    exposePrimary:
      annotations:
        external-dns.alpha.kubernetes.io/hostname: db-voipmonitor.redacted.local
        external-dns.alpha.kubernetes.io/ttl: "60"
        kubernetes.io/ingress.class: redacted
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
        service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"
        service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
        service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: tcp
        service.beta.kubernetes.io/aws-load-balancer-name: redacted
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        service.beta.kubernetes.io/aws-load-balancer-type: external
      loadBalancerSourceRanges:
      - 10.0.0.0/8
      type: LoadBalancer
    gracePeriod: 30
    image: percona/haproxy:2.8.14
    labels: {}
    livenessDelaySec: 300
    livenessProbes:
      failureThreshold: 4
      initialDelaySeconds: 60
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 5
    nodeSelector: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessDelaySec: 15
    readinessProbes:
      failureThreshold: 3
      initialDelaySeconds: 15
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits: {}
      requests:
        cpu: 600m
        memory: 1G
    sidecarPVCs: []
    sidecarResources:
      limits: {}
      requests: {}
    sidecarVolumes: []
    sidecars: []
    size: 3
    tolerations: []
    volumeSpec:
      emptyDir: {}
  logCollectorSecretName: voipmonitor-db-pxc-db-log-collector
  logcollector:
    enabled: true
    image: percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0
    resources:
      limits: {}
      requests:
        cpu: 200m
        memory: 100M
  pause: false
  pmm:
    enabled: false
  proxysql:
    enabled: false
  pxc:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    annotations:
      prometheus.io/port: "9104"
      prometheus.io/scrape: "true"
    autoRecovery: true
    configuration: |2
      [mysqld]
      log_bin_trust_function_creators = 1
      # https://www.voipmonitor.org/doc/How_to_migrate_database_to_partitions_or_optimize_performance
      innodb_file_per_table = 1
      # Enabled for performance per https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_log_at_trx_commit
      innodb_flush_log_at_trx_commit = 2
    gracePeriod: 600
    image: percona/percona-xtradb-cluster:8.0.41-32.1
    labels: {}
    livenessDelaySec: 300
    livenessProbes:
      failureThreshold: 3
      initialDelaySeconds: 300
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    nodeSelector:
      topology.kubernetes.io/zone: us-west-2a
    podDisruptionBudget:
      maxUnavailable: 1
    readinessDelaySec: 15
    readinessProbes:
      failureThreshold: 5
      initialDelaySeconds: 15
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 15
    resources:
      limits:
        cpu: 2000m
        memory: 8G
      requests:
        cpu: 600m
        memory: 4G
    sidecarPVCs: []
    sidecarResources:
      limits: {}
      requests: {}
    sidecarVolumes: []
    sidecars:
    - args:
      - --web.listen-address=0.0.0.0:9104
      - --mysqld.username=monitor
      - --mysqld.address=127.0.0.1:3306
      - --tls.insecure-skip-verify
      env:
      - name: MYSQLD_EXPORTER_PASSWORD
        valueFrom:
          secretKeyRef:
            key: monitor
            name: voipmonitor-db-passwords
      image: prom/mysqld-exporter
      imagePullPolicy: Always
      name: mysqld-exporter
    size: 3
    tolerations: []
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: csi-gp3
  secretsName: voipmonitor-db-passwords
  sslInternalSecretName: voipmonitor-db-pxc-db-ssl-internal
  sslSecretName: voipmonitor-db-pxc-db-ssl
  tls:
    enabled: true
    issuerConf:
      group: cert-manager.io
      kind: ClusterIssuer
      name: self-signed-cluster-issuer
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 4 * * *
    versionServiceEndpoint: https://check.percona.com
  vaultSecretName: voipmonitor-db-pxc-db-vault
@jrivera any further insights?
Having the PITR pod for one database instance working and writing to the S3 bucket, while the PITR pod for the second database fails with the exact same settings and permissions on the same S3 bucket, is frustrating.
voipmonitor-db-pxc-db-pitr-788f4566fd-wfpkm 1/1 Running 101 (5m5s ago) 59d
voipmonitor-db-pxc-db-pitr-788f4566fd-wfpkm 0/1 Error 101 (6m7s ago) 59d
voipmonitor-db-pxc-db-pitr-788f4566fd-wfpkm 0/1 CrashLoopBackOff 101 (13s ago) 59d
With these events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 58m kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35" in 1.507s (1.507s including waiting). Image size: 266275397 bytes.
Normal Pulled 57m kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35" in 1.51s (1.51s including waiting). Image size: 266275397 bytes.
Normal Pulled 56m kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35" in 1.49s (1.49s including waiting). Image size: 266275397 bytes.
Normal Created 55m (x92 over 39d) kubelet Created container pitr
Normal Started 55m (x92 over 39d) kubelet Started container pitr
Normal Pulled 55m kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35" in 1.492s (1.492s including waiting). Image size: 266275397 bytes.
Normal Pulling 18m (x99 over 39d) kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-pxc8.0-backup-pxb8.0.35"
Warning BackOff 3m21s (x1805 over 14d) kubelet Back-off restarting failed container pitr in pod voipmonitor-db-pxc-db-pitr-788f4566fd-wfpkm_default(3755e7bf-d9ec-4915-bce2-936803a18f37)
And the logs of the pitr container show:
2025/08/18 14:02:42 checking binlog.000567 (668d524d-ef09-11ef-9a2d-23d8022b4ffc:479419643-479431830) against last uploaded set
2025/08/18 14:02:42 last uploaded 668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656 is not subset of 668d524d-ef09-11ef-9a2d-23d8022b4ffc:479419643-479431830 in binlog.000567 or vice versa
2025/08/18 14:02:42 checking binlog.000566 (668d524d-ef09-11ef-9a2d-23d8022b4ffc:479408575-479419642) against last uploaded set
2025/08/18 14:02:42 last uploaded 668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656 is not subset of 668d524d-ef09-11ef-9a2d-23d8022b4ffc:479408575-479419642 in binlog.000566 or vice versa
2025/08/18 14:02:42 checking binlog.000565 (668d524d-ef09-11ef-9a2d-23d8022b4ffc:478158657-479408574) against last uploaded set
2025/08/18 14:02:42 last uploaded 668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656 is not subset of 668d524d-ef09-11ef-9a2d-23d8022b4ffc:478158657-479408574 in binlog.000565 or vice versa
2025/08/18 14:02:42 checking binlog.000564 (668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656) against last uploaded set
2025/08/18 14:02:42 last uploaded 668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656 is equal to 668d524d-ef09-11ef-9a2d-23d8022b4ffc:476366455-478158656 in binlog.000564
2025/08/18 14:02:42 last uploaded binlog: binlog.000564
2025/08/18 14:02:42 starting to process binlog with name binlog.000565
2025/08/18 14:03:03 successfully wrote binlog file binlog.000565 to storage with name binlog_1752946151_225a2b16e20c72de50caf5cd02634272
2025/08/18 14:03:07 starting to process binlog with name binlog.000566
2025/08/18 14:03:07 successfully wrote binlog file binlog.000566 to storage with name binlog_1752980880_fd1b65834576e34fe8589d8c2d3ddcb8
2025/08/18 14:03:07 starting to process binlog with name binlog.000567
2025/08/18 14:03:08 successfully wrote binlog file binlog.000567 to storage with name binlog_1752981228_8fe77c6abb9de6e9c8957960dd5ceab6
2025/08/18 14:03:08 starting to process binlog with name binlog.000568
2025/08/18 14:03:09 successfully wrote binlog file binlog.000568 to storage with name binlog_1752981550_531103abd6c653d92c64dca4a879483d
2025/08/18 14:03:09 starting to process binlog with name binlog.000569
2025/08/18 14:03:09 successfully wrote binlog file binlog.000569 to storage with name binlog_1752981872_6a032b6c4656f92eeca0e4dea355fbad
2025/08/18 14:03:10 starting to process binlog with name binlog.000570
2025/08/18 14:03:10 successfully wrote binlog file binlog.000570 to storage with name binlog_1752982193_b2ee4699c56cdfaf1f25444becc8a514
2025/08/18 14:03:10 starting to process binlog with name binlog.000571
2025/08/18 14:03:11 successfully wrote binlog file binlog.000571 to storage with name binlog_1752982515_d07eecbc461fef379b73117db14ce741
2025/08/18 14:03:11 starting to process binlog with name binlog.000572
2025/08/18 14:03:11 successfully wrote binlog file binlog.000572 to storage with name binlog_1752982837_5901903bde043eb292525bc5dacd64ec
2025/08/18 14:03:11 starting to process binlog with name binlog.000573
2025/08/18 14:03:12 successfully wrote binlog file binlog.000573 to storage with name binlog_1752983159_6853b041f4792d2e760cccc53c621203
2025/08/18 14:03:12 starting to process binlog with name binlog.000574
2025/08/18 14:03:12 successfully wrote binlog file binlog.000574 to storage with name binlog_1752983481_31e77e814336b37e5da4b07ad97d6448
2025/08/18 14:03:12 starting to process binlog with name binlog.000575
2025/08/18 14:03:13 successfully wrote binlog file binlog.000575 to storage with name binlog_1752983803_13049639ac841b809d46c53bd841ba78
2025/08/18 14:03:13 starting to process binlog with name binlog.000576
2025/08/18 14:03:13 successfully wrote binlog file binlog.000576 to storage with name binlog_1752984125_d5f2d4022d038803dac39633bd9a33a8
2025/08/18 14:03:13 starting to process binlog with name binlog.000577
2025/08/18 14:03:14 successfully wrote binlog file binlog.000577 to storage with name binlog_1752984447_7f97f166fe1490b8d9281014a5c35c8d
2025/08/18 14:03:14 starting to process binlog with name binlog.000578
2025/08/18 14:03:14 successfully wrote binlog file binlog.000578 to storage with name binlog_1752984768_74770ec5aa572e57c6bf7fc46bbab295
2025/08/18 14:03:14 starting to process binlog with name binlog.000579
2025/08/18 14:03:15 successfully wrote binlog file binlog.000579 to storage with name binlog_1752985089_4f5b09c61742673795f7b679387e72e7
2025/08/18 14:03:15 starting to process binlog with name binlog.000580
2025/08/18 14:03:15 successfully wrote binlog file binlog.000580 to storage with name binlog_1752985411_8706900605ec29d46fe53e9172565211
2025/08/18 14:03:15 starting to process binlog with name binlog.000581
2025/08/18 14:03:16 successfully wrote binlog file binlog.000581 to storage with name binlog_1752985733_c6d13905633a8a72215c7d30bb7a0fcd
2025/08/18 14:03:16 starting to process binlog with name binlog.000582
2025/08/18 14:03:16 successfully wrote binlog file binlog.000582 to storage with name binlog_1752986055_6f7e396c2d1672e1a1aa4627c4141927
2025/08/18 14:03:16 starting to process binlog with name binlog.000583
2025/08/18 14:03:40 ERROR: collect binlog files: manage binlog: put binlog.000583 object: put object sql/voipmonitor-binlogs/binlog_1752986376_d9379a2165d6e8d7ddb0b3a00a735808: context deadline exceeded