Description:
This happens frequently enough that it gets really annoying: the two database instances we run crash every so often with no easy-to-diagnose reason and cannot recover themselves, leaving a pod in a CrashLoopBackOff.
This looks like:
mysql-pxc-db-pxc-0 4/4 Running 13 (5d17h ago) 59d
mysql-pxc-db-pxc-1 4/4 Running 0 59d
mysql-pxc-db-pxc-2 3/4 CrashLoopBackOff 46 (75s ago) 4h9m
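For reference, the state above and the output below are just standard kubectl output, gathered with roughly the following (the label selector is a guess):

# pod listing, pod details, and the logs of the crashing pod's containers
kubectl get pods -l app.kubernetes.io/instance=mysql-pxc-db
kubectl describe pod mysql-pxc-db-pxc-2
kubectl logs mysql-pxc-db-pxc-2 -c logs              # fluent-bit log collector sidecar
kubectl logs mysql-pxc-db-pxc-2 -c pxc --previous    # previous attempt of the pxc container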
And a describe of the pod in question has the following segment:
pxc:
Container ID: containerd://7894352f5a174ba0a5201aeda313a26c7494d8ec3c200fd96fece476e5e8834c
Image: percona/percona-xtradb-cluster:8.0.41-32.1
Image ID: docker.io/percona/percona-xtradb-cluster@sha256:168ffb252d533b856a74820dea51c155bf5a8cb6a806a4d8a2e387ed7417a733
Ports: 3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP, 33060/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/var/lib/mysql/pxc-entrypoint.sh
Args:
mysqld
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 19 Aug 2025 01:08:51 +1200
Finished: Tue, 19 Aug 2025 01:09:38 +1200
Ready: False
Restart Count: 45
Limits:
cpu: 2
memory: 12G
Requests:
cpu: 600m
memory: 8G
Liveness: exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=5s period=10s #success=1 #failure=3
Readiness: exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1 #failure=5
Environment Variables from:
mysql-pxc-db-env-vars-pxc Secret Optional: true
Environment:
PXC_SERVICE: mysql-pxc-db-pxc-unready
MONITOR_HOST: %
MYSQL_ROOT_PASSWORD: <set to the key 'root' in secret 'internal-mysql-pxc-db'> Optional: false
XTRABACKUP_PASSWORD: <set to the key 'xtrabackup' in secret 'internal-mysql-pxc-db'> Optional: false
MONITOR_PASSWORD: <set to the key 'monitor' in secret 'internal-mysql-pxc-db'> Optional: false
LOG_DATA_DIR: /var/lib/mysql
IS_LOGCOLLECTOR: yes
CLUSTER_HASH: 4212372
OPERATOR_ADMIN_PASSWORD: <set to the key 'operator' in secret 'internal-mysql-pxc-db'> Optional: false
LIVENESS_CHECK_TIMEOUT: 5
READINESS_CHECK_TIMEOUT: 15
DEFAULT_AUTHENTICATION_PLUGIN: caching_sha2_password
MYSQL_NOTIFY_SOCKET: /var/lib/mysql/notify.sock
MYSQL_STATE_FILE: /var/lib/mysql/mysql.state
And the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m25s default-scheduler Successfully assigned default/mysql-pxc-db-pxc-2 to ip-10-1-0-193.ap-southeast-2.compute.internal
Normal Pulling 2m22s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.17.0"
Normal Pulled 2m20s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0" in 1.535s (1.535s including waiting). Image size: 87993197 bytes.
Normal Created 2m20s kubelet Created container pxc-init
Normal Started 2m20s kubelet Started container pxc-init
Normal Pulling 2m17s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
Normal Pulled 2m16s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.544s (1.544s including waiting). Image size: 136426815 bytes.
Normal Created 2m16s kubelet Created container logs
Normal Started 2m16s kubelet Started container logs
Normal Pulling 2m16s kubelet Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
Normal Pulled 2m14s kubelet Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.487s (1.487s including waiting). Image size: 136426815 bytes.
Normal Created 2m14s kubelet Created container logrotate
Normal Started 2m14s kubelet Started container logrotate
Normal Pulled 2m13s kubelet Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.559s (1.559s including waiting). Image size: 211194772 bytes.
Normal Pulling 2m13s kubelet Pulling image "prom/mysqld-exporter"
Normal Pulled 2m11s kubelet Successfully pulled image "prom/mysqld-exporter" in 1.494s (1.495s including waiting). Image size: 10954979 bytes.
Normal Created 2m11s kubelet Created container mysqld-exporter
Normal Started 2m11s kubelet Started container mysqld-exporter
Warning Unhealthy 112s kubelet Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.1.0.78:33062' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
Normal Pulling 85s (x2 over 2m14s) kubelet Pulling image "percona/percona-xtradb-cluster:8.0.41-32.1"
Normal Created 84s (x2 over 2m13s) kubelet Created container pxc
Normal Started 84s (x2 over 2m13s) kubelet Started container pxc
Normal Pulled 84s kubelet Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.516s (1.516s including waiting). Image size: 211194772 bytes.
Another similar issue was posted but never resolved: Pxc-db cluster unable to recover after crash - #5 by Michael_Coburn
Steps to Reproduce:
It’s intermittent and not easy to reproduce
Let a workload run for 60+ days at 50-100 rps against a very simple schema (there aren't even foreign key relationships between the tables), and it might crash or it might not; it could crash early, or not at all.
Version:
Operator: 1.17.0
Container: percona/percona-xtradb-cluster:8.0.41-32.1
Logs:
Log container in the Pod with the reboot loop:
{"log":"2025-08-18T13:52:15.678791Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:16.677708Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:17.177883Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.177639Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.677776Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:19.677904Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:20.178119Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.177908Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180756Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180790Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180931Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout)\n\t at ../../../../percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.181000Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181151Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181219Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181364Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1952: Failed to open channel 'mysql-pxc-db-pxc' at 'gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc': -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181382Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Operation timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181395Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc) failed to establish connection with cluster (reason: 7)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181418Z 0 [ERROR] [MY-010119] [Server] Aborting\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181771Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1) Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version 26.1.4.3.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182743Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182939Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182971Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.186036Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.189065Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192222Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192260Z 0 [Note] [MY-000000] [Galera] cert index usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192298Z 0 [Note] [MY-000000] [Galera] cert trx map usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192308Z 0 [Note] [MY-000000] [Galera] deps set usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192323Z 0 [Note] [MY-000000] [Galera] avg deps dist 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192333Z 0 [Note] [MY-000000] [Galera] avg cert interval 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192342Z 0 [Note] [MY-000000] [Galera] cert index size 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192407Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192481Z 0 [Note] [MY-000000] [Galera] wsdb trx map usage 0 conn query map usage 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192496Z 0 [Note] [MY-000000] [Galera] MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192636Z 0 [Note] [MY-000000] [Galera] Shifting CLOSED -> DESTROYED (TO: 0)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.193683Z 0 [Note] [MY-000000] [Galera] Flushing memory map to disk...\n","file":"/var/lib/mysql/mysqld-error.log"}
PXC container in the Pod with the reboot loop:
Cluster address set to: mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
8.0.41-32.1
[mysqld]
pxc-encrypt-cluster-traffic=ON
ssl-ca=/etc/mysql/ssl-internal/ca.crt
ssl-key=/etc/mysql/ssl-internal/tls.key
ssl-cert=/etc/mysql/ssl-internal/tls.crt
wsrep_provider_options="pc.weight=10"
wsrep_sst_donor=mysql-pxc-db-pxc-1,
log-error=/var/lib/mysql/mysqld-error.log
log_error_suppression_list="MY-010055"
admin-address=10.1.0.144
authentication_policy=caching_sha2_password,,
skip_replica_start=ON
wsrep_notify_cmd=/var/lib/mysql/wsrep_cmd_notify_handler.sh
enforce-gtid-consistency
gtid-mode=ON
plugin_load="binlog_utils_udf=binlog_utils_udf.so"
datadir=/var/lib/mysql
socket=/tmp/mysql.sock
skip-host-cache
coredumper
server_id=42123722
binlog_format=ROW
default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_file_per_table = 1
innodb_autoinc_lock_mode=2
bind_address = 0.0.0.0
wsrep_slave_threads=2
wsrep_cluster_address=gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_name=mysql-pxc-db-pxc
wsrep_node_address=10.1.0.144
wsrep_node_incoming_address=mysql-pxc-db-pxc-2.mysql-pxc-db-pxc.default.svc.cluster.local:3306
wsrep_sst_method=xtrabackup-v2
[client]
socket=/tmp/mysql.sock
[sst]
cpat=.*\.pem$\|.*init\.ok$\|.*galera\.cache$\|.*wsrep_recovery_verbose\.log$\|.*readiness-check\.sh$\|.*liveness-check\.sh$\|.*get-pxc-state$\|.*sst_in_progress$\|.*sleep-forever$\|.*pmm-prerun\.sh$\|.*sst-xb-tmpdir$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$\|.*pxc-entrypoint\.sh$\|.*unsafe-bootstrap\.sh$\|.*pxc-configure-pxc\.sh\|.*peer-list$\|.*auth_plugin$\|.*version_info$\|.*mysql-state-monitor$\|.*mysql-state-monitor\.log$\|.*notify\.sock$\|.*mysql\.state$\|.*wsrep_cmd_notify_handler\.sh$
progress=1
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ test -e /opt/percona/hookscript/hook.sh
+ init_opt=
+ [[ -f /etc/mysql/init-file/init.sql ]]
There are a whole lot more logs related to this, but the initial problem to solve is why the node won't reconnect; deleting the pod and the underlying PVC doesn't help. Once the reconnection issue is resolved, the source of the original crash should be identified.
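It is also worth confirming that the two surviving nodes still form a Primary component and only list each other. A sketch of that check, assuming the root credentials from the internal-mysql-pxc-db secret shown in the describe above:

# read the root password from the operator-managed secret and query wsrep status on the healthy nodes
ROOT_PW=$(kubectl get secret internal-mysql-pxc-db -o jsonpath='{.data.root}' | base64 -d)
for p in mysql-pxc-db-pxc-0 mysql-pxc-db-pxc-1; do
  echo "== $p =="
  kubectl exec "$p" -c pxc -- mysql -uroot -p"$ROOT_PW" -e \
    "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_status','wsrep_cluster_size','wsrep_incoming_addresses','wsrep_local_state_comment');"
done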
Expected Result:
Pod to come back and rejoin the cluster without any fuss
Actual Result:
Pod goes into a reboot loop
Additional Information:
It seems like the reboot loop is caused by the pod needing to replicate data to catch up to where it was, but this takes longer than the time the pod is given to rejoin, so the liveness and readiness checks fail, the container is restarted before it has finished, and it dies again.
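If that theory holds, one workaround (not a fix for the underlying crash) would be to give the node more headroom before the probes kill it. Operator 1.17 exposes probe settings in the PerconaXtraDBCluster CR; a sketch of the relevant fragment, assuming the CR is named mysql-pxc-db and using purely illustrative values:

# edit with e.g.: kubectl edit pxc mysql-pxc-db
spec:
  pxc:
    livenessProbes:
      initialDelaySeconds: 300
      timeoutSeconds: 5
      failureThreshold: 10    # default is 3; more headroom while the node is catching up
    readinessProbes:
      initialDelaySeconds: 15
      timeoutSeconds: 30
      failureThreshold: 15    # readiness failures only drop the pod from the Service; liveness failures restart it

This would only paper over the restart loop; the underlying crash would still need to be found.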
