Description:
This happens frequently enough to be really annoying: the two database instances we run crash frequently with no easy-to-diagnose cause and cannot recover themselves, leaving the affected pod in CrashLoopBackOff.
This looks like:
mysql-pxc-db-pxc-0                                      4/4     Running            13 (5d17h ago)    59d
mysql-pxc-db-pxc-1                                      4/4     Running            0                 59d
mysql-pxc-db-pxc-2                                      3/4     CrashLoopBackOff   46 (75s ago)      4h9m
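The output below comes from checks along these lines (pod and container names are taken from the listing above):

# Inspect the looping member: pod events, the previous pxc container run, and the log collector
kubectl describe pod mysql-pxc-db-pxc-2
kubectl logs mysql-pxc-db-pxc-2 -c pxc --previous | tail -n 100
kubectl logs mysql-pxc-db-pxc-2 -c logs --tail=100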
A kubectl describe of the pod in question includes the following segment:
  pxc:
    Container ID:  containerd://7894352f5a174ba0a5201aeda313a26c7494d8ec3c200fd96fece476e5e8834c
    Image:         percona/percona-xtradb-cluster:8.0.41-32.1
    Image ID:      docker.io/percona/percona-xtradb-cluster@sha256:168ffb252d533b856a74820dea51c155bf5a8cb6a806a4d8a2e387ed7417a733
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP, 33060/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/pxc-entrypoint.sh
    Args:
      mysqld
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 19 Aug 2025 01:08:51 +1200
      Finished:     Tue, 19 Aug 2025 01:09:38 +1200
    Ready:          False
    Restart Count:  45
    Limits:
      cpu:     2
      memory:  12G
    Requests:
      cpu:      600m
      memory:   8G
    Liveness:   exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1 #failure=5
    Environment Variables from:
      mysql-pxc-db-env-vars-pxc  Secret  Optional: true
    Environment:
      PXC_SERVICE:                    mysql-pxc-db-pxc-unready
      MONITOR_HOST:                   %
      MYSQL_ROOT_PASSWORD:            <set to the key 'root' in secret 'internal-mysql-pxc-db'>        Optional: false
      XTRABACKUP_PASSWORD:            <set to the key 'xtrabackup' in secret 'internal-mysql-pxc-db'>  Optional: false
      MONITOR_PASSWORD:               <set to the key 'monitor' in secret 'internal-mysql-pxc-db'>     Optional: false
      LOG_DATA_DIR:                   /var/lib/mysql
      IS_LOGCOLLECTOR:                yes
      CLUSTER_HASH:                   4212372
      OPERATOR_ADMIN_PASSWORD:        <set to the key 'operator' in secret 'internal-mysql-pxc-db'>  Optional: false
      LIVENESS_CHECK_TIMEOUT:         5
      READINESS_CHECK_TIMEOUT:        15
      DEFAULT_AUTHENTICATION_PLUGIN:  caching_sha2_password
      MYSQL_NOTIFY_SOCKET:            /var/lib/mysql/notify.sock
      MYSQL_STATE_FILE:               /var/lib/mysql/mysql.state
And the pod shows the following events:
Events:
  Type     Reason     Age    From               Message
  ----     ------     ----   ----               -------
  Normal   Scheduled  2m25s  default-scheduler  Successfully assigned default/mysql-pxc-db-pxc-2 to ip-10-1-0-193.ap-southeast-2.compute.internal
  Normal   Pulling    2m22s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0"
  Normal   Pulled     2m20s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0" in 1.535s (1.535s including waiting). Image size: 87993197 bytes.
  Normal   Created    2m20s  kubelet            Created container pxc-init
  Normal   Started    2m20s  kubelet            Started container pxc-init
  Normal   Pulling    2m17s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
  Normal   Pulled     2m16s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.544s (1.544s including waiting). Image size: 136426815 bytes.
  Normal   Created    2m16s  kubelet            Created container logs
  Normal   Started    2m16s  kubelet            Started container logs
  Normal   Pulling    2m16s  kubelet            Pulling image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0"
  Normal   Pulled     2m14s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster-operator:1.17.0-logcollector-fluentbit4.0.0" in 1.487s (1.487s including waiting). Image size: 136426815 bytes.
  Normal   Created    2m14s  kubelet            Created container logrotate
  Normal   Started    2m14s  kubelet            Started container logrotate
  Normal   Pulled     2m13s  kubelet            Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.559s (1.559s including waiting). Image size: 211194772 bytes.
  Normal   Pulling    2m13s  kubelet            Pulling image "prom/mysqld-exporter"
  Normal   Pulled     2m11s  kubelet            Successfully pulled image "prom/mysqld-exporter" in 1.494s (1.495s including waiting). Image size: 10954979 bytes.
  Normal   Created    2m11s  kubelet            Created container mysqld-exporter
  Normal   Started    2m11s  kubelet            Started container mysqld-exporter
  Warning  Unhealthy  112s   kubelet            Readiness probe failed: ERROR 2003 (HY000): Can't connect to MySQL server on '10.1.0.78:33062' (111)
+ [[ '' == \P\r\i\m\a\r\y ]]
+ exit 1
  Normal  Pulling  85s (x2 over 2m14s)  kubelet  Pulling image "percona/percona-xtradb-cluster:8.0.41-32.1"
  Normal  Created  84s (x2 over 2m13s)  kubelet  Created container pxc
  Normal  Started  84s (x2 over 2m13s)  kubelet  Started container pxc
  Normal  Pulled   84s                  kubelet  Successfully pulled image "percona/percona-xtradb-cluster:8.0.41-32.1" in 1.516s (1.516s including waiting). Image size: 211194772 bytes.
A similar issue was posted previously but never resolved: Pxc-db cluster unable to recover after crash - #5 by Michael_Coburn
Steps to Reproduce:
It’s intermittent and not easy to reproduce. Let a workload run for 60+ days at 50-100 requests per second against a very simple schema (there aren’t even foreign key relationships between the tables) and it might crash or it might not; it could crash early, or not at all.
Version:
Operator: 1.17.0
Container: percona/percona-xtradb-cluster:8.0.41-32.1
Logs:
Log collector container in the pod with the restart loop:
{"log":"2025-08-18T13:52:15.678791Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:16.677708Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:17.177883Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.177639Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:18.677776Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:19.677904Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:20.178119Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.177908Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: certificate verify failed: self-signed certificate\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180756Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180790Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.180931Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout)\n\t at ../../../../percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:21.181000Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():256: Failed to open backend connection: -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181151Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181219Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181364Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1952: Failed to open channel 'mysql-pxc-db-pxc' at 'gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc': -110 (Connection timed out)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181382Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Operation timed out\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181395Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc) failed to establish connection with cluster (reason: 7)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181418Z 0 [ERROR] [MY-010119] [Server] Aborting\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.181771Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1)  Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version 26.1.4.3.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182743Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182939Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.182971Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.186036Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.189065Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192222Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192260Z 0 [Note] [MY-000000] [Galera] cert index usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192298Z 0 [Note] [MY-000000] [Galera] cert trx map usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192308Z 0 [Note] [MY-000000] [Galera] deps set usage at exit 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192323Z 0 [Note] [MY-000000] [Galera] avg deps dist 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192333Z 0 [Note] [MY-000000] [Galera] avg cert interval 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192342Z 0 [Note] [MY-000000] [Galera] cert index size 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192407Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192481Z 0 [Note] [MY-000000] [Galera] wsdb trx map usage 0 conn query map usage 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192496Z 0 [Note] [MY-000000] [Galera] MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.192636Z 0 [Note] [MY-000000] [Galera] Shifting CLOSED -> DESTROYED (TO: 0)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2025-08-18T13:52:22.193683Z 0 [Note] [MY-000000] [Galera] Flushing memory map to disk...\n","file":"/var/lib/mysql/mysqld-error.log"}
PXC container in the pod with the restart loop:
Cluster address set to: mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
8.0.41-32.1
[mysqld]
pxc-encrypt-cluster-traffic=ON
ssl-ca=/etc/mysql/ssl-internal/ca.crt
ssl-key=/etc/mysql/ssl-internal/tls.key
ssl-cert=/etc/mysql/ssl-internal/tls.crt
wsrep_provider_options="pc.weight=10"
wsrep_sst_donor=mysql-pxc-db-pxc-1,
log-error=/var/lib/mysql/mysqld-error.log
log_error_suppression_list="MY-010055"
admin-address=10.1.0.144
authentication_policy=caching_sha2_password,,
skip_replica_start=ON
wsrep_notify_cmd=/var/lib/mysql/wsrep_cmd_notify_handler.sh
enforce-gtid-consistency
gtid-mode=ON
plugin_load="binlog_utils_udf=binlog_utils_udf.so"
datadir=/var/lib/mysql
socket=/tmp/mysql.sock
skip-host-cache
coredumper
server_id=42123722
binlog_format=ROW
default_storage_engine=InnoDB
innodb_flush_log_at_trx_commit  = 2
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_autoinc_lock_mode=2
bind_address = 0.0.0.0
wsrep_slave_threads=2
wsrep_cluster_address=gcomm://mysql-pxc-db-pxc-0.mysql-pxc-db-pxc,mysql-pxc-db-pxc-1.mysql-pxc-db-pxc
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_name=mysql-pxc-db-pxc
wsrep_node_address=10.1.0.144
wsrep_node_incoming_address=mysql-pxc-db-pxc-2.mysql-pxc-db-pxc.default.svc.cluster.local:3306
wsrep_sst_method=xtrabackup-v2
[client]
socket=/tmp/mysql.sock
[sst]
cpat=.*\.pem$\|.*init\.ok$\|.*galera\.cache$\|.*wsrep_recovery_verbose\.log$\|.*readiness-check\.sh$\|.*liveness-check\.sh$\|.*get-pxc-state$\|.*sst_in_progress$\|.*sleep-forever$\|.*pmm-prerun\.sh$\|.*sst-xb-tmpdir$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$\|.*pxc-entrypoint\.sh$\|.*unsafe-bootstrap\.sh$\|.*pxc-configure-pxc\.sh\|.*peer-list$\|.*auth_plugin$\|.*version_info$\|.*mysql-state-monitor$\|.*mysql-state-monitor\.log$\|.*notify\.sock$\|.*mysql\.state$\|.*wsrep_cmd_notify_handler\.sh$
progress=1
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ [[ -z node:10-1-0-242.mysql-pxc-db-pxc-unready.default.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary ]]
+ test -e /opt/percona/hookscript/hook.sh
+ init_opt=
+ [[ -f /etc/mysql/init-file/init.sql ]]
There are a lot more logs related to this, but the first problem to solve is why the node won’t reconnect; deleting the pod and the underlying PVC doesn’t help. Once the reconnection issue is resolved, the source of the original crash still needs to be identified.
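For reference, the pod/PVC deletion mentioned above amounts to roughly the following; the PVC name here is an assumption based on the operator's usual datadir-<pod-name> naming and may differ in this cluster:

# Remove the looping member and its data volume so it rejoins via a full state transfer
# (PVC name is assumed from the default "datadir-<pod-name>" convention)
kubectl delete pod mysql-pxc-db-pxc-2
kubectl delete pvc datadir-mysql-pxc-db-pxc-2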
Expected Result:
Pod to come back and rejoin the cluster without any fuss
Actual Result:
Pod goes into a reboot loop
Additional Information:
It seems the restart loop is caused by the pod needing to resync its data to catch up with the cluster, but this takes longer than the time the pod has to rejoin: the liveness and readiness checks start failing, the container is restarted before the sync can finish, and the cycle repeats.
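If that hypothesis is right, one mitigation worth trying is to relax the probe timings through the PerconaXtraDBCluster custom resource rather than editing the pod directly. This is only a sketch: it assumes the CR is named mysql-pxc-db and that the operator version in use supports the livenessDelaySec/readinessDelaySec fields shown in the operator's example cr.yaml.

# Give a rejoining node more headroom before probe failures restart it
# (CR name and field support are assumptions; verify against your operator version)
kubectl patch pxc mysql-pxc-db --type=merge \
  -p '{"spec":{"pxc":{"livenessDelaySec":600,"readinessDelaySec":60}}}'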
