Hi Percona team,
We are facing a critical issue with Percona XtraDB Cluster (PXC) version 8.0.41-32.1 in both on-prem and AWS environments.
Infrastructure:
3-node Kubernetes cluster, version v1.32.5, deployed using Kubespray
Storage backend: Ceph (Rook)
PXC deployed via Helm chart: percona/pxc-db, version 1.17.0
PXC Operator: percona/pxc-operator, version 1.17.0
Issue Description:
When we simulate a failover by disconnecting the network interface (NIC) of a Kubernetes node that runs one of the PXC pods:
All PXC pods crash, including those running on healthy nodes
The database enters the full PXC cluster crash recovery state
This happens consistently in both the AWS and on-premises environments
Log Sample:
2025-07-16T15:08:07.757324Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1) Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version…
#####################################################FULL_PXC_CLUSTER_CRASH:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local node with sequence number (seqno): 67161
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n pxc exec my-db-pxc-db-pxc-0 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local:67161:#####################################################
Questions:
Why does disconnecting a single NIC trigger a total cluster crash?
Is there a configuration or HA best practice that can help isolate the failure to just the affected node?
How can we ensure PXC behaves resiliently under node-level network interruptions?
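
For context on the first question: our expectation comes from the usual majority-quorum rule, under which a group of nodes stays primary only if it holds a strict majority of the previous cluster membership. The snippet below is a minimal sketch of that arithmetic, assuming all three nodes carry equal weight (Galera's default pc.weight of 1). The function name quorum_state is ours, for illustration only; it is not part of PXC or the operator.

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not a PXC or operator command).
# Usage: quorum_state REMAINING TOTAL
# Prints "primary" if REMAINING nodes are a strict majority of TOTAL
# equally weighted nodes, otherwise "non-primary".
quorum_state() {
    remaining=$1
    total=$2
    if [ $((remaining * 2)) -gt "$total" ]; then
        echo "primary"
    else
        echo "non-primary"
    fi
}

quorum_state 2 3   # the two nodes that can still reach each other
quorum_state 1 3   # the node whose NIC was disconnected
```

By this reasoning the two healthy nodes should keep quorum and only the isolated node should drop out, which is why the full cluster crash surprises us.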