Full Cluster Crash During Network Failover – Same Behavior on AWS and On-Prem

Israel_Vinitzer · July 16, 2025, 4:50pm

Hi Percona team,

We are facing a critical issue with Percona XtraDB Cluster (PXC) version 8.0.41-32.1 in both on-prem and AWS environments.

Infrastructure:
3-node Kubernetes cluster, version v1.32.5, deployed using Kubespray
Storage backend: Ceph (Rook)
PXC deployed via Helm chart: percona/pxc-db, version 1.17.0
PXC Operator: percona/pxc-operator --version 1.17.0

Issue Description:
When simulating a failover by disconnecting the network interface (NIC) of a node that runs one of the PXC pods:
All pods in the cluster crash, including those running on healthy nodes
The database enters a full PXC cluster crash recovery state
This issue occurs consistently in both AWS and on-premises environments

Log Sample:
2025-07-16T15:08:07.757324Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1) Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version…

#####################################################FULL_PXC_CLUSTER_CRASH:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local node with sequence number (seqno): 67161
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n pxc exec my-db-pxc-db-pxc-0 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local:67161:#####################################################
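The recovery banner above asks you to compare sequence numbers across all pods. As a rough sketch of that comparison, the Python below parses the LAST_LINE marker from each pod's log text and picks the highest seqno. The regular expression is inferred only from the single marker line shown above, and the entries for pxc-1 and pxc-2 are hypothetical placeholders, not values from this cluster. In practice the log text would come from something like `kubectl -n pxc logs <pod> -c pxc` for each pod.

```python
import re

# Pattern inferred from the LAST_LINE marker in the log sample above:
#   ####...LAST_LINE:<host>:<seqno>:####...
# This is an assumption based on that one line, not a documented format.
LAST_LINE = re.compile(r"#+LAST_LINE:(?P<host>[^:#]+):(?P<seqno>-?\d+):#+")


def parse_last_line(log_text):
    """Return (host, seqno) from a pod log, or None if no marker is present."""
    match = LAST_LINE.search(log_text)
    if match is None:
        return None
    return match.group("host"), int(match.group("seqno"))


def most_advanced(logs):
    """Pick the (host, seqno) pair with the highest seqno among the given logs."""
    parsed = [p for p in (parse_last_line(text) for text in logs) if p is not None]
    if not parsed:
        raise ValueError("no LAST_LINE marker found in any log")
    return max(parsed, key=lambda pair: pair[1])


def marker(host, seqno):
    """Build a marker line in the same shape as the one in the log sample."""
    return "#" * 53 + f"LAST_LINE:{host}:{seqno}:" + "#" * 53


SUFFIX = "my-db-pxc-db-pxc.pxc.svc.cluster.local"
sample_logs = [
    marker(f"my-db-pxc-db-pxc-0.{SUFFIX}", 67161),  # value from the log above
    marker(f"my-db-pxc-db-pxc-1.{SUFFIX}", 67158),  # hypothetical value
    marker(f"my-db-pxc-db-pxc-2.{SUFFIX}", 67160),  # hypothetical value
]

host, seqno = most_advanced(sample_logs)
print(f"recover from {host} (seqno {seqno})")
```

If a pod's log has no marker at all, it is skipped rather than treated as seqno 0, so a missing log does not silently influence the choice.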

Questions:
Why does disconnecting a single NIC trigger a total cluster crash?
Is there a configuration or HA best practice that can help isolate the failure to just the affected node?
How can we ensure PXC behaves resiliently under node-level network interruptions?

Abhinav_Gupta · July 18, 2025, 3:42pm

Can you confirm whether only a single PXC pod was scheduled on that node, and that no other PXC pods or PXC Operator components were running there? It appears that the cluster may have lost quorum entirely. As a first step, we recommend reviewing the error logs of all three PXC pods to identify the exact cause leading up to the full cluster crash.
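
To make the quorum point concrete, here is a minimal sketch of the majority rule, assuming all nodes carry equal weight (the function name is made up for illustration). Galera keeps a partition primary only if it holds strictly more than half of the votes, so in a 3-node cluster losing one node should still leave a working majority of two. A full crash after a single NIC disconnect therefore suggests that more than one PXC pod became unreachable at the same time.

```python
def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    """Return True if the reachable partition holds a strict majority.

    Assumes every node has the same vote weight; weighted setups differ.
    """
    return 2 * reachable_nodes > total_nodes


# Walk through every partition size for a 3-node cluster.
for reachable in range(4):
    status = "quorum" if has_quorum(reachable, 3) else "no quorum"
    print(f"{reachable}/3 nodes reachable: {status}")
```

Note that an exact half (for example 2 of 4) does not count as a majority, which is one reason odd cluster sizes are usually preferred.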
