Full Cluster Crash During Network Failover – Same Behavior on AWS and On-Prem

Hi Percona team,

We are facing a critical issue with Percona XtraDB Cluster (PXC) version 8.0.41-32.1 in both on-prem and AWS environments.

Infrastructure:
3-node Kubernetes cluster, version v1.32.5, deployed using Kubespray
Storage backend: Ceph (Rook)
PXC deployed via Helm chart: percona/pxc-db, version 1.17.0
PXC Operator: percona/pxc-operator, version 1.17.0

Issue Description:
When simulating a failover by disconnecting the network interface (NIC) of a node that runs one of the PXC pods:
All pods in the cluster crash, including those running on healthy nodes
The database enters a full PXC cluster crash recovery state
This issue occurs consistently in both AWS and on-premises environments

Log Sample:
2025-07-16T15:08:07.757324Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.41-32.1) Percona XtraDB Cluster (GPL), Release rel32, Revision 9cd31bf, WSREP version…

#####################################################FULL_PXC_CLUSTER_CRASH:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local node with sequence number (seqno): 67161
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n pxc exec my-db-pxc-db-pxc-0 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:my-db-pxc-db-pxc-0.my-db-pxc-db-pxc.pxc.svc.cluster.local:67161:#####################################################
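If automatic recovery is disabled, the seqno comparison described in the banner can be scripted. The following is a minimal sketch, assuming each pod's log has been saved to a file and contains a `LAST_LINE:<host>:<seqno>:` marker in the format shown above. The function name `highest_seqno` is illustrative and not part of any Percona tooling.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin, extracts every
# "LAST_LINE:<host>:<seqno>:" marker, and prints "<seqno> <host>"
# for the node with the highest seqno.
highest_seqno() {
    # Turn each marker line into "<seqno> <host>"; other lines are dropped.
    sed -n 's/.*LAST_LINE:\([^:]*\):\(-\{0,1\}[0-9][0-9]*\):.*/\2 \1/p' \
        | sort -k1,1n \
        | tail -n 1
}
```

For example, with the logs saved under hypothetical file names, `cat pxc-0.log pxc-1.log pxc-2.log | highest_seqno` prints the seqno and host of the node that should be used for manual recovery.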

Questions:
Why does disconnecting a single NIC trigger a total cluster crash?
Is there a configuration or HA best practice that can help isolate the failure to just the affected node?
How can we ensure PXC behaves resiliently under node-level network interruptions?

Can you confirm whether only a single PXC pod was scheduled on that node, and that no other PXC pods or PXC Operator components were running there? It appears that the cluster may have lost quorum entirely. As a first step, we recommend reviewing the error logs of all three PXC pods to identify the exact sequence of events leading up to the full cluster crash.
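The checks suggested here could be run roughly as follows. This is a sketch, not an official procedure: the namespace `pxc` and the pod names `my-db-pxc-db-pxc-0` through `my-db-pxc-db-pxc-2` are assumptions inferred from the log banner in the question, and the function names are illustrative.

```shell
#!/bin/sh
# Assumed values, inferred from the log banner in the question; adjust as needed.
NS="${NS:-pxc}"
CLUSTER="${CLUSTER:-my-db-pxc-db}"

# List every pod in the namespace with the Kubernetes node it runs on, to
# confirm that only one PXC pod (and no operator pod) was on the affected node.
show_pod_placement() {
    kubectl -n "$NS" get pods -o wide
}

# Save the current and previous container logs of each PXC pod into directory $1.
collect_pxc_logs() {
    outdir="${1:-./pxc-logs}"
    mkdir -p "$outdir"
    for i in 0 1 2; do
        pod="$CLUSTER-pxc-$i"
        kubectl -n "$NS" logs "$pod" -c pxc \
            > "$outdir/$pod.log" 2>&1 \
            || echo "warning: could not fetch logs for $pod" >&2
        # --previous returns the log of the last terminated container, if there is one.
        kubectl -n "$NS" logs "$pod" -c pxc --previous \
            > "$outdir/$pod.previous.log" 2>&1 \
            || echo "warning: no previous container log for $pod" >&2
    done
}

# With equal node weights (the default), Galera keeps a primary component
# only while the reachable nodes form a strict majority: 2 * alive > total.
has_quorum() {
    alive="$1"
    total="$2"
    [ $((2 * alive)) -gt "$total" ]
}
```

Note that `has_quorum 2 3` succeeds: two surviving nodes out of three still form a majority, so isolating one node should not by itself cost the other two their quorum. If all three pods went down, the collected logs should show what else happened, for example more than one PXC pod on the disconnected node or the remaining nodes losing contact with each other.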