PXC cluster fails after single pod failure

mkl262 · January 10, 2024, 6:34pm

Description:

Hello, I have a 3-node Kubernetes cluster running my application, a rook-ceph storage cluster, and a PXC operator and cluster.

During HA tests I found that the cluster stops completely when disconnecting the network from one of the nodes, which runs a single pxc pod. The two remaining pods stop the mysqld process and stop accepting any traffic.
It is not the expected behavior, and it causes my application to stop functioning.
After reconnecting the node to the network, the cluster comes up normally without any intervention.
Any input will be appreciated. Thank you.

Steps to Reproduce:

1. Verify that the cluster is running.
2. Disconnect one of the Kubernetes nodes from the network (in this case it was only running the pxc-1 pod).
3. Wait for some time; the cluster is still down.
4. Reconnect the node to the network.
5. The cluster is restored.

Version:

Operator version: 1.12.0
pxc-operator Helm version: 1.12.1
pxc-db Helm version: 1.12.0
percona-xtradb-cluster image: percona/percona-xtradb-cluster:5.7.43
kubernetes version: v1.25.12

Helm Values

pxc-operator-values.yaml.txt (268 Bytes)
pxc-values.yaml.txt (2.9 KB)

Logs:

See attached Log files
pxc-0.log (53.3 KB)
pxc-1.log (25.3 KB)
pxc-2.log (49.5 KB)
pxc-operator-pod.log (631.7 KB)

Expected Result:

The two remaining pods continue serving traffic as usual.
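
For context, this expectation follows from the majority quorum rule that Galera-based clusters such as PXC use: a partition keeps serving traffic only while it holds more than half of the votes of the last known membership. Below is a minimal sketch of that arithmetic, assuming every node has equal weight; the function name `has_quorum` is hypothetical and is not part of any Percona tool.

```python
def has_quorum(reachable_nodes: int, previous_members: int) -> bool:
    """Return True if a partition holds a strict majority of the previous membership.

    Assumption: all nodes carry equal weight (one vote each).
    """
    if previous_members <= 0:
        raise ValueError("previous_members must be positive")
    if not 0 <= reachable_nodes <= previous_members:
        raise ValueError("reachable_nodes must be between 0 and previous_members")
    # Strict majority: exactly half is NOT enough (avoids split-brain).
    return 2 * reachable_nodes > previous_members


if __name__ == "__main__":
    # 3-node cluster, one node cut off: the two survivors should keep quorum.
    print("2 of 3 nodes:", has_quorum(2, 3))
    # The isolated node on its own should lose quorum.
    print("1 of 3 nodes:", has_quorum(1, 3))
```

By this rule the two surviving pods in a 3-node cluster should remain primary, which is why the behavior described under Actual Result looks wrong.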

Actual Result:

Both remaining pods stop the mysqld process and stop accepting any traffic.

Sergey_Pronin · January 15, 2024, 10:10am

Hello @mkl262 ,

how do you disconnect the network from one of the nodes?
Also which CNI driver do you use?

It seems that you might be breaking connectivity as a whole. I just tried to reproduce it on GKE with Chaos Mesh, where I introduced network partitioning between PXC nodes, and it all recovered just fine.

mkl262 · February 13, 2024, 10:30am

Hi,

I disconnect the nodes by disabling the network interface, or by physically disconnecting the connection.
I use the flannel CNI.

I also wasn't able to reproduce this issue in a cloud environment. I tried in AWS EKS, and it worked as expected.

Sergey_Pronin · February 19, 2024, 10:00am

@mkl262 how did you reproduce it in EKS?
I may try to do it as well and see.

mkl262 · March 11, 2024, 9:25am

Hi,

I wasn't able to reproduce it in EKS; the cluster was able to recover after losing one of its nodes.
I also upgraded the operator to 1.14, MySQL to 8.0.28, and RKE Kubernetes to 1.27.11, and the issue is still reproducible.
