PXC cluster fails after single pod failure

Description:

Hello, I have a 3-node Kubernetes cluster running my application, a Rook-Ceph storage cluster, and a PXC operator and cluster.

During HA tests I found that the PXC cluster stops completely when I disconnect the network on one of the nodes (in my test, the node running a single PXC pod). The two remaining pods stop the mysqld process and stop accepting any traffic.
This is not the expected behavior, and it causes my application to stop functioning.
After reconnecting the node to the network, the cluster comes up normally without any intervention.
Any input will be appreciated. Thank you.

Steps to Reproduce:

  1. Verify that the cluster is running.
  2. Disconnect one of the Kubernetes nodes from the network (in this case it was running only the pxc-1 pod).
  3. Wait for some time; the cluster is still down.
  4. Reconnect the node to the network.
  5. The cluster is restored.
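For context on why this behavior looks wrong: Galera keeps the Primary Component as long as the surviving partition holds a strict majority of the previous membership. A minimal sketch of that rule (my own illustration, assuming default equal node weights, not Percona code):

```python
def stays_primary(total_nodes: int, nodes_lost: int) -> bool:
    """Galera keeps the Primary Component only if the surviving
    partition holds a strict majority of the previous membership
    (assuming equal pc.weight on every node)."""
    surviving = total_nodes - nodes_lost
    return surviving > total_nodes / 2

# Losing 1 of 3 nodes leaves a 2/3 majority, so the two remaining
# pods should keep serving traffic.
print(stays_primary(3, 1))  # True
print(stays_primary(3, 2))  # False: the minority goes non-Primary
```

So with 3 pods and one node lost, the remaining two should stay primary; shutting down mysqld on both survivors is what makes this look like a bug rather than a quorum loss.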

Version:

Operator version: 1.12.0
pxc-operator Helm version: 1.12.1
pxc-db Helm version: 1.12.0
percona-xtradb-cluster image: percona/percona-xtradb-cluster:5.7.43
kubernetes version: v1.25.12

Helm Values

pxc-operator-values.yaml.txt (268 Bytes)
pxc-values.yaml.txt (2.9 KB)

Logs:

See attached Log files
pxc-0.log (53.3 KB)
pxc-1.log (25.3 KB)
pxc-2.log (49.5 KB)
pxc-operator-pod.log (631.7 KB)

Expected Result:

The two remaining pods continue serving traffic as usual.

Actual Result:

Both remaining pods stop the mysqld process and stop accepting any traffic.

Hello @mkl262 ,

How do you disconnect the network from one of the nodes?
Also, which CNI driver do you use?

It seems that you might be breaking connectivity as a whole. I tried to reproduce it just now on GKE with Chaos Mesh, where I introduced network partitioning between the PXC nodes, and it all recovered just fine.
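For anyone who wants to repeat that test, a Chaos Mesh NetworkChaos manifest along these lines partitions one PXC pod from the rest (the names, namespace, and label selectors below are placeholders; adapt them to your deployment):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-partition        # placeholder name
  namespace: pxc             # placeholder namespace
spec:
  action: partition
  mode: all
  selector:
    pods:
      pxc:
        - cluster1-pxc-1     # placeholder: the pod to isolate
  direction: both
  target:
    mode: all
    selector:
      labelSelectors:
        app.kubernetes.io/component: pxc   # placeholder label
  duration: "5m"
```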

Hi,

I disconnect the nodes by disabling the network interface or by physically unplugging the connection.
I use the Flannel CNI.

I also wasn't able to reproduce this issue in a cloud environment; I tried AWS EKS and it worked as expected.

@mkl262 how did you run the test in EKS?
I may try to do it as well and see.

Hi,

I wasn't able to reproduce it in EKS; the cluster was able to recover after losing one of its nodes.
I also upgraded the operator to 1.14, MySQL to 8.0.28, and the RKE Kubernetes to 1.27.11, and the issue is still reproducible.