Hello!
I’ve installed version 1.10.0 of the Percona MySQL operator on a k8s cluster. The cluster is spread across 3 physical locations: two locations have 2 nodes each, and the third location holds the fifth node. I’ve configured manual PVs and a StorageClass, and I’ve set the StatefulSets of both haproxy and pxc to 5 pods each in cr.yaml (relevant excerpt below). My frontend application can connect to the PXC cluster and everything works fine. I can kill single pods and my frontend application still works; the deleted pod comes back up and rejoins the cluster. Great.
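For reference, the relevant part of my cr.yaml looks roughly like this (trimmed; the storage class name and storage size are placeholders for my manual PV setup):

```yaml
apiVersion: pxc.percona.com/v1-10-0
kind: PerconaXtraDBCluster
metadata:
  name: cluster1
spec:
  pxc:
    size: 5                    # one PXC pod per k8s node across the three sites
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: local-storage   # placeholder name for my manual PVs
        resources:
          requests:
            storage: 10Gi
  haproxy:
    enabled: true
    size: 5                    # one haproxy pod per node as well
```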
I want to test further realistic HA failure scenarios, e.g. all physical sites lose power, so the whole k8s cluster is gone. Power gets re-established, but only at two sites; the third site has an electrical fault that can’t be fixed easily.
When I now kill all PXC pods to simulate this unplanned downtime of the whole cluster with `k delete pod cluster1-pxc-{0..4}`, the following happens:
- all cluster1-pxc pods get terminated (as expected)
- the readiness state of all the cluster1-haproxy pods drops from 2/2 to 1/2 due to the unavailability of the PXC cluster (as expected)
- the cluster1-pxc pods get started one after another, starting with pod suffix zero (as expected)
…but here comes the problem: unless ALL (meaning 5/5) PXC pods are up and running again, haproxy won’t allow connections to the PXC cluster. In the scenario described above, one site has no electricity for now, so I am stuck with 3/5 ready PXC pods, the haproxy pods stay at 1/2 ready, and my frontend application no longer works, even though the PXC cluster itself is formed (see the check below).
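To double-check that the surviving PXC pods really do have quorum, I run something like this (the secret name may differ depending on secretsName in your cr.yaml):

```bash
# fetch the root password from the operator-managed secret (name may differ)
ROOT_PW=$(kubectl get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 -d)

# ask one of the surviving pods for the Galera cluster state
kubectl exec cluster1-pxc-0 -c pxc -- \
  mysql -uroot -p"$ROOT_PW" -e "SHOW STATUS LIKE 'wsrep_cluster%';"
```

This reports Primary with a cluster size of 3, so the Galera side has quorum; it’s only haproxy that keeps refusing connections.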
How do I tell haproxy to allow connections as soon as a quorum is reached in the PXC cluster?
Thank you for your input.
P.S.: Interestingly, when deploying the cr.yaml for the first time, haproxy starts serving connections as soon as PXC quorum is reached.
I also tried to play with the PodDisruptionBudget (sketch below), but alas, no improvement on that front either.
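The kind of thing I tried in cr.yaml looked roughly like this (illustrative values only):

```yaml
spec:
  pxc:
    podDisruptionBudget:
      maxUnavailable: 2        # e.g. tolerate losing one whole 2-node site
  haproxy:
    podDisruptionBudget:
      maxUnavailable: 1
```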