Percona Operator for MySQL based on Percona XtraDB Cluster: HA problems

Description:

Hello, I have 3 workers available. We run the Percona Operator for MySQL based on Percona XtraDB Cluster on Kubernetes, deployed with Helm, with 3 replicas for HA.
But when we disabled two workers, the cluster stopped responding. Although one worker was still available, HAProxy and PXC hung in this state:

xmysql-pxc-db-haproxy-0                          2/2     Terminating   5 (50m ago)   70m
xmysql-pxc-db-haproxy-1                          1/2     Running       8 (11s ago)   69m
xmysql-pxc-db-haproxy-2                          2/2     Terminating   5 (49m ago)   68m
xmysql-pxc-db-pxc-0                              3/3     Terminating   1 (46m ago)   48m
xmysql-pxc-db-pxc-1                              2/3     Running       4 (12m ago)   69m
xmysql-pxc-db-pxc-2                              3/3     Terminating   1 (46m ago)   47m

Is this the correct behavior of the cluster?
Is it possible to keep a 3-node cluster working after 2 of its nodes are removed, that is, on a single node, and would this harm the database?

Hey @Pavlo_Tkachenko,
I assume “disabled” and “Terminating” here mean something similar to kill -9? If that is the case, then killing 2 PXC nodes at the same time will certainly lead to this situation. The quorum count of a PXC cluster increases when a node connects and joins. The count only decreases on graceful shutdown of a node.

48m ago, your pxc-1 saw a count of 3/3 and all was good. 2m later, the count was 1/3. pxc-1 no longer has a majority of nodes functioning, so the cluster goes into a non-primary state. This is quite typical behavior of any cluster (i.e., loss of quorum / loss of majority = cluster shutdown).

If you want to make this single node begin working again, you must force it back online from this unstable state. Connect to MySQL within that pod and run

SET GLOBAL wsrep_provider_options='pc.bootstrap=true';

This will reset the quorum counter to 1 on this node and it will come back online.

Had you terminated just 1 of the 3, the cluster would have remained online, as 2/3 is a majority. Or, had you gracefully shut down pxc-0, then pxc-2, each shutdown would have decremented the quorum counter until it reached 1/1, and the cluster would have remained online.
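To make the arithmetic concrete, here is a minimal Python sketch of the counting rule as I described it above. It is a toy model only, not Galera's actual implementation (the real quorum calculation has more to it, node weights for instance), and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuorumModel:
    """Toy model of the quorum counter described in this thread (hypothetical names)."""
    expected: int  # nodes the cluster still counts as members
    alive: int     # nodes that are actually up

    def kill(self) -> None:
        # Abrupt loss (similar to kill -9): the node is gone,
        # but the expected member count is not decremented.
        self.alive -= 1

    def graceful_shutdown(self) -> None:
        # Clean leave: the node deregisters, so both counts drop.
        self.alive -= 1
        self.expected -= 1

    def is_primary(self) -> bool:
        # Strict majority: more than half of the expected members are up.
        return self.alive * 2 > self.expected


def two_killed() -> QuorumModel:
    cluster = QuorumModel(expected=3, alive=3)
    cluster.kill()
    cluster.kill()
    return cluster  # 1/3


def one_killed() -> QuorumModel:
    cluster = QuorumModel(expected=3, alive=3)
    cluster.kill()
    return cluster  # 2/3


def two_graceful() -> QuorumModel:
    cluster = QuorumModel(expected=3, alive=3)
    cluster.graceful_shutdown()
    cluster.graceful_shutdown()
    return cluster  # 1/1


if __name__ == "__main__":
    for name, scenario in [("two killed", two_killed),
                           ("one killed", one_killed),
                           ("two graceful shutdowns", two_graceful)]:
        c = scenario()
        state = "primary" if c.is_primary() else "non-primary"
        print(f"{name}: {c.alive}/{c.expected} -> {state}")
```

In this model, only the first scenario ends up non-primary, which matches what you saw on pxc-1.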

What you experienced is correct and expected behavior to protect the data and the cluster from errant writes and potential split-brain situations.

Thank you for your reply!