Hi all,
I encountered a strange problem with an installation of Percona XtraDB Cluster v. 5.7.
My configuration follows the standard installation guide: 3 nodes (all masters), with one node started as bootstrap at cluster startup time; HAProxy is configured in front of the cluster with a failover check and the "leastconn" balancing mode (as described here: [URL="https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/haproxy.html"]https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/haproxy.html[/URL]).
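For reference, my HAProxy backend looks roughly like this, following the linked howto (a sketch only: the backend name and the IP addresses are placeholders, not my real values):

[CODE]
# Backend sketch based on the Percona HAProxy howto linked above.
# "pxc-back" and the 192.168.0.x addresses are placeholders.
backend pxc-back
    mode tcp
    balance leastconn
    option httpchk            # health check against the clustercheck responder
    server node1 192.168.0.1:3306 check port 9200 inter 12000 rise 3 fall 3
    server node2 192.168.0.2:3306 check port 9200 inter 12000 rise 3 fall 3
    server node3 192.168.0.3:3306 check port 9200 inter 12000 rise 3 fall 3
[/CODE]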
My cluster works well, but we ran into a disaster scenario: for some reason node-1 (the bootstrapped node) had network problems, so HAProxy excluded it from the pool while node-2 and node-3 preserved all functionality as expected. And here is the problem: node-1 came back online and, for reasons unknown to me, if a write had occurred on the Primary Component (node-2 & node-3) in the meantime, it created another cluster, with the result that HAProxy, based on the "clustercheck" script, re-added node-1 to the "global cluster", causing a split-brain scenario with writes landing randomly on node-1 or on node-2/node-3.
Is this normal behavior? Is there a problem in my configuration? If the connectivity problem happens on node-2 or node-3, the cluster behaves as expected (node-2 goes offline for a few seconds, node-1 and node-3 keep the cluster active, some data is written to the cluster, node-2 reconnects, resyncs its data from a donor node, and finally node-2 rejoins the cluster ready to work).
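For what it's worth, this is how I checked that node-1 had really formed a separate cluster; these are standard Galera status variables, nothing specific to my setup. A node that has gone off on its own will still report "Primary", but with a smaller cluster size and only itself in the member list:

[CODE]
-- Run on each node and compare the results
SHOW STATUS LIKE 'wsrep_cluster_status';      -- "Primary" on a healthy component
SHOW STATUS LIKE 'wsrep_cluster_size';        -- 3 when all nodes are joined together
SHOW STATUS LIKE 'wsrep_incoming_addresses';  -- which members each component can see
[/CODE]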
Hi Kenn,
yes, node-1 is started in bootstrapping mode (as described in the official guide), but in my case it did not restart: it lost connectivity to the rest of the cluster, and when it came back online it created a new cluster instead of rejoining the remaining nodes. Is that a wrong configuration? How should it be set up to survive connectivity problems? I assumed that the bootstrapping configuration (for one node) is the standard way to start the cluster: why does the node create a new cluster when it comes back online after an accidental network problem? More generally, what would be the ideal configuration for a production environment to avoid problems of this type?
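For clarity, this is how I start the nodes on CentOS 7, exactly as the official guide describes (a sketch, assuming the standard PXC 5.7 systemd units):

[CODE]
# First node only, when the cluster is created:
systemctl start mysql@bootstrap.service

# node-2 and node-3 (and, as I understand it, node-1 on any later restart):
systemctl start mysql
[/CODE]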
Yes, all nodes can ping each other, and as for the error log, I will have to replicate the scenario first (stay tuned!).
What do you mean by "It is likely that you have to tweak grastate.dat"? My installation (3 CentOS 7 VMs with HAProxy + Keepalived + Percona XtraDB Cluster) follows the configuration in the official guide, where there is a bootstrap node and two other nodes, all with weight "1". Why, if
I unplug the bootstrap node from the network
write some data to the other two nodes
reconnect the bootstrap node to the network
does node-1 (the bootstrapper) create a new cluster (leading to the disastrous scenario in which you can write to 2 different clusters - a split-brain scenario)?
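Coming back to grastate.dat: as far as I understand, it is the small state file in the datadir that Galera reads at startup. For reference, this is roughly what it looks like on my nodes (the UUID and seqno below are invented, example values only):

[CODE]
# /var/lib/mysql/grastate.dat (example values only)
# GALERA saved state
version: 2.1
uuid:    6a1c9f3e-1111-2222-3333-444455556666
seqno:   -1
safe_to_bootstrap: 0
[/CODE]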
Is my Percona XtraDB Cluster, configured based on this guide - https://www.percona.com/doc/percona-xtradb-cluster/LATEST/howtos/centos_howto.html - in a multi-master configuration or not?
As I understand it, a multi-master cluster should avoid this particular (but not rare) situation.
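For completeness, the wsrep part of my my.cnf follows the linked CentOS howto and is roughly like this (a sketch: the cluster name, addresses and SST credentials are placeholders, not my real values):

[CODE]
[mysqld]
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_name=pxc-cluster
wsrep_cluster_address=gcomm://192.168.0.1,192.168.0.2,192.168.0.3
wsrep_node_name=pxc1
wsrep_node_address=192.168.0.1
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth=sstuser:passw0rd
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
[/CODE]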
My scenario is very similar to yours, but I use Ubuntu… you only use bootstrap to bring up the first node; after that it can start normally. Once the cluster is up you can stop any node and it will keep working as usual… even if it is the bootstrap node.
At least in my case, HAProxy relies on the clustercheck response to mark a node as active in the load balancer.
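In my setup I can see exactly what HAProxy sees by querying the clustercheck responder directly (the standard xinetd setup on port 9200; the IP below is just an example):

[CODE]
curl -i http://192.168.0.2:9200/
# HTTP/1.1 200 OK                   -> node is synced, HAProxy keeps it in the pool
# HTTP/1.1 503 Service Unavailable  -> node is not synced, HAProxy drops it
[/CODE]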
I unplug the bootstrap node from the network
HAProxy will notice this and send the requests to the next node.
write some data to the other two nodes
OK
reconnect the bootstrap node to the network
Percona will take care of replicating the data.
What do you mean by "then you can go up normally"? Nowhere in the official guide is it mentioned that I have to "disable" bootstrap mode on node-1 after starting the MySQL process. I assumed that, once started, node-1 is a "master" like the others.
Furthermore, in your scenario you skipped a step: after point 1 you must write some data to node-1, then write some data to the other two nodes, and only then try to reconnect node-1 to the network. This can happen when HAProxy does not notice in time that node-1 is "out of sync" (the result of the clustercheck command), or, for example, if there is a network error between the nodes BUT node-1 is still reachable on the network. Of course, I can adjust the HAProxy parameters to mitigate this problem, but this is exactly what happened during my tests.
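What I mean by adjusting the HAProxy parameters is tightening the health-check timing on the server lines, so an out-of-sync node gets dropped faster; something along these lines (the values are just an example, not a recommendation):

[CODE]
server node1 192.168.0.1:3306 check port 9200 inter 2000 rise 3 fall 2
[/CODE]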
It is important to note that this problem does not happen when the node that loses connectivity is node-2 or node-3 (in that case the cluster works as expected).