MySQL stops handling requests when restarting mysql on other nodes --- donor/desync

In our cluster, a node will experience an issue from time to time. When this happens, nodes 2 and 3 crash, resulting in:

ERROR! MySQL (Percona XtraDB Cluster) is not running, but PID file exists

If I restart mysql on the failed nodes, our Node 1 will no longer service mysql requests. Node 1 will show

| wsrep_local_state_comment | Donor/Desynced |

until Node 2 and Node 3 receive updates. After this, MySQL is OK.
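
For reference, that status line comes from checking node 1 with:

mysql> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';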

This is a problem because I have to wait until late at night to restart nodes 2 and 3 so that our website can keep functioning during the day.

I think you are having a split-brain situation.
The rule is that after any kind of failure, a Galera node will
consider itself part of the primary partition if it can still see a
majority of the nodes that were in the cluster before the failure.
Majority > 50%.
So if you have 3 nodes and one goes away, the 2 remaining are fine.
If you have 3 nodes and 2 go away simultaneously, the 1 remaining must
assume it is the one having problems and will go offline.
If you have 3 nodes and 1 goes away, then you have a 2-node cluster.
This is not good. Now if any 1 node goes away, the other one alone is
not in the majority, so it will have to go offline. The same is true
if you have a 4-node cluster and simultaneously lose 2 nodes. Etc…
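You can usually see which side of the partition a node thinks it is
on by checking the wsrep status variables; a node that has lost
quorum reports non-Primary and a size smaller than the full cluster:
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';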
But all is not lost. The node is still there, and if you as a human
being know it is the right thing to do, then you can run some manual
command to re-activate that node again (such as the command given by
Haris, or just restart, etc…).

There was a whole article on "unknown command" errors and split-brain situations on the Percona site, but I can't seem to find it.
In order to restore the cluster, execute the command below on the working node. It will re-establish that node as the primary component, and when you restart the previously crashed nodes they should hopefully rejoin the cluster.
mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=true';
(You can tell a node has dropped out of the primary component because its wsrep_ready variable is set to 0.)
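After running the bootstrap on the surviving node, wsrep_ready should go back to 1/ON and the "unknown command" errors should stop; you can check it with:
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_ready';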

Hey zmahomedy,
Thanks for the reply. Sorry, I worded it incorrectly.

Node 1 always works, serving mysql queries (r/w), until I restart the 2 dead nodes. The restart of the dead nodes is what prompts the good node to temporarily go offline while it acts as donor to bring them back in sync. After a full SST is sent, I have a 3-node cluster once again.
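
In case it is relevant, how much the donor is tied up during the transfer depends on the SST method in use (the rsync method blocks the donor for the whole transfer, while the xtrabackup-based method largely does not); it can be checked with:

mysql> SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';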

The question is why node2 and node3 are crashing. Usually, when a node is restarted, a full SST is not needed, just a fast IST. So your nodes are perhaps being shut down due to some inconsistency. The answer should be in their error logs.
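
If you are not sure where the error log lives on the crashed nodes, the server can tell you; it is also worth looking at the seqno recorded in grastate.dat in the data directory, since a value of -1 there usually means an unclean shutdown, which is what forces a full SST instead of IST:

mysql> SHOW GLOBAL VARIABLES LIKE 'log_error';
mysql> SHOW GLOBAL VARIABLES LIKE 'datadir';
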

OK I think I have located the cuplrit. It looks like converting the MyISAM tables to InnoDB has caused the other nodes to crash. We have haproxy hitting node 1 primarily so it hits node 1 and then syncs the changes to node 2 and 3. The nodes 2 and 3 break after this and require a full SST. I made a new post here detailing the issue.