Hi,
I have a 3-node cluster, with each node in a different datacenter.
Sometimes connectivity is intermittent between two nodes. When this happens, one node goes into recovery while a second becomes the donor. The remaining node stays online as expected.
The problem is that if the connectivity breaks again while the donor is repairing the failed node, all 3 nodes seem to hang in their respective states. That is, the donor never leaves donor mode, and the failed node is never recovered.
The only solution I have found is to kill the donor manually (a ‘mysql stop’ just hangs) and then recover the donor first. That means the entire cluster is effectively down, because the last remaining node then becomes a donor itself. Obviously this is not ideal.
So, the question is: what is the expected behaviour under these conditions? I would expect that if a recovery is interrupted, the donor would bring itself back online, and the recovery of the failed node would begin again once connectivity is restored. Is that correct?
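
In case it helps anyone reproduce this, here is a rough sketch of how the hang described above could be detected from periodic state samples. It assumes a Galera-based cluster where each node reports its state through the wsrep_local_state_comment status variable (values such as "Synced", "Donor/Desynced" and "Joining"). That is an assumption on my part, and the function name, the sample format and the one-hour timeout are all hypothetical. It only classifies samples that have already been collected and does not talk to MySQL itself.

```python
from typing import Dict, List, Tuple

# Assumed state names, as reported by wsrep_local_state_comment on Galera.
TRANSIENT_STATES = {"Donor/Desynced", "Joining"}


def stuck_nodes(history: Dict[str, List[Tuple[float, str]]],
                timeout_s: float) -> List[str]:
    """Return nodes that have sat in a transient state longer than timeout_s.

    history maps a (hypothetical) node name to (unix_time, state) samples
    in chronological order.
    """
    stuck = []
    for node, samples in history.items():
        if not samples:
            continue
        last_time, last_state = samples[-1]
        if last_state not in TRANSIENT_STATES:
            continue
        # Walk backwards to find when the node entered its current state.
        entered = last_time
        for t, state in reversed(samples):
            if state != last_state:
                break
            entered = t
        if last_time - entered > timeout_s:
            stuck.append(node)
    return sorted(stuck)


if __name__ == "__main__":
    history = {
        "dc1": [(0, "Synced"), (10, "Donor/Desynced"), (4000, "Donor/Desynced")],
        "dc2": [(0, "Joining"), (4000, "Joining")],
        "dc3": [(0, "Synced"), (4000, "Synced")],
    }
    print(stuck_nodes(history, timeout_s=3600))
```

With samples like these, the donor and the joining node would be flagged while the third node would not, which matches what I see when the hang occurs.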