IST or SST interrupted

Hi,

I have a 3-node cluster, each node is in a different datacenter.

Sometimes connectivity is intermittent between two nodes. When this happens, one node goes into recovery while a second becomes the donor. The remaining node stays online as expected.

The problem is that if the connectivity breaks again while the donor is repairing the failed node, all 3 nodes seem to hang in their respective states. That is, the donor never leaves donor mode, and the failed node is never recovered.

The only solution I have found is to kill the donor manually (a ‘mysql stop’ just hangs) and then recover the donor first. But that means the entire cluster is effectively down, because the last remaining node then becomes a donor itself. Obviously this is not ideal.

So, the question is: what is the expected behaviour under these conditions? I would expect that if a recovery is interrupted, the donor would bring itself back online, and the recovery of the failed node would begin again once connectivity is restored. Is that correct?

Did you check for the presence of an xtrabackup process on the donor while its state transfer is interrupted? Also, the donor is still operational, just in “donor/desynced” state, right? In that case, IMHO there is no need to kill it; killing the backup processes instead should help. Btw, which version are you using?
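
If it helps, here is a rough Python sketch for spotting leftover backup processes on the donor. It only lists candidates and does not kill anything. The process-name fragments in `SST_PROCESS_HINTS` are my assumptions for an xtrabackup-based SST (the backup tool, the SST script, and a `socat` transport), and the function names are made up for illustration, so adjust them to whatever your setup actually runs.

```python
import os
import subprocess

# Assumed name fragments for processes spawned by an xtrabackup-based SST.
# These are guesses for illustration; adjust them to match your environment.
SST_PROCESS_HINTS = ("xtrabackup", "innobackupex", "wsrep_sst", "socat")

# Never report the server itself, even if its arguments mention xtrabackup.
SERVER_BINARIES = ("mysqld", "mysqld_safe")


def find_sst_processes(ps_output, hints=SST_PROCESS_HINTS):
    """Return (pid, command) pairs from `ps -eo pid,args` output that look SST-related."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header line and anything that is not "<pid> <command>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1]
        executable = os.path.basename(command.split()[0])
        if executable in SERVER_BINARIES:
            continue
        if any(hint in command for hint in hints):
            matches.append((pid, command))
    return matches


def list_local_sst_processes():
    """Run `ps` on this host and return the SST-looking processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    return find_sst_processes(result.stdout)
```

On the donor you would call `list_local_sst_processes()` and look through the result before deciding what to kill by hand. To confirm the node state, `SHOW STATUS LIKE 'wsrep_local_state_comment';` on the donor should report the donor/desynced state while the transfer is stuck.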