Here is the problem description:
We have four nodes in the cluster, two in one data center and two in another, and they are properly grouped by location. One of the nodes (Node 1.1) died overnight. The remaining three show synced status and the cluster works. When I tried to start the node back up, it chose to do an SST (state snapshot transfer) and selected its closest neighbor as donor:
Node 1.1: Node 1.1 requested state transfer from ‘any’. Selected 2.1 (SYNCED) as donor.
The problem is that the donor itself apparently had a different idea about that:
Node 2.1: Node 1.1 requested state transfer from ‘any’. Selected 3.2 (SYNCED) as donor.
and the whole process freezes.
For additional info, the remaining nodes seem to agree with the node requesting the transfer:
Node 3.2: Node 1.1 requested state transfer from ‘any’. Selected 2.1 (SYNCED) as donor.
Node 4.2: Node 1.1 requested state transfer from ‘any’. Selected 2.1 (SYNCED) as donor.
Is there any way to overcome this issue, for example by forcing Node 1.1 to use 3.2 as donor?
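To make the question concrete, this is roughly the kind of setting I have in mind on the joining node. It is only a sketch: `wsrep_sst_donor` is the Galera option for naming a preferred donor, but `node32` is a placeholder I made up for whatever `wsrep_node_name` Node 3.2 actually has, and I have not confirmed that this avoids the disagreement shown in the logs above.

```ini
# my.cnf on Node 1.1 (the joiner); a sketch, not a verified fix
[mysqld]
# Assumption: "node32" stands in for the wsrep_node_name configured on Node 3.2.
# The value must match the donor's wsrep_node_name, not its hostname or IP.
wsrep_sst_donor = node32
```

As far as I understand, the same option can also be passed once on the command line when starting the node (`--wsrep_sst_donor=node32`) instead of being set permanently in the config file.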