MariaDB/Galera: donor node stops responding when SST to a new node fails, and brings the cluster down

Just joined here. We’re experimenting with a Galera cluster on MariaDB, and while we were able to set up a 3-node cluster without issues, we got a bit perplexed while experimenting to gain some confidence.

So, we have a 3-node setup on

  • mariadb 10.5.10
  • galera cluster 4.6

and it’s working flawlessly.

  • We decided to add a 4th node in another datacenter for the purpose of evaluating latency. By mistake, the new node’s network was configured so that the node can connect to the cluster, but the cluster can’t connect to it.
  • The new node joins the cluster, and a donor node is selected for rsync SST.
  • SST fails because the rsync connection can’t be established and, for some reason, the donor node immediately sees the network as partitioned and shuts itself down. It ends up in a state where wsrep_ready=OFF and wsrep_connected=OFF: mariadb appears to be running, but the wsrep provider appears dead.
  • The new node is disconnected from the cluster, re-connects, a new donor is selected, and things repeat until the entire cluster is down and unresponsive.
  • Each node is left in a state where the mysql process can’t be terminated gracefully, so we need to kill each one and go through the process of bootstrapping the cluster.

That’s a bit unsettling. I understand I can fix the network issues, but, still…
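
The dead-provider state described in the list above (wsrep_ready=OFF and wsrep_connected=OFF while mariadb still runs) could be detected with a check along these lines. This is only a rough sketch: it assumes the status values have already been fetched, for example from SHOW GLOBAL STATUS LIKE 'wsrep_%', and classify_node and parse_status_rows are hypothetical helper names, not part of MariaDB or Galera.

```python
# Sketch: classify a Galera node from its wsrep status variables.
# The input is a plain dict built from (Variable_name, Value) rows.
# classify_node() and parse_status_rows() are hypothetical helpers.


def parse_status_rows(rows):
    """Turn (Variable_name, Value) rows into a dict."""
    return {name: value for name, value in rows}


def classify_node(status):
    """Return 'healthy', 'degraded', or 'provider_down'."""
    ready = status.get("wsrep_ready", "OFF").upper() == "ON"
    connected = status.get("wsrep_connected", "OFF").upper() == "ON"
    cluster_status = status.get("wsrep_cluster_status", "")
    state = status.get("wsrep_local_state_comment", "")

    if not ready and not connected:
        # The state from the post: the server is up, the provider is not.
        return "provider_down"
    if ready and connected and cluster_status == "Primary" and state == "Synced":
        return "healthy"
    # Anything else: donor, joiner, non-primary component, and so on.
    return "degraded"


if __name__ == "__main__":
    rows = [
        ("wsrep_ready", "OFF"),
        ("wsrep_connected", "OFF"),
        ("wsrep_cluster_status", "Disconnected"),
    ]
    print(classify_node(parse_status_rows(rows)))
```

Running something like this against each node in turn would at least show how far the failure has spread before the whole cluster is unresponsive.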

Hello @hobbes,
rsync is an outdated method for doing SST. You should switch to the xtrabackup-v2 method (on MariaDB, the corresponding method is mariabackup). Also, ensure you don’t have any firewall/iptables/etc. blocking ports 4444 and 4567.
The fact that the donor node shuts itself down after failing to send an SST is a clear bug and should be reported to MariaDB.
I would recommend that you try out Percona XtraDB Cluster 8 for a more reliable experience.
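
A quick way to check the firewall point above is to test TCP reachability of ports 4444 and 4567 from every node to every other node, since in this case the problem was one-directional. The following is a minimal sketch using only the Python standard library; port_open and check_peer are hypothetical helper names, and the peer addresses are whatever you pass on the command line. One caveat: port 4444 normally only has a listener on the joiner while a state transfer is in progress, so a refused connection there is not conclusive on its own.

```python
import socket
import sys

# Ports named in the reply above: 4444 (SST) and 4567 (group communication).
GALERA_PORTS = (4444, 4567)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peer(host, ports=GALERA_PORTS, timeout=3.0):
    """Map each port to True/False for a single peer."""
    return {port: port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Usage: run on each node, passing the addresses of all other nodes.
    for host in sys.argv[1:]:
        for port, ok in check_peer(host).items():
            print(host, port, "open" if ok else "blocked or closed")
```

Running it in both directions (old nodes to the new node, and the new node to the old ones) would show the asymmetric setup described in the question.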

@hobbes I concur with @matthewb.
In Percona XtraDB Cluster 8.0.23 we fixed a lot of bugs (see “Chaos Testing Leads to More Stable Percona XtraDB Cluster” on the Percona Database Performance Blog), which are likely still present in MariaDB.
I recommend you try our version.