Do we trust IST to resync, or do we have to rm -rf the data dir when rejoining a partitioned cluster?

rdab100 · January 25, 2013, 5:46am

Scenario is this:

All nodes run XtraDB 5.5.28.

Site A has three nodes and a garbd arbitrator.
Site B has three nodes (so by design it is the minority; I do not think there is any way for it not to be).

There is gigabit networking with low latency between the sites.
The same application runs at both sites, with different users connecting to either Site A or Site B.

Suppose the storage for the Site A nodes suffers a hardware failure, or the network between them fails; either way it is an unplanned failure.

I observe that the Site B nodes demote themselves: they do not carry on working and they report that they are in a minority.
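To make the arithmetic behind that demotion concrete, here is a minimal sketch of the majority rule as I understand it. It assumes every member, including garbd, counts as exactly one vote; the function name `has_quorum` is my own and not part of any Galera API.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True if a partition holds a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


# Votes before the failure (assumption: one vote per member, garbd included).
SITE_A_VOTES = 3 + 1  # three nodes plus the garbd arbitrator
SITE_B_VOTES = 3      # three nodes, no arbitrator
TOTAL_VOTES = SITE_A_VOTES + SITE_B_VOTES

# Site B can reach only its own 3 of 7 votes, so it cannot stay primary.
site_b_stays_primary = has_quorum(SITE_B_VOTES, TOTAL_VOTES)
# Site A, if it were still running, would hold 4 of 7 votes.
site_a_stays_primary = has_quorum(SITE_A_VOTES, TOTAL_VOTES)
print(site_b_stays_primary, site_a_stays_primary)
```

Under this rule an even 3/3 split without garbd would leave neither site with a majority, which is presumably why the arbitrator sits at Site A in the first place.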

The salient wsrep variables will look like this:

So you have to restart the Site B databases to get Site B working again. Or is there a simple command you can run instead? Having to restart the databases adds further delay to a failure situation that is already a problem.

Then time passes, say a couple of hours, so you know a quantity of new data has been added to the Site B nodes, and the Site A hardware is working again. With XtraDB 5.5.28, can you rely on IST/SST to resync the Site A nodes? In testing this seems to work if the Site B nodes were running as primary when the Site A nodes were restarted. Or should you run rm -rf on the data dir on all the nodes in Site A and bring up one node, then, once it is up and in sync with Site B, allow it to act as a donor for the other nodes in Site A?

If it is a planned outage (software upgrade, storage change, or any one of a zillion typical HA system tasks), should you start a garbd arbitrator on Site B first to stop the Site B nodes demoting themselves, or should you just turn off the garbd arbitrator on Site A before the planned failover?

If it were not planned, I can see two situations for the data on Site A: one where you KNOW the data in Site A is good, and one where you are not quite so sure. What should you check for? (I understand table checksums etc.; I am looking for XtraDB-specific checks.) Can XtraDB sort out any data discrepancies when it resyncs?

I can see that an rm -rf of the data dir on Site A also works, and that is fine if it were only a few GB. But let's say the db is a couple of TB and the restore would take over an hour: that is hardly HA, and do I REALLY need to do it?
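As far as I understand it, whether a rejoining node gets IST rather than a full SST comes down to whether the donor still holds the write-sets the joiner is missing. The sketch below captures that reasoning under stated assumptions: `NodeState`, `ist_possible` and `donor_oldest_cached_seqno` are hypothetical names of mine, and the UUIDs and seqnos in the usage lines are made-up illustrative values, not output from a real cluster.

```python
from typing import NamedTuple


class NodeState(NamedTuple):
    uuid: str   # cluster state UUID the node last belonged to
    seqno: int  # last committed sequence number, -1 if unknown


def ist_possible(joiner: NodeState, donor: NodeState,
                 donor_oldest_cached_seqno: int) -> bool:
    """Rough test of whether an incremental transfer could cover the gap."""
    if joiner.uuid != donor.uuid:
        return False  # different cluster history: full SST needed
    if joiner.seqno < 0:
        return False  # position unknown (e.g. after a crash): full SST needed
    if joiner.seqno > donor.seqno:
        return False  # joiner is ahead of the donor: the histories diverged
    # The donor must still cache the first write-set the joiner is missing.
    return joiner.seqno + 1 >= donor_oldest_cached_seqno


# Illustrative values only.
donor = NodeState(uuid="cluster-1", seqno=50_000)
print(ist_possible(NodeState("cluster-1", 42_000), donor, 40_000))
print(ist_possible(NodeState("cluster-1", 42_000), donor, 45_000))
print(ist_possible(NodeState("cluster-1", -1), donor, 40_000))
```

If that is right, a couple of hours of writes on Site B could push the oldest cached write-set past the point where the Site A nodes stopped, in which case a full SST would happen anyway without any manual rm -rf.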

Interested to hear views?

rdab100 · January 28, 2013, 4:18am

Anyone from Percona care to comment?

rdab100 · February 6, 2013, 8:33am

Does nobody know or is this just a dumb question?

Dom
