Scenario is this:
All nodes run XtraDB 5.5.28.
Site A has three nodes and a garbd arbitrator.
Site B has three nodes (so by design it is the minority; I do not think there is any way for it not to be).
There is gigabit networking with low latency between the sites.
The same application serves different users, who connect to either Site A or Site B.
Site A then suffers an unplanned failure: either the storage hardware for its nodes or the network between them fails.
I observe that the Site B nodes demote themselves: they report that they are in a minority and the cluster does not carry on working.
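To make the minority arithmetic concrete, here is a minimal sketch in Python. It assumes that every member, including the garbd arbitrator, carries one vote and that a partition stays Primary only with a strict majority of the last known membership. The function and constant names are made up for illustration and are not part of any Galera API.

```python
def has_quorum(reachable_votes: int, previous_votes: int) -> bool:
    """Return True if the reachable members form a strict majority.

    Assumption: every member (data node or arbitrator) has weight 1.
    """
    return 2 * reachable_votes > previous_votes


# Layout from the scenario: Site A has three nodes plus garbd, Site B has three nodes.
SITE_A_VOTES = 3 + 1
SITE_B_VOTES = 3
TOTAL_VOTES = SITE_A_VOTES + SITE_B_VOTES

# After an unplanned loss of Site A, Site B can only see its own three votes.
site_b_alone = has_quorum(SITE_B_VOTES, TOTAL_VOTES)

# If Site B were lost instead, Site A would still see four of seven votes.
site_a_alone = has_quorum(SITE_A_VOTES, TOTAL_VOTES)

print("Site B alone keeps quorum:", site_b_alone)
print("Site A alone keeps quorum:", site_a_alone)
```

Under these assumptions three votes out of seven is not a majority, which matches the Site B nodes reporting that they are in a minority.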
The salient wsrep variables will look like this:
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_provider_name | Galera |
| wsrep_ready | OFF |
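For anyone wanting to detect this state from a script, here is a rough Python sketch that reads rows in the format shown above and decides whether the node is usable. The helper names are hypothetical, and the rule applied (wsrep_cluster_status must be Primary and wsrep_ready must be ON) is an assumption based on the output above rather than an official check.

```python
def parse_status(rows: str) -> dict:
    """Turn '| name | value |' rows into a {name: value} dict."""
    status = {}
    for line in rows.splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0]:
            status[cells[0]] = cells[1]
    return status


def accepts_queries(status: dict) -> bool:
    """Assumed rule: node is usable only when Primary and ready."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_ready") == "ON"
    )


SAMPLE = """
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_provider_name | Galera |
| wsrep_ready | OFF |
"""

site_b_status = parse_status(SAMPLE)
print("Node usable:", accepts_queries(site_b_status))
```

With the values shown above, the check reports the node as not usable even though wsrep_connected is still ON.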
So you have to restart the Site B databases to get Site B working again. Or is there a simple command you can run? Having to restart the databases adds further delay to a failure situation that is already a problem.
Then time passes, say a couple of hours, so you know a quantity of new data has been added to the Site B nodes, and the Site A hardware is working again. With XtraDB 5.5.28, can you rely on IST/SST to resync the Site A nodes? In testing this seems to work if the Site B nodes were running as Primary when the Site A nodes are restarted. Or should you run rm -rf on the data dir on all the nodes in Site A and bring up one, then, once it is up and in sync with Site B, allow it to act as a donor for the other nodes in Site A?
If it is a planned outage (software upgrade, storage change, or any one of a zillion typical HA system tasks), should you start a garbd arbitrator on Site B first to stop the Site B nodes demoting themselves, or should you just turn off the garbd arbitrator on Site A before the planned failover?
If it were not planned, I can see two situations for the data on Site A: in one you KNOW the data in Site A is good, and in the other you are not quite so sure. What should you check for? (I understand table checksums etc.; I am looking for specific XtraDB checks.) Can XtraDB sort out any data discrepancies when it resyncs?
I can see that an rm -rf of the data dir on Site A also works, and that is fine if it were only a few GB, but let's say the db is a couple of TB and the restore would take over an hour. That is hardly HA, and do I REALLY need to do it?
Interested to hear views.