Disaster Recovery techniques for xtradb cluster

mandm · March 17, 2014, 3:39pm

Hi, I would like to know what disaster recovery techniques people use when running an XtraDB Cluster.
I have a large deployment of these clusters in production and am always worried that if 2 of the 3 nodes crash, the only way for me to recover is to take downtime while node 2 is synced and node 3 becomes a donor.

What have been some of your experiences, and how have you tackled them?

przemek · March 23, 2014, 4:17pm

Hi,
Losing 2 out of 3 nodes is indeed a kind of disaster, since the one left will treat this as a potential split brain and therefore go non-primary.
But this seems to be more about HA (high availability) than about disaster recovery. A disaster, for me, means some weird state such as a real split brain or data inconsistency, but with InnoDB+Galera the chance of that is pretty low. Recovering from such a state usually means finding the most advanced node and doing a full sync from it to the rest of the nodes (and fixing the root cause, if you know it).
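As a rough illustration of "finding the most advanced node": each Galera node saves its position in a grastate.dat file in the data directory, and the node with the highest seqno for the same cluster uuid is the usual candidate. The sketch below only compares file contents you have already collected; the node names and sample values are hypothetical, and a seqno of -1 means the position was not saved cleanly and has to be recovered first (for example with mysqld --wsrep-recover).

```python
# Sketch: pick the node whose saved Galera state has the highest seqno.
# Node names and sample file contents are made up for illustration.

def parse_grastate(text):
    """Return (uuid, seqno) parsed from the text of a grastate.dat file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields["uuid"], int(fields["seqno"])


def most_advanced(states):
    """states maps node name -> grastate.dat text; return the best candidate.

    Nodes with seqno -1 are skipped because their real position is unknown
    until it has been recovered. The remaining nodes must all report the
    same cluster uuid, otherwise their seqnos are not comparable.
    """
    parsed = {name: parse_grastate(text) for name, text in states.items()}
    known = {name: s for name, s in parsed.items() if s[1] >= 0}
    if not known:
        return None
    uuids = {uuid for uuid, _ in known.values()}
    if len(uuids) != 1:
        raise ValueError("nodes report different cluster uuids")
    return max(known, key=lambda name: known[name][1])


if __name__ == "__main__":
    template = "# GALERA saved state\nversion: 2.1\nuuid:    {}\nseqno:   {}\n"
    cluster = "00000000-0000-0000-0000-000000000001"  # made-up uuid
    states = {
        "node1": template.format(cluster, 1500),
        "node2": template.format(cluster, 1420),
        "node3": template.format(cluster, -1),  # crashed, position unknown
    }
    print(most_advanced(states))  # node1
```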
Then the question is how to improve HA. Basically, lower the chance of losing quorum: place each node in a different blade/rack/power circuit/etc., have enough nodes (use garbd arbitrator nodes to achieve that at the lowest cost), etc.
Then, in case any node(s) go down, increase the chance that such a node will re-join using IST (incremental state transfer) rather than SST (full state snapshot transfer):
http://www.mysqlperformanceblog.com/2014/01/08/finding-good-ist-donor-percona-xtradb-cluster-5-6/
http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/
You can also adjust the quorum settings if there are any less reliable or less important nodes in the cluster:
http://www.codership.com/wiki/doku.php?id=weighted_quorum
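To make the quorum point concrete, here is a small sketch of the weighted quorum rule as I understand it from the Galera documentation: a partition stays primary when the summed weight of its members is more than half of the weight of the last primary component, not counting members that left gracefully. The function and node names are hypothetical, and the weights stand for each node's pc.weight setting (default 1).

```python
# Sketch of the weighted quorum check. Names are illustrative only.

def has_quorum(weights, last_primary, current, left_gracefully=()):
    """Return True if the `current` partition keeps primary status.

    weights         -- dict of node name -> pc.weight
    last_primary    -- members of the last known primary component
    current         -- members that can still see each other
    left_gracefully -- members known to have shut down cleanly
    """
    reference = sum(weights[n] for n in last_primary)
    reference -= sum(weights[n] for n in left_gracefully)
    mine = sum(weights[n] for n in current)
    # Strict majority: exactly half of the weight is not enough.
    return 2 * mine > reference


if __name__ == "__main__":
    equal = {"node1": 1, "node2": 1, "node3": 1}
    members = ["node1", "node2", "node3"]
    # Two of three nodes crash: the survivor holds 1 of 3 votes.
    print(has_quorum(equal, members, ["node1"]))           # False
    # One node crashes: the remaining pair holds 2 of 3 votes.
    print(has_quorum(equal, members, ["node1", "node2"]))  # True
    # Giving node1 more weight than the other two combined lets it survive alone.
    heavy = {"node1": 3, "node2": 1, "node3": 1}
    print(has_quorum(heavy, members, ["node1"]))           # True
```

A garbd arbitrator can be modelled here as one more entry in `weights`: it votes, but holds no data.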

mandm · March 29, 2014, 10:26pm

Thanks przemek for the links. Yes, I meant more a recovery from a total cluster failure: a scenario where node 2 and node 3 are down and node 1 is the only one active.
To recover node 2 from node 1 we might need an SST. How would someone do that without accepting a few hours of downtime?

I will check those links now…

przemek · April 23, 2014, 3:13am

If node1 stays alive while the two other nodes go down, there is still a pretty good chance those nodes can re-join via IST. If node1 also goes into the non-primary state, you only need to tell it that it is the primary component before joining the other nodes. The command to achieve this:
SET GLOBAL wsrep_provider_options='pc.bootstrap=true';

Also, SST does not necessarily mean complete downtime: the primary node acting as donor can still serve queries, though it may be too slow to handle the required workload. You can also restore the other nodes from the last backup, and then IST may still be possible if the gcache on the primary node holds all the transactions since that backup. Details are in the link I posted before: http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/
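A rough way to reason about whether a node (restored from backup or simply restarted) can still catch up by IST: compare its saved seqno with the oldest write-set the donor still keeps in gcache, which PXC 5.6 exposes as the wsrep_local_cached_downto status variable. The sketch below is only a model of that comparison. The function name and arguments are hypothetical, and the exact boundary (the joiner needs everything from its seqno + 1 onward) is my assumption, so treat a borderline result with caution.

```python
# Sketch: estimate whether a joiner could use IST instead of SST.
# All names here are illustrative; values would come from grastate.dat on the
# joiner and from SHOW GLOBAL STATUS LIKE 'wsrep_%' on the donor.

def can_use_ist(joiner_uuid, joiner_seqno, donor_uuid,
                donor_cached_downto, donor_last_committed):
    """Return True if the donor's gcache appears to cover the joiner's gap."""
    if joiner_uuid != donor_uuid:
        return False  # different cluster history, full SST required
    if joiner_seqno < 0:
        return False  # joiner position unknown (unclean shutdown)
    if joiner_seqno > donor_last_committed:
        return False  # joiner claims to be ahead of the donor, do not trust it
    # Assumption: the joiner needs write-sets joiner_seqno + 1 and later,
    # so the oldest cached write-set must not be newer than that.
    return donor_cached_downto <= joiner_seqno + 1


if __name__ == "__main__":
    cluster = "00000000-0000-0000-0000-000000000001"  # made-up uuid
    # Backup taken at seqno 9000; the donor still caches 8500 onward.
    print(can_use_ist(cluster, 9000, cluster, 8500, 12000))  # True
    # Older backup at seqno 7000: those write-sets have left the gcache.
    print(can_use_ist(cluster, 7000, cluster, 8500, 12000))  # False
```

If the check fails for the backups you normally keep, a larger gcache.size on the potential donors is the setting to look at.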
