Disaster recovery techniques for XtraDB Cluster

Hi, I would like to know which disaster recovery techniques people use when running an XtraDB Cluster.
I have a large deployment of these clusters in production, and I am always worried that if 2 of the 3 nodes crash, the only way for me to recover is to take downtime while node 2 is synced and node 3 acts as the donor.

What have your experiences been, and how have you tackled them?

Hi,
Losing 2 out of 3 nodes is indeed a kind of disaster, since the remaining node will treat this as a potential split brain and therefore go non-primary.
But this seems to be more about HA (high availability) than about disaster recovery. For me, a disaster means some weird state such as a real split brain or data inconsistency, but with InnoDB + Galera the chance of that is pretty low. Recovering from such a state usually means finding the most advanced node and doing a full sync from it to the rest of the nodes (and fixing the root cause, if you know it).
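As a rough illustration of "finding the most advanced node", here is a minimal Python sketch. It assumes you have already collected each node's last committed seqno yourself (for example from the wsrep_last_committed status variable or from grastate.dat) and that all nodes report the same cluster state UUID. The function name and the node names are hypothetical.

```python
def most_advanced_node(seqnos):
    """Pick the node with the highest seqno.

    seqnos maps node name -> last committed seqno. A negative value
    (grastate.dat shows -1 after an unclean shutdown) means the position
    is unknown, so that node could still be the most advanced one.
    """
    unknown = sorted(node for node, seqno in seqnos.items() if seqno < 0)
    if unknown:
        raise ValueError("recover the position first on: " + ", ".join(unknown))
    return max(seqnos, key=seqnos.get)


# Hypothetical positions collected from three nodes.
print(most_advanced_node({"node1": 1500, "node2": 1498, "node3": 1502}))
```

With the sample numbers this picks node3, which would then be the node to sync the others from.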
Then the question is how to improve HA. Basically, lower the chance of losing quorum: place each node in a different blade/rack/power circuit/etc., have enough nodes (use garbd nodes to achieve that at the lowest cost), etc.
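To make the quorum arithmetic concrete, here is a minimal Python sketch. It is a simplified model, not Galera's actual code: it assumes every node (data node or garbd arbitrator) has exactly one vote and that the missing nodes crashed rather than left the cluster gracefully. The function name is made up for the example.

```python
def has_quorum(reachable_votes, last_known_votes):
    """Return True if the reachable votes form a strict majority.

    Simplified model: one vote per node or garbd arbitrator, and the
    missing nodes crashed instead of leaving the cluster gracefully.
    """
    return 2 * reachable_votes > last_known_votes


# 3 data nodes, 2 crash at once: the survivor holds 1 of 3 votes.
print(has_quorum(1, 3))
# 3 data nodes, 1 crashes: the remaining pair holds 2 of 3 votes.
print(has_quorum(2, 3))
# 3 data nodes plus 2 garbd arbitrators, 2 data nodes crash: 3 of 5 votes.
print(has_quorum(3, 5))
```

This prints False, True, True: the lone survivor of a 3-node cluster goes non-primary, while the same double failure is survivable once two arbitrators bring the vote count to 5.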
Then, in case any node(s) go down, increase the chance that such a node will rejoin using IST rather than SST:
[url]http://www.mysqlperformanceblog.com/2014/01/08/finding-good-ist-donor-percona-xtradb-cluster-5-6/[/url]
[url]http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/[/url]
You can also adjust the quorum settings if there are any less reliable or less important nodes in the cluster:
[url]http://www.codership.com/wiki/doku.php?id=weighted_quorum[/url]
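A small Python sketch of how weights change the outcome, assuming the rule from the weighted quorum page: a component stays primary when the weights it still holds are more than half of the weights in the last primary component, not counting nodes that left gracefully. The function name, node names and weight values are hypothetical; in a real cluster the weight is the pc.weight provider option.

```python
def has_weighted_quorum(weights, reachable, left_gracefully=()):
    """Weighted majority check.

    weights maps node name -> pc.weight in the last primary component,
    reachable lists the nodes still visible, and left_gracefully lists
    nodes that shut down cleanly and no longer count towards the total.
    """
    expected = sum(w for node, w in weights.items() if node not in left_gracefully)
    present = sum(weights[node] for node in reachable)
    return 2 * present > expected


equal = {"node1": 1, "node2": 1, "node3": 1}
favoured = {"node1": 3, "node2": 1, "node3": 1}

# Equal weights: node1 alone holds 1 of 3.
print(has_weighted_quorum(equal, ["node1"]))
# node1 weighted 3: alone it holds 3 of 5.
print(has_weighted_quorum(favoured, ["node1"]))
# The price: node2 and node3 together hold only 2 of 5 if node1 is lost.
print(has_weighted_quorum(favoured, ["node2", "node3"]))
```

This prints False, True, False, which shows the trade-off: a heavier node1 survives the loss of both other nodes, but the cluster no longer survives the loss of node1 alone.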

Thanks przemek for the links. Yes, I meant more of a recovery from a total cluster failure: a scenario where node 2 and node 3 are down and node 1 is the only one active.
To recover node 2 from node 1 we might need an SST. How would someone do that without accepting a few hours of downtime?

I will check those links now…

If node1 stays alive while the two other nodes go down, there is still a pretty good chance those nodes can rejoin via IST. If node1 also goes into the non-primary state, you only need to tell it that it is the primary component before joining the other nodes. The command to achieve this:
SET GLOBAL wsrep_provider_options='pc.bootstrap=true';

Also, SST does not necessarily mean complete downtime: the primary node acting as donor can still serve queries, though it may be too slow to handle the required workload. You can also restore the other nodes from the last backup, and then IST may be possible if the gcache on the primary node still holds the transactions since that backup. Details are in the link I posted before: [url]http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/[/url]
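The "IST if the gcache still holds the transactions" condition can be sketched in Python like this. It is a simplified model under my own assumptions, not the real donor selection logic: the joiner's position would come from its grastate.dat (or from the backup it was restored from), and the donor's range from the wsrep_local_cached_downto and wsrep_last_committed status variables. The function and parameter names are hypothetical.

```python
def can_use_ist(joiner_uuid, joiner_seqno, donor_uuid,
                donor_cached_downto, donor_last_committed):
    """Return True if the donor's gcache covers everything the joiner lacks.

    The joiner needs every write-set after joiner_seqno, so the oldest
    write-set still cached on the donor must be joiner_seqno + 1 or older.
    """
    if joiner_uuid != donor_uuid:
        return False  # different cluster state UUID: only a full SST can work
    if joiner_seqno < 0:
        return False  # unknown position on the joiner
    return (donor_cached_downto <= joiner_seqno + 1
            and joiner_seqno <= donor_last_committed)


# Hypothetical donor whose gcache holds write-sets 1000..5000.
print(can_use_ist("abc", 1200, "abc", 1000, 5000))  # backup taken at seqno 1200
print(can_use_ist("abc", 800, "abc", 1000, 5000))   # backup older than the gcache
```

The first call returns True (IST is possible), the second returns False (the gap has already been purged from the gcache, so an SST is needed). That is why a larger gcache and more recent backups both improve the odds of avoiding an SST.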