Need the best and quickest method for cluster recovery and failover

Hi,

I am running a 3-node cluster on Ubuntu 12.04.3 and had a hard time setting it up. Everything was working fine until yesterday, when one of the nodes (node1) rebooted automatically because of a system error. I have only just recovered from the failure, and in an unusual way.

This is what I tried.

  1. First I tried starting node1 with /etc/init.d/mysql start. I expected an IST because only a few transactions happened while the node was down, but it started an SST instead, removed everything in the datadir, and then failed with “cannot perform SST: operation not permitted”. This confused me because I had earlier bootstrapped this node (node1) as the primary.

  2. That left only 2 nodes with data, so I decided to make node3 the primary and tried to start node1 again. The SST started, paused for a few seconds, and then node1 threw this error:
    “WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)”.

I don’t understand why this error occurred: when I ran telnet to the other nodes, it connected, and after 2 seconds the connection was closed by the foreign host (is this normal, or something to worry about?).

  3. I tried starting mysql many times and got the same error. I even tried deleting some files (ib_logfile*, galera.cache…), with no luck.

  4. Then I realized Percona backup also supports rsync, so I ran rsync manually from the node3 datadir to the node1 and node2 datadirs (I stopped node2 as well because I wanted all the data to be identical). Then I started node1 and node2, and everything came up correctly.

The setup above was a test server, so time was not a problem. On production servers we could not afford this much downtime; we need a quick and easy way to get all nodes ready.

So I am asking all Percona users and developers: is there a standard, fully documented procedure anywhere? It should cover things like what the gcomm values should be during a failover and when to set them, whether all nodes can be made primary (and if so, how many), and which files need to be deleted or modified on which node.
Is this kind of Q&A summarized and available anywhere? If not, can we create one (maybe a thread in this forum)?

It looks like there is a network problem between the donor and the joining node (node1). Is there any kind of firewall on them? How did you start your 3-node cluster in the first place? Were the 2 other nodes created using SST from the initial node? Is there anything meaningful in the SST logs (innobackup.backup.log and innobackup.prepare.log)?
Anyway, there is documentation:
Cluster Failover: http://www.percona.com/doc/percona-x.../failover.html
Restarting the cluster nodes: http://www.percona.com/doc/percona-x...ing_nodes.html
http://www.percona.com/doc/percona-x…_transfer.html
etc.
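
To test the network hypothesis on more than one port, a small sketch along these lines could probe every port the cluster normally uses. It assumes the default PXC ports (3306 for clients, 4444 for SST, 4567 for group communication, 4568 for IST) and that nc (netcat) is installed. The function name check_ports is made up for illustration.

```shell
#!/bin/sh
# Default PXC/Galera ports (assumption: not overridden in my.cnf):
#   3306 MySQL clients, 4444 SST, 4567 group communication, 4568 IST
GALERA_PORTS="3306 4444 4567 4568"

# check_ports HOST: print one "HOST:PORT open|closed" line per port.
# Relies on nc; if nc is missing, every port is reported as closed.
check_ports() {
    host=$1
    for port in $GALERA_PORTS; do
        if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

Running check_ports node3 from the joining node would show whether the SST port is reachable as well, not only 4567. A port also shows as closed when nothing is listening on it yet, so the result is only a hint.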

There was no network problem: all nodes are connected to the same switch, there is no firewall between the 3 nodes, no iptables, and AppArmor is disabled. I was able to telnet to port 4567.
Out of the 3 nodes, I made node3 the primary and then tried starting node1 (node2 was already running). node1 attempted an SST but kept waiting for a long time, about 15 minutes, until I killed the process. The innobackup.backup.log file showed continuous “streaming… done” and “log scanned up to…” messages.

I have a few questions:

1) If we rsync from one node to all the other nodes, will it cause any problems? Is this method correct? (It worked in my case.)
2) If a “grastate.dat” file is present with uuid “16c92ddb-5bee-1…” and “seqno: -1”, will the node perform SST or IST?
3) If a node has a grastate.dat with some “uuid:3ret34…” and “seqno:3423” (not -1), and I make grastate.dat identical on all nodes, will it perform SST or IST? (I tried this, but it didn’t work.)
4) After all nodes are up, what should the gcomm:// values be on the primary (bootstrapped) node and on the non-primary nodes?
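
For questions 2 and 3, it can help to look at what grastate.dat records before starting the node. The sketch below only reads the uuid and seqno fields and prints them. The function names are hypothetical. The reading of seqno -1 as "no recorded position" is my understanding of Galera and is not confirmed in this thread, so the output is a hint and not a prediction of SST versus IST.

```shell
#!/bin/sh
# grastate_field FILE KEY: print the value after "KEY:" in a grastate.dat-style file.
grastate_field() {
    sed -n "s/^$2:[[:space:]]*//p" "$1"
}

# describe_state FILE: summarize the saved state.
# Assumption: seqno -1 means the position was not recorded at shutdown.
describe_state() {
    uuid=$(grastate_field "$1" uuid)
    seqno=$(grastate_field "$1" seqno)
    if [ "$seqno" = "-1" ]; then
        echo "uuid=$uuid seqno=-1: position unknown"
    else
        echo "uuid=$uuid seqno=$seqno: position recorded"
    fi
}
```

Calling describe_state on the grastate.dat in each node's datadir makes it easy to compare the nodes side by side.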

mysql --version
mysql Ver 14.14 Distrib 5.5.34, for Linux (x86_64) using readline 5.1

There are 3 xtrabackup binaries installed:
xtrabackup --version
xtrabackup version 2.1.6 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: 702)

xtrabackup_55 --version
xtrabackup_55 version 2.1.6 for Percona Server 5.5.31 Linux (x86_64) (revision id: 702)

xtrabackup_56 --version
xtrabackup_56 version 2.1.6 for MySQL server 5.6.11 Linux (x86_64) (revision id: 702)

Which one should I use? (For now I am using rsync.)

@przemek,
Waiting for your reply…

Hi, sorry, I was busy recently and lost track of this thread.

1 - What do you mean by “rsync”? wsrep_sst_method=rsync, or manually copying the data? If the latter, it is not going to work unless you stop the MySQL daemon first; copying a live datadir is something InnoDB does not support.
2 and 3 - See this article on how to avoid a full SST when restoring a node from backup:
http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/
4 - Once all nodes are up, the gcomm:// address should be the same on all nodes. In fact, you can make it contain all node addresses from the beginning; you only have to bootstrap the first node by starting it for the first time with “/etc/init.d/mysql bootstrap-pxc”.
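
To illustrate point 4, here is a minimal sketch that builds one address list and prints the my.cnf line to use on every node. The helper gcomm_address is hypothetical, and node1, node2, node3 stand in for the real host names or IP addresses.

```shell
#!/bin/sh
# gcomm_address NODE...: join node addresses into a single gcomm:// URL.
gcomm_address() {
    addr=""
    for node in "$@"; do
        addr="${addr:+$addr,}$node"
    done
    echo "gcomm://$addr"
}

# Line for the [mysqld] section of my.cnf, identical on all three nodes.
echo "wsrep_cluster_address=$(gcomm_address node1 node2 node3)"

# Start order described above:
#   first node only:  /etc/init.d/mysql bootstrap-pxc
#   remaining nodes:  /etc/init.d/mysql start
```

With the same list on every node, no configuration change is needed after a restart; only the choice of start command differs for the first node.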

Regarding the xtrabackup binary version: it is chosen automatically by the innobackupex script. You can also specify it manually; it should be xtrabackup_55 for PXC 5.5 and xtrabackup_56 for PXC 5.6.
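
The version mapping above can be written down as a small helper for anyone who scripts the choice. This is only a sketch of that mapping: the function name is made up, and it covers just the two versions mentioned here.

```shell
#!/bin/sh
# xtrabackup_binary_for VERSION: map a server version string to the
# binary named above. Only 5.5 and 5.6 are covered (assumption: other
# versions are out of scope for this thread).
xtrabackup_binary_for() {
    case "$1" in
        5.5*) echo "xtrabackup_55" ;;
        5.6*) echo "xtrabackup_56" ;;
        *)    echo "unknown"; return 1 ;;
    esac
}
```

For the server in this thread, xtrabackup_binary_for 5.5.34 prints xtrabackup_55.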

@przemek,
Thanks for the reply. By rsync I meant manually copying the datadir to the down node; I did it manually because the node would not start.
I usually try this as a last resort, and it worked that day.
Anyway, I have updated to 5.6 now, and all 3 nodes are in sync.
It would be good to have a kind of troubleshooting cookbook where common errors are addressed.