Need the best and quickest method for cluster recovery and failover

madhusudan · December 17, 2013, 5:37am

Hi,

I am running a 3-node cluster on Ubuntu 12.04.3 and had a hard time setting it up. Everything was working fine until yesterday, when one of the nodes (node1) rebooted on its own because of a system error. I have only just recovered from the failure, though not in the usual way.

This is what I tried.

1) First I tried starting node1 with /etc/init.d/mysql start. I expected it to do IST, because only a few transactions happened while it was down, but it started SST instead. It removed everything in the datadir and then failed with “cannot perform SST: operation not permitted”. This confused me, because I had earlier made this node (node1) the primary and bootstrapped it.
2) That left only 2 nodes with data, so I decided to make node3 the primary and tried to start node1 again. SST started, paused for a few seconds, and then node1 threw this error:
“WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)”.

I didn’t understand why this error occurred, because when I ran telnet to the other nodes it connected, and after 2 seconds the connection was closed by the foreign host (is this normal, or something to worry about?).

3) I tried starting mysql many times and got the same error each time. I even tried deleting some log files (ib_logfile*, galera.cache…), with no luck.

4) Then I realized that rsync is also supported, so I ran rsync manually from node3’s datadir to the datadir of node1 and node2 (I stopped node2 as well, because I wanted all the data to be identical). Then I started node1 and node2, and everything started correctly.

The above setup was a test server, so time was not a problem. On production servers I could not afford this much downtime; I need a quick and easy way to get all nodes ready.

So I am asking all Percona users and developers: is there a standard procedure written down completely anywhere, including what the gcomm values should be during failover and when to set them? Can we make all nodes primary? If yes, how many can we make primary, and which files need to be deleted or modified on which node, etc.?
Is this kind of Q&A summarized and available anywhere? If not, can we create one (maybe a thread in this forum)?

przemek · December 20, 2013, 4:21pm

It looks like there is some network problem between the donor and the joining node (node1). Is there any kind of firewall on them? How did you start your 3-node cluster in the first place? Were the 2 other nodes created using SST from the initial node? Is there anything meaningful in the SST logs (innobackup.backup.log and innobackup.prepare.log)?
Anyway, there is documentation:
Cluster Failover: http://www.percona.com/doc/percona-x.../failover.html
Restarting the cluster nodes: http://www.percona.com/doc/percona-x...ing_nodes.html
http://www.percona.com/doc/percona-x…_transfer.html
etc.

madhusudan · December 23, 2013, 2:02am

There was no network problem: all nodes are connected to the same switch, there is no firewall between the 3 nodes, no iptables, and AppArmor is also disabled. I was able to telnet to port 4567.
Out of the 3 nodes I made node3 the primary and then tried starting node1 (node2 was already running). node1 attempted SST but kept waiting for a long time (some 15 minutes, until I killed the process). The innobackup.backup.log file showed continuous “streaming… done” and “log scanned up to…” messages.
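
Since telnet to 4567 covers only one of the ports involved, here is a minimal Python sketch that probes all of the ports PXC normally uses between nodes. The port list assumes the PXC defaults (3306 for MySQL clients, 4444 for SST, 4567 for group communication, 4568 for IST), and `port_is_open` and `check_node` are hypothetical helper names. A successful TCP connect only shows that the port is reachable, not that a transfer will complete. As far as I know, 4444 and 4568 are usually only listening on the joiner while a transfer is in progress, so a closed result there is not by itself an error.

```python
import socket

# Assumed PXC default ports; adjust if my.cnf overrides them.
PXC_PORTS = {
    3306: "MySQL client connections",
    4444: "SST (state snapshot transfer)",
    4567: "Galera group communication",
    4568: "IST (incremental state transfer)",
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, ports=PXC_PORTS, timeout=3.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Running `check_node("node3")` from node1 (with `node3` replaced by the real host name or IP, which is a placeholder here) gives one True/False entry per port.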

I have a few questions:

1) If we rsync from one node to all the other nodes, will it cause any problems? Is this method correct? (Anyway, it worked in my case.)
2) If a “grastate.dat” file is present with uuid “16c92ddb-5bee-1…” and “seqno: -1”, will the node perform SST or IST?
3) If a node has a grastate.dat with some “uuid:3ret34…” and “seqno:3423” (not -1), and I then make grastate.dat identical on all nodes, will they perform SST or IST? (I tried this method but it didn’t work.)
4) After all nodes are up, what should the gcomm:// value be on the primary (bootstrapped) and non-primary nodes?
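
To make questions 2 and 3 concrete, here is a minimal Python sketch that reads the two fields in question. The sample contents and the uuid in it are made up for illustration, and `parse_grastate` and `has_known_position` are hypothetical helper names. The check only reports whether the file records a usable position (a non-zero uuid and a seqno other than -1). As far as I understand, that is a precondition for IST but does not guarantee it, since the donor must also still hold the missing write-sets in its gcache.

```python
# Made-up example contents; the uuid is NOT from a real cluster.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    11111111-2222-3333-4444-555555555555
seqno:   3423
cert_index:
"""

ZERO_UUID = "00000000-0000-0000-0000-000000000000"


def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def has_known_position(state):
    """True if the state has a non-zero uuid and a seqno other than -1."""
    uuid = state.get("uuid", ZERO_UUID)
    seqno = int(state.get("seqno", "-1"))
    return uuid != ZERO_UUID and seqno >= 0
```

In practice the text would come from the grastate.dat file in the datadir, for example `parse_grastate(open(path).read())`, where `path` depends on the installation.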

mysql --version
mysql Ver 14.14 Distrib 5.5.34, for Linux (x86_64) using readline 5.1

There are 3 xtrabackup binaries:
xtrabackup --version
xtrabackup version 2.1.6 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: 702)

xtrabackup_55 --version
xtrabackup_55 version 2.1.6 for Percona Server 5.5.31 Linux (x86_64) (revision id: 702)

xtrabackup_56 --version
xtrabackup_56 version 2.1.6 for MySQL server 5.6.11 Linux (x86_64) (revision id: 702)

Which one should I use? (Right now I am using rsync.)

madhusudan · January 16, 2014, 12:38am

@przemek,
waiting for the reply…

przemek · February 24, 2014, 7:13am

Hi, sorry, I was busy recently and lost track of this thread.

1 - What do you mean by “rsync”: wsrep_sst_method=rsync, or manually copying the data? If the latter, it is not going to work, and it’s something that InnoDB does not support, unless you stop the MySQL daemon first.
2 and 3 - see this article on how to avoid a full SST when restoring a node from backup:
http://www.mysqlperformanceblog.com/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/
4 - Once all nodes are up, the gcomm:// value should be the same on all nodes. In fact, you can make it contain all node addresses from the beginning; you only have to bootstrap the first node by starting it for the first time with “/etc/init.d/mysql bootstrap-pxc”.
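
As a rough illustration of point 4, here is a minimal Python sketch that builds the same gcomm:// value for every node. The IP addresses are placeholders and `cluster_address` and `my_cnf_line` are hypothetical helper names; only the `wsrep_cluster_address` variable and the gcomm:// scheme come from PXC itself.

```python
# Placeholder addresses; substitute the real IPs or host names of the 3 nodes.
NODES = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]


def cluster_address(nodes):
    """Build a gcomm:// URL listing every node, in the order given."""
    return "gcomm://" + ",".join(nodes)


def my_cnf_line(nodes):
    """Return the my.cnf line to place in the [mysqld] section on each node."""
    return "wsrep_cluster_address=" + cluster_address(nodes)


if __name__ == "__main__":
    # The same line goes into my.cnf on all three nodes.
    print(my_cnf_line(NODES))
```

Note that an empty list yields a bare `gcomm://`, which is the address that makes a node form a new cluster, so it should not be left in my.cnf on a node that is meant to rejoin an existing one.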

Regarding the xtrabackup binary version - it will be automatically chosen by the innobackupex script. You can also specify it manually, and it should be xtrabackup_55 for PXC 5.5, and xtrabackup_56 for PXC 5.6.

madhusudan · February 24, 2014, 7:50am

@przemek,
Thanks for the reply. By rsync I meant manually copying the datadir to the down node, and the reason for the manual rsync was that the node would not start.
I usually try this as a last option, and it worked that day.
Anyway, I have updated to 5.6 now, and all 3 nodes are in sync.
It would be good to have a kind of troubleshooting cookbook where common errors are addressed.
