Joining second node without cluster lock

Is there a way, when recovering from failed node scenarios, to recover from being reduced to a single node without locking the cluster when the next node rejoins?

I have a 3-node cluster plus an arbitrator, and one node had to be taken out for maintenance. When it tried to rejoin, it synced from one of the two remaining nodes, hung, crashed, and took the donor node down with it. This has left me in a single-node state, and I'd like to be able to get back to multi-node without taking an outage.

I'm using xtrabackup and xbstream as my SST method, but I've noticed that when two nodes sync this way, it locks the donor node. Is there a way around this? Given enough nodes (even just two already in sync), adding a third allows one node to remain active and serve requests. But what do you do when you're down to one node and need to recover without locking the whole cluster for an hour while the data syncs?

You can try manually SSTing a node with Xtrabackup. See if this method works for you:

Say Node1 is the donor node and Node2 is the joiner:


node1> innobackupex --galera-info /path/to/backup
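
In practice this command usually needs MySQL credentials, and adding --no-timestamp keeps the backup directly under the given directory (so the later steps can reference /path/to/backup as written). An expanded sketch, where the user, password, and path are placeholders:

node1> innobackupex --user=sstuser --password=secret --no-timestamp --galera-info /path/to/backup

The --galera-info option records the cluster UUID and seqno in xtrabackup_galera_info inside the backup; that value is used below to seed grastate.dat.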

Move/copy the backup to Node2
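
One way to do this, a sketch assuming SSH access between the nodes and the same placeholder paths as above:

node1> rsync -av /path/to/backup/ node2:/path/to/backup/

Then prepare the backup and copy it into place as Node2's data directory: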


node2> innobackupex --apply-log /path/to/backup
node2> rm -rf /path/to/datadir
node2> cp -av /path/to/backup /path/to/datadir
node2> chown -R mysql:mysql /path/to/datadir

Check Galera GTID:


node2> cat /var/lib/mysql/xtrabackup_galera_info
8797f811-7f73-11e2-0800-8b513b3819c1:22809

Initialize grastate.dat with the UUID and seqno you just read:


node2> vim /var/lib/mysql/grastate.dat
node2> chown -R mysql:mysql /var/lib/mysql/grastate.dat
node2> cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 8797f811-7f73-11e2-0800-8b513b3819c1
seqno: 22809
cert_index: 
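
If you prefer not to edit the file by hand, the same grastate.dat can be generated from xtrabackup_galera_info. A minimal sketch, run as root on node2 and assuming the uuid:seqno format shown above:

# read the position recorded by --galera-info (format: <uuid>:<seqno>)
UUID=$(cut -d: -f1 /var/lib/mysql/xtrabackup_galera_info)
SEQNO=$(cut -d: -f2 /var/lib/mysql/xtrabackup_galera_info)

# write a minimal grastate.dat so the node can request IST from this position
cat > /var/lib/mysql/grastate.dat <<EOF
# GALERA saved state
version: 2.1
uuid: $UUID
seqno: $SEQNO
cert_index:
EOF
chown mysql:mysql /var/lib/mysql/grastate.dat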

Then start node2:


node2> service mysql start

If all goes well, node2 should join via IST (incremental state transfer) only, without triggering a blocking SST on the donor.
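
To confirm the node really joined via IST rather than a full SST, check the joiner's error log and the wsrep status. A quick sketch, assuming the error log lives at /var/log/mysqld.log (adjust the path to your setup):

node2> grep -E 'IST received|State transfer' /var/log/mysqld.log
node2> mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"

Keep in mind IST is only possible while the donor's gcache still holds the writesets the joiner is missing; if they have been purged (for example after a long outage or heavy write load), Galera will fall back to a full SST.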

I’ll give this a try! Thanks!

This worked perfectly.

Thanks so much!

So the node comes in, lasts about 5 minutes, and then crashes. I'm getting the following in the mysql error log:

2016-01-11 14:47:28 26516 [Note] WSREP: IST received: c21bef5c-a863-11e5-a95e-86feb94b37d0:42183257
2016-01-11 14:47:28 26516 [Note] WSREP: 2.0 (moodledata03): State transfer from 1.0 (moodledata01) complete.
2016-01-11 14:47:28 26516 [Note] WSREP: Shifting JOINER -> JOINED (TO: 42218510)
2016-01-11 14:47:59 26516 [Note] WSREP: Member 2.0 (moodledata03) synced with group.
2016-01-11 14:47:59 26516 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 42221260)
2016-01-11 14:47:59 26516 [Note] WSREP: Synchronized with group, ready for connections
2016-01-11 14:47:59 26516 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
19:50:28 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=134217728
read_buffer_size=4194304
max_used_connections=28
max_threads=1002
thread_count=32
connection_count=23
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 12457966 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x3f88ce40
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Segmentation fault (core dumped)
160111 15:04:40 mysqld_safe Number of processes running now: 0
160111 15:04:40 mysqld_safe WSREP: not restarting wsrep node automatically
160111 15:04:40 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Are there any known bugs with this method?

The crash report doesn't tell us much. Are you able to start this node up again?

I am having the exact same issue after following the steps above.

2016-06-14 20:58:13 25109 [Note] WSREP: IST received: f904a9a7-db79-11e5-ae9e-6ac3a9358431:936038530
2016-06-14 20:58:13 25109 [Note] WSREP: 0.0 (10.0.3.20): State transfer from 1.0 (10.0.3.21) complete.
2016-06-14 20:58:13 25109 [Note] WSREP: Shifting JOINER -> JOINED (TO: 936327934)
2016-06-14 21:03:44 25109 [Note] WSREP: Member 0.0 (10.0.3.20) synced with group.
2016-06-14 21:03:44 25109 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 936358072)
2016-06-14 21:03:45 25109 [Note] WSREP: Synchronized with group, ready for connections
2016-06-14 21:03:45 25109 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
02:04:31 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=449
max_threads=2002
thread_count=311
connection_count=286
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1059535 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0x14727f330
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f438c2f5d38 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8fd375]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x666264]
/lib64/libpthread.so.0[0x3058e0f790]
[0x7f3f04015ea0]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f3f04004bf0): is an invalid pointer
Connection ID (thread ID): 4947
Status: NOT_KILLED

I have tried the same process on a different server and it worked without issue. Signal 11 typically indicates a segfault / memory issue, so I am going to reinstall mysql and wsrep etc. on the node that wasn't working and then attempt the IST again.
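
If corrupt binaries are the suspect, one quick check before reinstalling is to verify the installed packages against the package database. A rough sketch for an RPM-based system (the package names are assumptions for PXC 5.6; adjust to your distribution and version):

# report any changed or corrupt files belonging to installed Percona packages
rpm -V $(rpm -qa 'Percona*')

# reinstall the server and Galera packages
yum reinstall Percona-XtraDB-Cluster-server-56 Percona-XtraDB-Cluster-galera-3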