Node fails to join

Greetings everybody,

I’ve got a serious issue: I have a 3-node cluster and 2 of the nodes went down for an unknown reason, and then the third went down as well.
So I decided to restart the cluster, bootstrapping from the last node that went down, and that ultimately went OK.
But when I restart one of the other 2 nodes with "systemctl start mysql", it just hangs for a long time and then fails.
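For reference, this is roughly how I started things (the bootstrap unit name below is what PXC on systemd uses here, so it may differ on other versions):

  # on the node I bootstrapped (now the donor):
  systemctl start mysql@bootstrap.service
  # on a node that should rejoin the cluster:
  systemctl start mysql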

Now I can see that the joiner starts the SST and waits,
and I can also see on the donor (the bootstrapped node) that it initiates the transfer.

But after a while the joiner just collapses or crashes, even though the donor says everything went OK.
Here’s the log output:

Donor:
2019-01-30T22:23:02.953484Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.175.53.26:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/mysql-data/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '25dbb128-bfd8-11e8-bb2c-47ff83f167eb:83217156')
2019-01-30T22:23:02.955073Z 3 [Note] WSREP: DONOR thread signaled with 0
2019-01-30T22:23:04.662841Z 0 [Note] WSREP: (7531a70f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-01-30T22:23:13.985705Z WSREP_SST: [INFO] Streaming the backup to joiner at 10.175.53.26 4444
2019-01-31T00:48:09.406861Z 0 [Note] WSREP: 0.0 (ams1): State transfer to 1.0 (ams3) complete.
2019-01-31T00:48:09.406909Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 83217205)
2019-01-31T00:48:09.407221Z 0 [Note] WSREP: Member 0.0 (ams1) synced with group.
2019-01-31T00:48:09.407231Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 83217205)
2019-01-31T00:48:09.407247Z 7 [Note] WSREP: Synchronized with group, ready for connections
2019-01-31T00:48:09.407256Z 7 [Note] WSREP: Setting wsrep_ready to true

Joiner:
2019-01-30T19:03:26.211039Z WSREP_SST: [INFO] ...Waiting for SST streaming to complete!
2019-01-30T21:24:30.328062Z 0 [Note] WSREP: 1.0 (ams1): State transfer to 0.0 (ams3) complete.
2019-01-30T21:24:30.328410Z 0 [Note] WSREP: Member 1.0 (ams1) synced with group.
2019-01-30T21:24:30.355647Z WSREP_SST: [INFO] Preparing the backup at /mysql-data/data//.sst
Terminated
2019-01-30T21:27:04.883775Z WSREP_SST: [ERROR] Removing /mysql-data/data//.sst/xtrabackup_galera_info file due to signal
2019-01-30T21:27:04.892351Z WSREP_SST: [ERROR] Removing file due to signal
2019-01-30T21:27:04.895695Z WSREP_SST: [ERROR] Cleanup after exit with status:143
2019-01-30T21:27:04.927357Z 0 [ERROR] WSREP: Process was aborted.
2019-01-30T21:27:04.927403Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.175.53.26' --datadir '/mysql-data/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '25678' '' : 2 (No such file or directory)
2019-01-30T21:27:04.927469Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2019-01-30T21:27:04.927482Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2019-01-30T21:27:04.927582Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2019-01-30T21:27:04.927604Z 0 [ERROR] Aborting
2019-01-30T21:27:04.927610Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2019-01-30T21:27:04.927639Z 0 [Note] WSREP: SST request was cancelled
2019-01-30T21:27:04.927653Z 0 [Note] Giving 2 client threads a chance to die gracefully
2019-01-30T21:27:04.927696Z 2 [Note] WSREP: Closing send monitor...
2019-01-30T21:27:04.927706Z 2 [Note] WSREP: Closed send monitor.
2019-01-30T21:27:04.927719Z 2 [Note] WSREP: gcomm: terminating thread
2019-01-30T21:27:04.927763Z 2 [Note] WSREP: gcomm: joining thread
2019-01-30T21:27:04.927992Z 2 [Note] WSREP: gcomm: closing backend
2019-01-30T21:27:06.928779Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2019-01-30T21:27:06.928851Z 1 [Note] WSREP: rollbacker thread exiting
2019-01-30T21:27:08.228614Z 2 [Note] WSREP: (6e426773, 'tcp://0.0.0.0:4567') connection to peer 7531a70f with addr tcp://10.175.53.19:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2019-01-30T21:27:08.228882Z 2 [Note] WSREP: (6e426773, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.175.53.19:4567
2019-01-30T21:27:09.728625Z 2 [Note] WSREP: (6e426773, 'tcp://0.0.0.0:4567') reconnecting to 7531a70f (tcp://10.175.53.19:4567), attempt 0
2019-01-30T21:27:10.428280Z 2 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2019-01-30T21:27:10.428335Z 2 [Note] WSREP: evs::proto(6e426773, LEAVING, view_id(REG,6e426773,4)) suspecting node: 7531a70f
2019-01-30T21:27:10.428363Z 2 [Note] WSREP: evs::proto(6e426773, LEAVING, view_id(REG,6e426773,4)) suspected node without join message, declaring inactive
2019-01-30T21:27:10.428407Z 2 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,6e426773,4)
memb {
6e426773,0
}
joined {
}
left {
}
partitioned {
7531a70f,0
}
)
2019-01-30T21:27:10.428467Z 2 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2019-01-30T21:27:10.428846Z 2 [Note] WSREP: gcomm: closed
2019-01-30T21:27:10.428904Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2019-01-30T21:27:10.428953Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2019-01-30T21:27:10.428961Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-30T21:27:10.428966Z 0 [Note] WSREP: Received NON-PRIMARY.
2019-01-30T21:27:10.428972Z 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 83086190)
2019-01-30T21:27:10.428984Z 0 [Note] WSREP: Received self-leave message.
2019-01-30T21:27:10.428990Z 0 [Note] WSREP: Flow-control interval: [0, 0]
2019-01-30T21:27:10.428995Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-30T21:27:10.429000Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2019-01-30T21:27:10.429005Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 83086190)
2019-01-30T21:27:10.429014Z 0 [Note] WSREP: RECV thread exiting 0: Success
2019-01-30T21:27:10.429075Z 2 [Note] WSREP: recv_thread() joined.
2019-01-30T21:27:10.429084Z 2 [Note] WSREP: Closing replication queue.
2019-01-30T21:27:10.429090Z 2 [Note] WSREP: Closing slave action queue.
2019-01-30T21:27:10.429138Z 2 [Note] WSREP: Ignorng trx(83076904) due to SST failure
2019-01-30T21:27:10.429151Z 2 [Note] WSREP: Ignorng trx(83076905) due to SST failure

I hope someone knows what this is about, because I’m lost and new to this!
Here’s a link to: Donor log, joiner log, my.cnf

Extra info:
I came across people suggesting to increase the timeout in /usr/bin/mysql-systemd,
which I increased from 900 to 9000. But that is only about 2.5 hours for roughly 750GB of data, and I do not know if that is long enough.
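As a rough back-of-the-envelope check (assuming an effective SST throughput of around 100 MB/s, which is purely an assumption about the network here):

  # ~750 GB streamed at ~100 MB/s, ignoring the prepare/move phases that follow
  echo $(( 750 * 1024 / 100 ))    # = 7680 seconds just for the streaming part

So 9000 seconds would barely cover the stream itself, which also roughly matches the joiner getting terminated about two and a half hours after it started waiting.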

I also removed all the data in the folder containing grastate.dat (including the .dat file itself) before I initiated the transfer.
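(For completeness, what I did on the joiner was roughly the following; the datadir path is the one from my my.cnf, so adjust accordingly:)

  # on the joiner, with mysql stopped: wipe the datadir so the node has no saved state
  systemctl stop mysql
  rm -rf /mysql-data/data/*     # removes grastate.dat too, forcing a full SST on the next start
  systemctl start mysql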

Currently, the cluster is “working” from the application’s perspective, as it is only pushing to the bootstrapped node. I don’t know if there are serious dangers in leaving just the bootstrapped node up for a day, so that I can figure this out without interfering with a live application.
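In the meantime I’m just checking that the bootstrapped node stays Primary with something like:

  # confirm the remaining node is still Primary and synced
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"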

Kind regards,
Jeffrey de Boer

Hi jdeboer, please attach the innobackup..log, which can be found inside the datadir of the JOINER node. On that note, we’d also appreciate it if you could attach the innobackup..log from the DONOR node as well.
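They should be sitting directly in the datadir; something like this should list them (the path below is the datadir shown in your logs):

  # run on each node
  ls -l /mysql-data/data/innobackup.*.log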

Hi Jrivera,

Sorry for the late reply, I fell asleep haha, it was 4am here.
So yeah, I’ve uploaded the innobackup log files that are available. The donor had all the innobackup log files, but the joiner only had the prepare log.

JOINER: innobackup.prepare.log
DONOR: innobackup.prepare.log
DONOR: innobackup.move.log
DONOR: innobackup.backup.log

Hope that’s enough. Thanks!

So just to update: I managed to get the cluster online again.
All I had to do was increase the timeout in /usr/bin/mysql-systemd so the service is allowed to run longer during startup.

This also explains why the logs on the donor stated that the SST was completed while nothing showed up on the joiner:
I guess the service start simply timed out during the last part, after the SST stream itself had finished.
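For anyone who hits the same thing, the change was essentially just a much larger startup timeout. The exact variable name inside /usr/bin/mysql-systemd seems to vary between versions, so treat this as a sketch; a systemd drop-in with TimeoutStartSec may achieve a similar effect:

  # find the 900-second value the wrapper script uses (the variable name varies per version)
  grep -n -i timeout /usr/bin/mysql-systemd

  # alternative sketch: override the unit's start timeout instead of editing the script
  systemctl edit mysql
  #   [Service]
  #   TimeoutStartSec=0      # or a value comfortably larger than the expected SST duration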

I still don’t know why it all crashed in the first place. If anybody has any idea based on the logs I provided, that would be most helpful, as I’m a bit fearful that this isn’t running stably now.

Kind regards,
Jeffrey de Boer