Greetings everybody,
I’ve got a serious issue, i have a 3 node cluster and 2 of them went down for an unknown reason. Then the third also went down.
So i decided to restart hte cluster. Bootstrap the last one out and that ultimately went ok.
But restarting one of the other 2 nodes with “systemctl start mysql” it just hangs a long time and then fails.
Now i can see that the joiner starts SST and waits.
And i can also see on the Donor (bootstrapped) that it initiates this.
But after a while it just collapses or crashes, eventhough the donor says everything went ok.
Here’s the log out put:
Donor:
2019-01-30T22:23:02.953484Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role ‘donor’ --address ‘10.175.53.26:4444/xtrabackup_sst//1’ --socket ‘/var/lib/mysql/mysql.sock’ --datadir ‘/mysql-data/data/’ --defaults-file ‘/etc/my.cnf’ --defaults-group-suffix ‘’ ‘’ --gtid ‘25dbb128-bfd8-11e8-bb2c-47ff83f167eb:83217156’)
2019-01-30T22:23:02.955073Z 3 [Note] WSREP: DONOR thread signaled with 0
2019-01-30T22:23:04.662841Z 0 [Note] WSREP: (7531a70f, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2019-01-30T22:23:13.985705Z WSREP_SST: [INFO] Streaming the backup to joiner at 10.175.53.26 4444
2019-01-31T00:48:09.406861Z 0 [Note] WSREP: 0.0 (ams1): State transfer to 1.0 (ams3) complete.
2019-01-31T00:48:09.406909Z 0 [Note] WSREP: Shifting DONOR/DESYNCED → JOINED (TO: 83217205)
2019-01-31T00:48:09.407221Z 0 [Note] WSREP: Member 0.0 (ams1) synced with group.
2019-01-31T00:48:09.407231Z 0 [Note] WSREP: Shifting JOINED → SYNCED (TO: 83217205)
2019-01-31T00:48:09.407247Z 7 [Note] WSREP: Synchronized with group, ready for connections
2019-01-31T00:48:09.407256Z 7 [Note] WSREP: Setting wsrep_ready to true
Joiner:
2019-01-30T19:03:26.211039Z WSREP_SST: [INFO] …Waiting for SST streaming to complete!
2019-01-30T21:24:30.328062Z 0 [Note] WSREP: 1.0 (ams1): State transfer to 0.0 (ams3) complete.
2019-01-30T21:24:30.328410Z 0 [Note] WSREP: Member 1.0 (ams1) synced with group.
2019-01-30T21:24:30.355647Z WSREP_SST: [INFO] Preparing the backup at /mysql-data/data//.sst
Terminated
2019-01-30T21:27:04.883775Z WSREP_SST: [ERROR] Removing /mysql-data/data//.sst/xtrabackup_galera_info file due to signal
2019-01-30T21:27:04.892351Z WSREP_SST: [ERROR] Removing file due to signal
2019-01-30T21:27:04.895695Z WSREP_SST: [ERROR] Cleanup after exit with status:143
2019-01-30T21:27:04.927357Z 0 [ERROR] WSREP: Process was aborted.
2019-01-30T21:27:04.927403Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role ‘joiner’ --address ‘10.175.53.26’ --datadir ‘/mysql-data/data/’ --defaults-file ‘/etc/my.cnf’ --defaults-group-suffix ‘’ --parent ‘25678’ ‘’ : 2 (No such file or directory)
2019-01-30T21:27:04.927469Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2019-01-30T21:27:04.927482Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2019-01-30T21:27:04.927582Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2019-01-30T21:27:04.927604Z 0 [ERROR] Aborting
2019-01-30T21:27:04.927610Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2019-01-30T21:27:04.927639Z 0 [Note] WSREP: SST request was cancelled
2019-01-30T21:27:04.927653Z 0 [Note] Giving 2 client threads a chance to die gracefully
2019-01-30T21:27:04.927696Z 2 [Note] WSREP: Closing send monitor…
2019-01-30T21:27:04.927706Z 2 [Note] WSREP: Closed send monitor.
2019-01-30T21:27:04.927719Z 2 [Note] WSREP: gcomm: terminating thread
2019-01-30T21:27:04.927763Z 2 [Note] WSREP: gcomm: joining thread
2019-01-30T21:27:04.927992Z 2 [Note] WSREP: gcomm: closing backend
2019-01-30T21:27:06.928779Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2019-01-30T21:27:06.928851Z 1 [Note] WSREP: rollbacker thread exiting
2019-01-30T21:27:08.228614Z 2 [Note] WSREP: (6e426773, ‘tcp://0.0.0.0:4567’) connection to peer 7531a70f with addr tcp://10.175.53.19:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2019-01-30T21:27:08.228882Z 2 [Note] WSREP: (6e426773, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.175.53.19:4567
2019-01-30T21:27:09.728625Z 2 [Note] WSREP: (6e426773, ‘tcp://0.0.0.0:4567’) reconnecting to 7531a70f (tcp://10.175.53.19:4567), attempt 0
2019-01-30T21:27:10.428280Z 2 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2019-01-30T21:27:10.428335Z 2 [Note] WSREP: evs::proto(6e426773, LEAVING, view_id(REG,6e426773,4)) suspecting node: 7531a70f
2019-01-30T21:27:10.428363Z 2 [Note] WSREP: evs::proto(6e426773, LEAVING, view_id(REG,6e426773,4)) suspected node without join message, declaring inactive
2019-01-30T21:27:10.428407Z 2 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,6e426773,4)
memb {
6e426773,0
}
joined {
}
left {
}
partitioned {
7531a70f,0
}
)
2019-01-30T21:27:10.428467Z 2 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2019-01-30T21:27:10.428846Z 2 [Note] WSREP: gcomm: closed
2019-01-30T21:27:10.428904Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2019-01-30T21:27:10.428953Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2019-01-30T21:27:10.428961Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-30T21:27:10.428966Z 0 [Note] WSREP: Received NON-PRIMARY.
2019-01-30T21:27:10.428972Z 0 [Note] WSREP: Shifting JOINER → OPEN (TO: 83086190)
2019-01-30T21:27:10.428984Z 0 [Note] WSREP: Received self-leave message.
2019-01-30T21:27:10.428990Z 0 [Note] WSREP: Flow-control interval: [0, 0]
2019-01-30T21:27:10.428995Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-30T21:27:10.429000Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2019-01-30T21:27:10.429005Z 0 [Note] WSREP: Shifting OPEN → CLOSED (TO: 83086190)
2019-01-30T21:27:10.429014Z 0 [Note] WSREP: RECV thread exiting 0: Success
2019-01-30T21:27:10.429075Z 2 [Note] WSREP: recv_thread() joined.
2019-01-30T21:27:10.429084Z 2 [Note] WSREP: Closing replication queue.
2019-01-30T21:27:10.429090Z 2 [Note] WSREP: Closing slave action queue.
2019-01-30T21:27:10.429138Z 2 [Note] WSREP: Ignorng trx(83076904) due to SST failure
2019-01-30T21:27:10.429151Z 2 [Note] WSREP: Ignorng trx(83076905) due to SST failure
Hope anyone knows what this is about because i’m lost and new to this!
Here’s a link to: Donor log, joiner log, my.cnf