Node not joining cluster

jdeboer · January 30, 2019, 8:40pm

Greetings all,

I hope somebody here can help me out, because I’m at a loss.
For some reason, two of my three nodes went down earlier today, and after a while the third one went down as well.

I couldn’t find out why they went down, but I needed them back up quickly, so I decided to bootstrap the last node in order to get the cluster back up and running.
That went OK, and that node is now bootstrapped. However, when I try to start PXC on another node with “systemctl start mysql”, it just hangs.
I can see in the logs that it’s waiting for the SST transfer, but after a while it just fails.

For full disclosure, the complete logs and my.cnf are linked at the bottom.
But here’s what I could gather (snippets, with some noise removed to keep the post short):

Bootstrapped node:

2019-01-30T22:23:02.953203Z 0 [Note] WSREP: Member 1.0 (ams3) requested state transfer from '*any*'. Selected 0.0 (ams1)(SYNCED) as donor.
2019-01-30T22:23:02.953227Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 83217156)
2019-01-30T22:23:02.953253Z 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-01-30T22:23:02.953484Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.175.53.26:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/mysql-data/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid '25dbb128-bfd8-11e8-bb2c-47ff83f167eb:83217156')
2019-01-30T22:23:02.955073Z 3 [Note] WSREP: DONOR thread signaled with 0
2019-01-30T22:23:04.662841Z 0 [Note] WSREP: (7531a70f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-01-30T22:23:13.985705Z WSREP_SST: [INFO] Streaming the backup to joiner at 10.175.53.26 4444
2019-01-31T00:48:09.406861Z 0 [Note] WSREP: 0.0 (ams1): State transfer to 1.0 (ams3) complete.
2019-01-31T00:48:09.406909Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 83217205)
2019-01-31T00:48:09.407221Z 0 [Note] WSREP: Member 0.0 (ams1) synced with group.
2019-01-31T00:48:09.407231Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 83217205)

On the Joiner node:

2019-01-30T22:23:20.998849Z WSREP_SST: [INFO] Proceeding with SST.........
2019-01-30T22:23:21.014226Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2019-01-31T00:48:08.803819Z 0 [Note] WSREP: 0.0 (ams1): State transfer to 1.0 (ams3) complete.
2019-01-31T00:48:08.804145Z 0 [Note] WSREP: Member 0.0 (ams1) synced with group.
2019-01-31T00:48:08.815404Z WSREP_SST: [INFO] Preparing the backup at /mysql-data/data//.sst
Terminated
2019-01-31T00:52:35.437746Z WSREP_SST: [ERROR] Removing /mysql-data/data//.sst/xtrabackup_galera_info file due to signal
2019-01-31T00:52:35.441422Z WSREP_SST: [ERROR] Removing file due to signal
2019-01-31T00:52:35.444802Z WSREP_SST: [ERROR] Cleanup after exit with status:143
2019-01-31T00:52:35.466189Z 0 [ERROR] WSREP: Process was aborted.
2019-01-31T00:52:35.466237Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.175.53.26' --datadir '/mysql-data/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '6563' '' : 2 (No such file or directory)
2019-01-31T00:52:35.466263Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2019-01-31T00:52:35.466276Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2019-01-31T00:52:35.466338Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2019-01-31T00:52:35.466360Z 0 [ERROR] Aborting

2019-01-31T00:52:35.466366Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2019-01-31T00:52:35.466395Z 0 [Note] WSREP: SST request was cancelled
2019-01-31T00:52:35.466424Z 0 [Note] Giving 2 client threads a chance to die gracefully
2019-01-31T00:52:35.466471Z 1 [Note] WSREP: Closing send monitor...
2019-01-31T00:52:35.466502Z 1 [Note] WSREP: Closed send monitor.
2019-01-31T00:52:35.466524Z 1 [Note] WSREP: gcomm: terminating thread
2019-01-31T00:52:35.466538Z 1 [Note] WSREP: gcomm: joining thread
2019-01-31T00:52:35.466726Z 1 [Note] WSREP: gcomm: closing backend
2019-01-31T00:52:37.466603Z 0 [Note] WSREP: Waiting for active wsrep applier to exit
2019-01-31T00:52:37.466705Z 2 [Note] WSREP: rollbacker thread exiting
2019-01-31T00:52:38.577318Z 1 [Note] WSREP: (9a57e698, 'tcp://0.0.0.0:4567') connection to peer 7531a70f with addr tcp://10.175.53.19:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2019-01-31T00:52:38.577505Z 1 [Note] WSREP: (9a57e698, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.175.53.19:4567
2019-01-31T00:52:40.077331Z 1 [Note] WSREP: (9a57e698, 'tcp://0.0.0.0:4567') reconnecting to 7531a70f (tcp://10.175.53.19:4567), attempt 0
2019-01-31T00:52:40.967022Z 1 [Note] WSREP: declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)
2019-01-31T00:52:40.967117Z 1 [Note] WSREP: evs::proto(9a57e698, LEAVING, view_id(REG,7531a70f,6)) suspecting node: 7531a70f
2019-01-31T00:52:40.967130Z 1 [Note] WSREP: evs::proto(9a57e698, LEAVING, view_id(REG,7531a70f,6)) suspected node without join message, declaring inactive
2019-01-31T00:52:40.967173Z 1 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,7531a70f,6)
memb {
9a57e698,0
}
joined {
}
left {
}
partitioned {
7531a70f,0
}
)
2019-01-31T00:52:40.967238Z 1 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2019-01-31T00:52:40.967939Z 1 [Note] WSREP: gcomm: closed
2019-01-31T00:52:40.968000Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2019-01-31T00:52:40.968053Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2019-01-31T00:52:40.968073Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-31T00:52:40.968081Z 0 [Note] WSREP: Received NON-PRIMARY.
2019-01-31T00:52:40.968089Z 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 83233220)
2019-01-31T00:52:40.968110Z 0 [Note] WSREP: Received self-leave message.
2019-01-31T00:52:40.968142Z 0 [Note] WSREP: Flow-control interval: [0, 0]
2019-01-31T00:52:40.968167Z 0 [Note] WSREP: Trying to continue unpaused monitor
2019-01-31T00:52:40.968182Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2019-01-31T00:52:40.968205Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 83233220)
2019-01-31T00:52:40.968219Z 0 [Note] WSREP: RECV thread exiting 0: Success
2019-01-31T00:52:40.968289Z 1 [Note] WSREP: recv_thread() joined.
2019-01-31T00:52:40.968308Z 1 [Note] WSREP: Closing replication queue.
2019-01-31T00:52:40.968314Z 1 [Note] WSREP: Closing slave action queue.
2019-01-31T00:52:40.968367Z 1 [Note] WSREP: Ignorng trx(83217132) due to SST failure
2019-01-31T00:52:40.968381Z 1 [Note] WSREP: Ignorng trx(83217133) due to SST failure
... (lines like the ones above repeat from here on out)
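The bare “Terminated” line and “Cleanup after exit with status:143” in the joiner log fit the usual shell convention that a process killed by signal N exits with status 128 + N. Assuming the SST script follows that convention, 143 points to signal 15, SIGTERM, i.e. something outside the script asked it to stop. A minimal Python sketch of that decoding (describe_exit_status is a hypothetical helper, not part of PXC or the SST script):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map a shell-style exit status to a signal name if it looks like 128 + N.

    Assumption: the status follows the shell convention where a child
    killed by signal N is reported as 128 + N.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            pass  # not a valid signal number on this platform
    return f"exit code {status}"


# status:143 from the "Cleanup after exit with status:143" line
print(describe_exit_status(143))  # SIGTERM
```

A status of 128 or less (such as the “2” mysqld reports afterwards) is just returned as a plain exit code by this sketch.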

I’m hoping somebody can point me in the right direction, because I’m new to PXC and XtraDB and a bit lost.

Links:
Bootstrap log: mysqld_bootstrapped.zip (Dropbox)
Joiner log: mysqld_joiner.zip (Dropbox)
my.cnf: my.zip (Dropbox)

przemek · February 13, 2019, 8:11am

Hello,

It seems that the joiner cancelled the SST because it received a termination signal. As you are using systemd, check whether this may be a service timeout issue: is TimeoutStartSec set in /etc/systemd/system/mysqld.service?
Could you also attach innobackup.prepare.log and innobackup.move.log from the joiner node?
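To test the timeout theory, it helps to know how long the joiner had been working on the SST when it was killed. The sketch below takes two timestamps from the joiner log quoted above (“Proceeding with SST” and the first “due to signal” line) and computes the difference. Because “systemctl start” was issued some time before the first SST line, this is only a lower bound on how long the unit had been starting. The helper names and the 7200-second timeout are hypothetical, chosen for illustration only:

```python
from datetime import datetime, timedelta

# Timestamp format of the error-log lines quoted above (UTC, microseconds).
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(ts: str) -> datetime:
    """Parse one error-log timestamp such as 2019-01-30T22:23:20.998849Z."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


def elapsed(start: str, end: str) -> timedelta:
    """Time between two error-log timestamps."""
    return parse_ts(end) - parse_ts(start)


def exceeds_timeout(runtime: timedelta, timeout_sec: float) -> bool:
    """True if runtime is longer than a start timeout given in seconds."""
    return runtime.total_seconds() > timeout_sec


# Joiner log: "Proceeding with SST" -> "Removing ... due to signal"
sst_runtime = elapsed("2019-01-30T22:23:20.998849Z", "2019-01-31T00:52:35.437746Z")
print(sst_runtime)  # 2:29:14.438897

# Hypothetical timeout for illustration; substitute the unit's real setting.
print(exceeds_timeout(sst_runtime, 7200))  # True
```

The effective start timeout of the unit can be read with “systemctl show mysql -p TimeoutStartUSec” (using the unit name from the original post) and compared with this figure.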
