Crashed Cluste, Nodes not Joinig

Hi there. My monitoring told me this morning, that my whole cluster ist down. I cloud bootstrap one Node, but the second and third node wont join the Cluster anymore. Here’s what my log tells me:

2022-09-07T09:26:17.675579Z 0 [Note] WSREP: discarding pending addr without UUID: tcp://172.30.0.65:4567
2022-09-07T09:26:18.027461Z 0 [Note] WSREP: gcomm: connected
2022-09-07T09:26:18.027598Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2022-09-07T09:26:18.027770Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2022-09-07T09:26:18.027777Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2022-09-07T09:26:18.027935Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 20965914-2e8f-11ed-961c-bbc22ab886f3
2022-09-07T09:26:18.029104Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3
2022-09-07T09:26:18.029470Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3 from 0 (srv-db02)
2022-09-07T09:26:18.029779Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3 from 1 (srv-db01)
2022-09-07T09:26:18.029807Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 1/2 (primary/total),
        act_id     = 52210161,
        last_appl. = -1,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T09:26:18.029836Z 0 [Note] WSREP: Flow-control interval: [141, 141]
2022-09-07T09:26:18.029857Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 52210161)
2022-09-07T09:26:18.029945Z 1 [Note] WSREP: State transfer required:
        Group state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161
        Local state: 00000000-0000-0000-0000-000000000000:-1
2022-09-07T09:26:18.029979Z 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T09:26:18.029986Z 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T09:26:18.029999Z 1 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161, view# 12: Primary, number of nodes: 2, my index: 0, protocol version 3
2022-09-07T09:26:18.030009Z 1 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T09:26:18.030014Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-09-07T09:26:18.030019Z 1 [Note] WSREP: Setting wsrep_ready to false
2022-09-07T09:26:18.030116Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.0.64' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7368' --mysqld-version '5.7.38-41-57'   '' )
        2022-09-07T09:26:18.367800Z WSREP_SST: [INFO] Streaming with xbstream
        2022-09-07T09:26:18.376212Z WSREP_SST: [WARNING] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
2022-09-07T09:26:18.702153Z 1 [Note] WSREP: Prepared SST/IST request: xtrabackup-v2|172.30.0.64:4444/xtrabackup_sst//1
2022-09-07T09:26:18.702234Z 1 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 1 -> 2)
2022-09-07T09:26:18.702272Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T09:26:18.702304Z 1 [Note] WSREP: Assign initial position for certification: 52210161, protocol version: 4
2022-09-07T09:26:18.702342Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T09:26:18.702364Z 1 [Note] WSREP: Check if state gap can be serviced using IST
2022-09-07T09:26:18.702374Z 1 [Note] WSREP: Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T09:26:18.702424Z 1 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2022-09-07T09:26:18.702433Z 1 [Note] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():518. IST will be unavailable.
2022-09-07T09:26:18.703175Z 0 [Note] WSREP: Member 0.0 (srv-db02) requested state transfer from '*any*'. Selected 1.0 (srv-db01)(SYNCED) as donor.
2022-09-07T09:26:18.703212Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 52210161)
2022-09-07T09:26:18.703254Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
2022-09-07T09:26:18.703271Z 1 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161
        2022-09-07T09:26:19.454000Z WSREP_SST: [INFO] Streaming with xbstream
        2022-09-07T09:26:19.460987Z WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer. Removing
        2022-09-07T09:26:19.465540Z WSREP_SST: [INFO] Proceeding with SST.........
        2022-09-07T09:26:19.490339Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2022-09-07T09:26:21.027786Z 0 [Note] WSREP: (2049cf39, 'tcp://0.0.0.0:4567') turning message relay requesting off
        2022-09-07T09:26:30.144070Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
        2022-09-07T09:26:30.145611Z WSREP_SST: [ERROR] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
        2022-09-07T09:26:30.147132Z WSREP_SST: [ERROR] ******************************************************
        2022-09-07T09:26:30.148738Z WSREP_SST: [ERROR] Cleanup after exit with status:2
2022-09-07T09:26:30.160282Z 0 [Warning] WSREP: 1.0 (srv-db01): State transfer to 0.0 (srv-db02) failed: -22 (Invalid argument)
2022-09-07T09:26:30.160368Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():811: Will never receive state. Need to abort.
2022-09-07T09:26:30.160428Z 0 [Note] WSREP: gcomm: terminating thread
2022-09-07T09:26:30.160481Z 0 [Note] WSREP: gcomm: joining thread
2022-09-07T09:26:30.160598Z 0 [Note] WSREP: gcomm: closing backend
2022-09-07T09:26:30.524573Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.0.64' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7368' --mysqld-version '5.7.38-41-57'   '' : 2 (No such file or directory)
2022-09-07T09:26:30.524671Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2022-09-07T09:26:30.524699Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2022-09-07T09:26:30.524750Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2022-09-07T09:26:30.524773Z 0 [ERROR] Aborting

2022-09-07T09:26:30.524792Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2022-09-07T09:26:30.524825Z 0 [Note] WSREP: SST request was cancelled
2022-09-07T09:26:30.524854Z 0 [Note] Giving 2 client threads a chance to die gracefully
2022-09-07T09:26:30.524889Z 1 [Note] WSREP: Closing send monitor...
2022-09-07T09:26:30.524958Z 1 [Note] WSREP: Closed send monitor.
2022-09-07T09:26:31.164254Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,2049cf39,1311)
memb {
        2049cf39,0
        }
joined {
        }
left {
        }
partitioned {
        b4cfd8a6,0
        }
)
2022-09-07T09:26:31.164322Z 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2022-09-07T09:26:31.164455Z 0 [Note] WSREP: gcomm: closed
2022-09-07T09:26:31.164472Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Can someone please help me an tell me how to recover the second and third node?

Thanks in advance.

1 Like

Hello, can you check the logs on the donor side? that should give you the root cause of SST failure

1 Like

Sorry, for that dumb question, but what is the Donor side?

1 Like

When you add a node to the PXC cluster the data is copied over from a “donor” node. Check the logs on the donor server for clues.

1 Like

Ok. I went to my bootstrapped Database Server and watched mysqld.log, but there is nothing happening when i start my second Databaseserver.

1 Like

I dont know why, but the second Databaseserver wont log anything to it self right now. So i fired up the third Databaseserver wich also doesnt join the cluster and the i got some Logs from the Donor Side:

09:40.169887Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') connection established to 54de9a1f tcp://172.30.0.65:4567
2022-09-07T13:09:40.170382Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2022-09-07T13:09:40.369188Z 0 [Note] WSREP: declaring 54de9a1f at tcp://172.30.0.65:4567 stable
2022-09-07T13:09:40.369529Z 0 [Note] WSREP: Node b4cfd8a6 state primary
2022-09-07T13:09:40.372186Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,54de9a1f,1313)
memb {
        54de9a1f,0
        b4cfd8a6,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2022-09-07T13:09:40.372214Z 0 [Note] WSREP: Save the discovered primary-component to disk
2022-09-07T13:09:40.374370Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2022-09-07T13:09:40.374397Z 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2022-09-07T13:09:40.669445Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb
2022-09-07T13:09:40.669651Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb from 0 (srv-db03)
2022-09-07T13:09:40.669665Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb from 1 (srv-db01)
2022-09-07T13:09:40.669675Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 13,
        members    = 1/2 (primary/total),
        act_id     = 52221005,
        last_appl. = 52220894,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T13:09:40.669684Z 0 [Note] WSREP: Flow-control interval: [141, 141]
2022-09-07T13:09:40.669797Z 7 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:40.669827Z 7 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:40.669839Z 7 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005, view# 14: Primary, number of nodes: 2, my index: 1, protocol version 3
2022-09-07T13:09:40.669845Z 7 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:40.669851Z 7 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 2)
2022-09-07T13:09:40.669857Z 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:40.670069Z 7 [Note] WSREP: Assign initial position for certification: 52221005, protocol version: 4
2022-09-07T13:09:40.670165Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T13:09:41.347037Z 0 [Note] WSREP: Member 0.0 (srv-db03) requested state transfer from '*any*'. Selected 1.0 (srv-db01)(SYNCED) as donor.
2022-09-07T13:09:41.347088Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 52221005)
2022-09-07T13:09:41.347143Z 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:41.347262Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005' )
2022-09-07T13:09:41.347762Z 7 [Note] WSREP: DONOR thread signaled with 0
        2022-09-07T13:09:41.695201Z WSREP_SST: [INFO] Streaming with xbstream
2022-09-07T13:09:43.375188Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') turning message relay requesting off
        2022-09-07T13:09:52.740847Z WSREP_SST: [INFO] Streaming the backup to joiner at 172.30.0.65 4444
2022-09-07T13:09:52.750106Z 9627 [Note] Access denied for user 'adam'@'localhost' (using password: NO)
        2022-09-07T13:09:52.753128Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
        2022-09-07T13:09:52.755891Z WSREP_SST: [ERROR] xtrabackup finished with error: 1.  Check /var/lib/mysql//innobackup.backup.log
--------------- innobackup.backup.log (START) --------------------
xtrabackup: recognized server arguments: --server-id=1 --datadir=/var/lib/mysql --log_bin --defaults_group=mysqld --parallel=4
xtrabackup: recognized client arguments: --port=3306 --socket=/var/run/mysqld/mysqld.sock --no-version-check=1 --socket=/var/run/mysqld/mysqld.sock --lock-ddl=1 --backup=1 --galera-info=1 --binlog-info=ON --stream=xbstream --target-dir=/tmp/pxc_sst_qpMQ/donor_xb_YAAT
220907 13:09:52 Connecting to MySQL server host: localhost, user: not set, password: not set, port: 3306, socket: /var/run/mysqld/mysqld.sock
Failed to connect to MySQL server: Access denied for user 'adam'@'localhost' (using password: NO).
--------------- innobackup.backup.log (END) ----------------------
        2022-09-07T13:09:52.759127Z WSREP_SST: [ERROR] ******************************************************
        2022-09-07T13:09:52.760826Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2022-09-07T13:09:52.772575Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005' : 22 (Invalid argument)
2022-09-07T13:09:52.772650Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005'
2022-09-07T13:09:52.773101Z 0 [Warning] WSREP: 1.0 (srv-db01): State transfer to 0.0 (srv-db03) failed: -22 (Invalid argument)
2022-09-07T13:09:52.773120Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 52221005)
2022-09-07T13:09:53.773826Z 0 [Note] WSREP: forgetting 54de9a1f (tcp://172.30.0.65:4567)
2022-09-07T13:09:53.773842Z 0 [Note] WSREP: Member 1.0 (srv-db01) synced with group.
2022-09-07T13:09:53.773920Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 52221005)
2022-09-07T13:09:53.773896Z 0 [Note] WSREP: Node b4cfd8a6 state primary
2022-09-07T13:09:53.773974Z 2 [Note] WSREP: Synchronized with group, ready for connections
2022-09-07T13:09:53.774003Z 2 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:53.774012Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:53.773986Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,b4cfd8a6,1314)
memb {
        b4cfd8a6,0
        }
joined {
        }
left {
        }
partitioned {
        54de9a1f,0
        }
)
2022-09-07T13:09:53.774077Z 0 [Note] WSREP: Save the discovered primary-component to disk
2022-09-07T13:09:53.777623Z 0 [Note] WSREP: forgetting 54de9a1f (tcp://172.30.0.65:4567)
2022-09-07T13:09:53.777647Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2022-09-07T13:09:53.777805Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5cff501c-2eae-11ed-99de-125c8611f164
2022-09-07T13:09:53.777823Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 5cff501c-2eae-11ed-99de-125c8611f164
2022-09-07T13:09:53.777833Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5cff501c-2eae-11ed-99de-125c8611f164 from 0 (srv-db01)
2022-09-07T13:09:53.777843Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 14,
        members    = 1/1 (primary/total),
        act_id     = 52221005,
        last_appl. = 52220894,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T13:09:53.777850Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2022-09-07T13:09:53.777961Z 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:53.777993Z 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:53.778010Z 3 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005, view# 15: Primary, number of nodes: 1, my index: 0, protocol version 3
2022-09-07T13:09:53.778020Z 3 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:53.778026Z 3 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 2 -> 1) (Increment: 2 -> 1)
2022-09-07T13:09:53.778032Z 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:53.778039Z 3 [Note] WSREP: Assign initial position for certification: 52221005, protocol version: 4
2022-09-07T13:09:53.778061Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T13:09:58.880792Z 0 [Note] WSREP:  cleaning up 54de9a1f (tcp://172.30.0.65:4567)

1 Like

Based on the logs there seems to be a permissions issue with the user you are configuring for SST. Try granting:

GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'adam'@'localhost';
1 Like

Strange… there is no user adam@localhost in that Database. Where did i configure the SST User?

1 Like

I dont know why, but in my wsrep.conf was no SST Configured. I just entered a valid user and now my nodes could join. Thanks for the help.

1 Like

glad to hear you sorted it out

1 Like