Crashed Cluste, Nodes not Joinig

Hi there. My monitoring told me this morning, that my whole cluster ist down. I cloud bootstrap one Node, but the second and third node wont join the Cluster anymore. Here’s what my log tells me:

2022-09-07T09:26:17.675579Z 0 [Note] WSREP: discarding pending addr without UUID: tcp://172.30.0.65:4567
2022-09-07T09:26:18.027461Z 0 [Note] WSREP: gcomm: connected
2022-09-07T09:26:18.027598Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2022-09-07T09:26:18.027770Z 0 [Note] WSREP: Waiting for SST/IST to complete.
2022-09-07T09:26:18.027777Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2022-09-07T09:26:18.027935Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 20965914-2e8f-11ed-961c-bbc22ab886f3
2022-09-07T09:26:18.029104Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3
2022-09-07T09:26:18.029470Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3 from 0 (srv-db02)
2022-09-07T09:26:18.029779Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 20965914-2e8f-11ed-961c-bbc22ab886f3 from 1 (srv-db01)
2022-09-07T09:26:18.029807Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 1/2 (primary/total),
        act_id     = 52210161,
        last_appl. = -1,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T09:26:18.029836Z 0 [Note] WSREP: Flow-control interval: [141, 141]
2022-09-07T09:26:18.029857Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 52210161)
2022-09-07T09:26:18.029945Z 1 [Note] WSREP: State transfer required:
        Group state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161
        Local state: 00000000-0000-0000-0000-000000000000:-1
2022-09-07T09:26:18.029979Z 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T09:26:18.029986Z 1 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T09:26:18.029999Z 1 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161, view# 12: Primary, number of nodes: 2, my index: 0, protocol version 3
2022-09-07T09:26:18.030009Z 1 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T09:26:18.030014Z 1 [Warning] WSREP: Gap in state sequence. Need state transfer.
2022-09-07T09:26:18.030019Z 1 [Note] WSREP: Setting wsrep_ready to false
2022-09-07T09:26:18.030116Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.0.64' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7368' --mysqld-version '5.7.38-41-57'   '' )
        2022-09-07T09:26:18.367800Z WSREP_SST: [INFO] Streaming with xbstream
        2022-09-07T09:26:18.376212Z WSREP_SST: [WARNING] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
2022-09-07T09:26:18.702153Z 1 [Note] WSREP: Prepared SST/IST request: xtrabackup-v2|172.30.0.64:4444/xtrabackup_sst//1
2022-09-07T09:26:18.702234Z 1 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 1 -> 2)
2022-09-07T09:26:18.702272Z 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T09:26:18.702304Z 1 [Note] WSREP: Assign initial position for certification: 52210161, protocol version: 4
2022-09-07T09:26:18.702342Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T09:26:18.702364Z 1 [Note] WSREP: Check if state gap can be serviced using IST
2022-09-07T09:26:18.702374Z 1 [Note] WSREP: Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T09:26:18.702424Z 1 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2022-09-07T09:26:18.702433Z 1 [Note] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():518. IST will be unavailable.
2022-09-07T09:26:18.703175Z 0 [Note] WSREP: Member 0.0 (srv-db02) requested state transfer from '*any*'. Selected 1.0 (srv-db01)(SYNCED) as donor.
2022-09-07T09:26:18.703212Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 52210161)
2022-09-07T09:26:18.703254Z 1 [Note] WSREP: Requesting state transfer: success, donor: 1
2022-09-07T09:26:18.703271Z 1 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52210161
        2022-09-07T09:26:19.454000Z WSREP_SST: [INFO] Streaming with xbstream
        2022-09-07T09:26:19.460987Z WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer. Removing
        2022-09-07T09:26:19.465540Z WSREP_SST: [INFO] Proceeding with SST.........
        2022-09-07T09:26:19.490339Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2022-09-07T09:26:21.027786Z 0 [Note] WSREP: (2049cf39, 'tcp://0.0.0.0:4567') turning message relay requesting off
        2022-09-07T09:26:30.144070Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
        2022-09-07T09:26:30.145611Z WSREP_SST: [ERROR] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
        2022-09-07T09:26:30.147132Z WSREP_SST: [ERROR] ******************************************************
        2022-09-07T09:26:30.148738Z WSREP_SST: [ERROR] Cleanup after exit with status:2
2022-09-07T09:26:30.160282Z 0 [Warning] WSREP: 1.0 (srv-db01): State transfer to 0.0 (srv-db02) failed: -22 (Invalid argument)
2022-09-07T09:26:30.160368Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():811: Will never receive state. Need to abort.
2022-09-07T09:26:30.160428Z 0 [Note] WSREP: gcomm: terminating thread
2022-09-07T09:26:30.160481Z 0 [Note] WSREP: gcomm: joining thread
2022-09-07T09:26:30.160598Z 0 [Note] WSREP: gcomm: closing backend
2022-09-07T09:26:30.524573Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.0.64' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7368' --mysqld-version '5.7.38-41-57'   '' : 2 (No such file or directory)
2022-09-07T09:26:30.524671Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2022-09-07T09:26:30.524699Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2022-09-07T09:26:30.524750Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2022-09-07T09:26:30.524773Z 0 [ERROR] Aborting

2022-09-07T09:26:30.524792Z 0 [Note] WSREP: Signalling cancellation of the SST request.
2022-09-07T09:26:30.524825Z 0 [Note] WSREP: SST request was cancelled
2022-09-07T09:26:30.524854Z 0 [Note] Giving 2 client threads a chance to die gracefully
2022-09-07T09:26:30.524889Z 1 [Note] WSREP: Closing send monitor...
2022-09-07T09:26:30.524958Z 1 [Note] WSREP: Closed send monitor.
2022-09-07T09:26:31.164254Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,2049cf39,1311)
memb {
        2049cf39,0
        }
joined {
        }
left {
        }
partitioned {
        b4cfd8a6,0
        }
)
2022-09-07T09:26:31.164322Z 0 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2022-09-07T09:26:31.164455Z 0 [Note] WSREP: gcomm: closed
2022-09-07T09:26:31.164472Z 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Can someone please help me an tell me how to recover the second and third node?

Thanks in advance.

Hello, can you check the logs on the donor side? that should give you the root cause of SST failure

Sorry, for that dumb question, but what is the Donor side?

When you add a node to the PXC cluster the data is copied over from a “donor” node. Check the logs on the donor server for clues.

Ok. I went to my bootstrapped Database Server and watched mysqld.log, but there is nothing happening when i start my second Databaseserver.

I dont know why, but the second Databaseserver wont log anything to it self right now. So i fired up the third Databaseserver wich also doesnt join the cluster and the i got some Logs from the Donor Side:

09:40.169887Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') connection established to 54de9a1f tcp://172.30.0.65:4567
2022-09-07T13:09:40.170382Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2022-09-07T13:09:40.369188Z 0 [Note] WSREP: declaring 54de9a1f at tcp://172.30.0.65:4567 stable
2022-09-07T13:09:40.369529Z 0 [Note] WSREP: Node b4cfd8a6 state primary
2022-09-07T13:09:40.372186Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,54de9a1f,1313)
memb {
        54de9a1f,0
        b4cfd8a6,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2022-09-07T13:09:40.372214Z 0 [Note] WSREP: Save the discovered primary-component to disk
2022-09-07T13:09:40.374370Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2022-09-07T13:09:40.374397Z 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2022-09-07T13:09:40.669445Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb
2022-09-07T13:09:40.669651Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb from 0 (srv-db03)
2022-09-07T13:09:40.669665Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5530533a-2eae-11ed-9123-d3b2a26834eb from 1 (srv-db01)
2022-09-07T13:09:40.669675Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 13,
        members    = 1/2 (primary/total),
        act_id     = 52221005,
        last_appl. = 52220894,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T13:09:40.669684Z 0 [Note] WSREP: Flow-control interval: [141, 141]
2022-09-07T13:09:40.669797Z 7 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:40.669827Z 7 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:40.669839Z 7 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005, view# 14: Primary, number of nodes: 2, my index: 1, protocol version 3
2022-09-07T13:09:40.669845Z 7 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:40.669851Z 7 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 2)
2022-09-07T13:09:40.669857Z 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:40.670069Z 7 [Note] WSREP: Assign initial position for certification: 52221005, protocol version: 4
2022-09-07T13:09:40.670165Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T13:09:41.347037Z 0 [Note] WSREP: Member 0.0 (srv-db03) requested state transfer from '*any*'. Selected 1.0 (srv-db01)(SYNCED) as donor.
2022-09-07T13:09:41.347088Z 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 52221005)
2022-09-07T13:09:41.347143Z 7 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:41.347262Z 0 [Note] WSREP: Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005' )
2022-09-07T13:09:41.347762Z 7 [Note] WSREP: DONOR thread signaled with 0
        2022-09-07T13:09:41.695201Z WSREP_SST: [INFO] Streaming with xbstream
2022-09-07T13:09:43.375188Z 0 [Note] WSREP: (b4cfd8a6, 'tcp://0.0.0.0:4567') turning message relay requesting off
        2022-09-07T13:09:52.740847Z WSREP_SST: [INFO] Streaming the backup to joiner at 172.30.0.65 4444
2022-09-07T13:09:52.750106Z 9627 [Note] Access denied for user 'adam'@'localhost' (using password: NO)
        2022-09-07T13:09:52.753128Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
        2022-09-07T13:09:52.755891Z WSREP_SST: [ERROR] xtrabackup finished with error: 1.  Check /var/lib/mysql//innobackup.backup.log
--------------- innobackup.backup.log (START) --------------------
xtrabackup: recognized server arguments: --server-id=1 --datadir=/var/lib/mysql --log_bin --defaults_group=mysqld --parallel=4
xtrabackup: recognized client arguments: --port=3306 --socket=/var/run/mysqld/mysqld.sock --no-version-check=1 --socket=/var/run/mysqld/mysqld.sock --lock-ddl=1 --backup=1 --galera-info=1 --binlog-info=ON --stream=xbstream --target-dir=/tmp/pxc_sst_qpMQ/donor_xb_YAAT
220907 13:09:52 Connecting to MySQL server host: localhost, user: not set, password: not set, port: 3306, socket: /var/run/mysqld/mysqld.sock
Failed to connect to MySQL server: Access denied for user 'adam'@'localhost' (using password: NO).
--------------- innobackup.backup.log (END) ----------------------
        2022-09-07T13:09:52.759127Z WSREP_SST: [ERROR] ******************************************************
        2022-09-07T13:09:52.760826Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2022-09-07T13:09:52.772575Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005' : 22 (Invalid argument)
2022-09-07T13:09:52.772650Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.65:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.38-41-57'   '' --gtid '8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005'
2022-09-07T13:09:52.773101Z 0 [Warning] WSREP: 1.0 (srv-db01): State transfer to 0.0 (srv-db03) failed: -22 (Invalid argument)
2022-09-07T13:09:52.773120Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 52221005)
2022-09-07T13:09:53.773826Z 0 [Note] WSREP: forgetting 54de9a1f (tcp://172.30.0.65:4567)
2022-09-07T13:09:53.773842Z 0 [Note] WSREP: Member 1.0 (srv-db01) synced with group.
2022-09-07T13:09:53.773920Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 52221005)
2022-09-07T13:09:53.773896Z 0 [Note] WSREP: Node b4cfd8a6 state primary
2022-09-07T13:09:53.773974Z 2 [Note] WSREP: Synchronized with group, ready for connections
2022-09-07T13:09:53.774003Z 2 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:53.774012Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:53.773986Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,b4cfd8a6,1314)
memb {
        b4cfd8a6,0
        }
joined {
        }
left {
        }
partitioned {
        54de9a1f,0
        }
)
2022-09-07T13:09:53.774077Z 0 [Note] WSREP: Save the discovered primary-component to disk
2022-09-07T13:09:53.777623Z 0 [Note] WSREP: forgetting 54de9a1f (tcp://172.30.0.65:4567)
2022-09-07T13:09:53.777647Z 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2022-09-07T13:09:53.777805Z 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5cff501c-2eae-11ed-99de-125c8611f164
2022-09-07T13:09:53.777823Z 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 5cff501c-2eae-11ed-99de-125c8611f164
2022-09-07T13:09:53.777833Z 0 [Note] WSREP: STATE EXCHANGE: got state msg: 5cff501c-2eae-11ed-99de-125c8611f164 from 0 (srv-db01)
2022-09-07T13:09:53.777843Z 0 [Note] WSREP: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 14,
        members    = 1/1 (primary/total),
        act_id     = 52221005,
        last_appl. = 52220894,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9
2022-09-07T13:09:53.777850Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2022-09-07T13:09:53.777961Z 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:53.777993Z 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
2022-09-07T13:09:53.778010Z 3 [Note] WSREP: New cluster view: global state: 8054bb0c-c25d-11e9-8c73-ab46b0e3a8d9:52221005, view# 15: Primary, number of nodes: 1, my index: 0, protocol version 3
2022-09-07T13:09:53.778020Z 3 [Note] WSREP: Setting wsrep_ready to true
2022-09-07T13:09:53.778026Z 3 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 2 -> 1) (Increment: 2 -> 1)
2022-09-07T13:09:53.778032Z 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-07T13:09:53.778039Z 3 [Note] WSREP: Assign initial position for certification: 52221005, protocol version: 4
2022-09-07T13:09:53.778061Z 0 [Note] WSREP: Service thread queue flushed.
2022-09-07T13:09:58.880792Z 0 [Note] WSREP:  cleaning up 54de9a1f (tcp://172.30.0.65:4567)

Based on the logs there seems to be a permissions issue with the user you are configuring for SST. Try granting:

GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'adam'@'localhost';

Strange… there is no user adam@localhost in that Database. Where did i configure the SST User?

I dont know why, but in my wsrep.conf was no SST Configured. I just entered a valid user and now my nodes could join. Thanks for the help.

glad to hear you sorted it out