Simulated a crash of node C on Percona XtraDB Cluster 8.0.27 and the node failed to rejoin the cluster

G’day,

I have prepared a new server stack to upgrade our live databases from Percona 5.6 to version 8.0.27-18.1 on new hardware. More version info: Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3

I have 3 KVM hosts, each running one VM for PXC nodes A, B and C on 10.0.4.60, 10.0.5.60 and 10.0.6.60 respectively. To simulate a system failure, I did a hard reboot of KVM host C (IP 10.0.6.60). Nodes A and B continued to operate correctly. When I restarted node C, it would not join the cluster and mysqld failed to start.

Checking /var/log/mysql/error.log on Node C, I see these error lines in the logs when initiating SST:

2022-07-24T12:48:16.897944Z 2 [Note] [MY-000000] [Galera] State transfer required:
Group state: 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
Local state: 00000000-0000-0000-0000-000000000000:-1
2022-07-24T12:48:16.897962Z 2 [Note] [MY-000000] [WSREP] Server status change connected → joiner
2022-07-24T12:48:16.897982Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-07-24T12:48:16.898341Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.6.60' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '1955' --mysqld-version '8.0.27-18.1' '' )
2022-07-24T12:48:17.442640Z 0 [Warning] [MY-000000] [WSREP-SST] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
2022-07-24T12:48:17.953420Z 2 [Note] [MY-000000] [WSREP] Prepared SST request: xtrabackup-v2|10.0.6.60:4444/xtrabackup_sst//1
2022-07-24T12:48:17.953555Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST
2022-07-24T12:48:17.953638Z 2 [Note] [MY-000000] [Galera] Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 38313acd-f77c-11ec-8b46-8fc0d93253f1
2022-07-24T12:48:17.953703Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 8152, STRv: 3
2022-07-24T12:48:17.953935Z 2 [Note] [MY-000000] [Galera] IST receiver addr using ssl://10.0.6.60:4568
2022-07-24T12:48:17.954089Z 2 [Note] [MY-000000] [Galera] IST receiver using ssl
2022-07-24T12:48:17.955215Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 0-8152, listening at: ssl://10.0.6.60:4568
2022-07-24T12:48:17.956735Z 0 [Note] [MY-000000] [Galera] Member 2.0 (totecs-cluster-node-2) requested state transfer from ‘any’. Selected 0.0 (totecs-cluster-node-1)(SYNCED) as donor.
2022-07-24T12:48:17.956814Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY → JOINER (TO: 8152)
2022-07-24T12:48:17.956921Z 2 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 0
2022-07-24T12:48:17.956957Z 2 [Note] [MY-000000] [Galera] Resetting GCache seqno map due to different histories.
2022-07-24T12:48:17.956986Z 2 [Note] [MY-000000] [Galera] GCache history reset: 38313acd-f77c-11ec-8b46-8fc0d93253f1:0 → 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
2022-07-24T12:48:17.959032Z 0 [Warning] [MY-000000] [Galera] 0.0 (totecs-cluster-node-1): State transfer to 2.0 (totecs-cluster-node-2) failed: -111 (Connection refused)
2022-07-24T12:48:17.960088Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1214: Will never receive state. Need to abort.
2022-07-24T12:48:17.960126Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
2022-07-24T12:48:17.960195Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread
2022-07-24T12:48:17.960423Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend

On the Donor node A, /var/log/mysql/error.log contains the following:

2022-07-24T12:48:16.886571Z 2 [Note] [MY-000000] [Galera]
View:
id: 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 1
members(3):
0: becb78dc-0b4b-11ed-a681-1fda10a70f55, totecs-cluster-node-1
1: c385cf3d-0b4e-11ed-b94c-f22715badf8e, totecs-cluster-node-0
2: e2cee49e-0b4e-11ed-b7b3-a2469236d0de, totecs-cluster-node-2
2022-07-24T12:48:16.886625Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-07-24T12:48:16.891260Z 2 [Note] [MY-000000] [Galera] Recording CC from group: 8152
2022-07-24T12:48:16.891297Z 2 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 8035
2022-07-24T12:48:16.891310Z 2 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 6382
2022-07-24T12:48:17.942724Z 0 [Note] [MY-000000] [Galera] Member 2.0 (totecs-cluster-node-2) requested state transfer from ‘any’. Selected 0.0 (totecs-cluster-node-1)(SYNCED) as donor.
2022-07-24T12:48:17.944961Z 0 [Warning] [MY-000000] [Galera] 0.0 (totecs-cluster-node-1): State transfer to 2.0 (totecs-cluster-node-2) failed: -111 (Connection refused)

I then destroyed and redeployed the VM for node C and started a fresh init, but I got the same error.

It looks like the cluster is preventing Node C from connecting. I also stopped node A and tried again with only node B running and the same issue occurs.

Is this an authentication issue with error -111 or is there some way to clear the block so node C can rejoin the cluster?

Or, am I missing something obvious?

Hi!
Did you check that all required ports are whitelisted and accessible to and from all nodes?

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/security/secure-network.html#firewall-configuration

I wonder if you also checked network health, and tried to send some packets between node C and the others? Are you only seeing these issues with node C? Did you also try to force an SST from node C, by clearing out its datadir?
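
A quick way to test reachability between the nodes is a plain TCP connect against each port. The following is only a minimal sketch, assuming the default PXC ports (3306, 4444, 4567, 4568) are in use; the function name check_ports is made up for illustration. Note that 4444 normally only listens on the joiner while an SST is in progress, so seeing it closed on an idle node is expected.

```python
import socket

# Default PXC ports (an assumption; adjust if your my.cnf overrides them):
# 3306 MySQL clients, 4444 SST, 4567 group communication, 4568 IST.
PXC_PORTS = (3306, 4444, 4567, 4568)


def check_ports(host, ports=PXC_PORTS, timeout=3.0):
    """Return {port: bool}, True when a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example, run from node C against node A (IP taken from the post above):
#   print(check_ports("10.0.4.60"))
```

A successful connect only shows that the port is open at the TCP level. It says nothing about whether the SSL handshake on top of it would succeed.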

I've found the problem. It was the certificates: the ones on node C were not the same as those on the running nodes.
Thanks in any case!
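
For anyone hitting the same thing: the logs above show the IST receiver using ssl, so every node needs the same certificate files. Below is a rough sketch of how the files could be compared by SHA-256 digest. It assumes the files are named ca.pem, server-cert.pem and server-key.pem and that a copy of each node's files is available in a local directory; the file names, directory layout and helper names are assumptions for illustration, not something confirmed in this thread.

```python
import hashlib
from pathlib import Path

# Assumed file names; check the ssl-ca / ssl-cert / ssl-key settings in my.cnf.
CERT_FILES = ("ca.pem", "server-cert.pem", "server-key.pem")


def fingerprints(directory, names=CERT_FILES):
    """Map each certificate file found in `directory` to its SHA-256 hex digest."""
    base = Path(directory)
    return {
        name: hashlib.sha256((base / name).read_bytes()).hexdigest()
        for name in names
        if (base / name).is_file()
    }


def mismatched(a, b):
    """Return sorted file names whose digests differ or that exist on one side only."""
    return sorted(name for name in set(a) | set(b) if a.get(name) != b.get(name))


# Example (hypothetical directories holding copies from node A and node C):
#   print(mismatched(fingerprints("certs/node-a"), fingerprints("certs/node-c")))
```

An empty list means the two sets of files are byte-for-byte identical. Any name in the list is a file to recopy from a working node before restarting the joiner.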