Simulated a crash of node C on Percona XtraDB Cluster 8.0.27; node C fails to rejoin the cluster

G’day,

I have prepared a new server stack to upgrade our live databases from Percona 5.6 to Percona XtraDB Cluster 8.0.27-18.1 on new hardware. Full version info: Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3.

I have 3 KVM hosts, each running a VM for PXC nodes A, B and C at 10.0.4.60, 10.0.5.60 and 10.0.6.60 respectively. To simulate a system failure, I did a hard reboot of KVM host C (IP 10.0.6.60). Nodes A and B continued to operate correctly, but when I restarted node C it would not join the cluster and mysqld failed to start.

Checking /var/log/mysql/error.log on node C, I see these error lines when it initiates SST:

2022-07-24T12:48:16.897944Z 2 [Note] [MY-000000] [Galera] State transfer required:
Group state: 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
Local state: 00000000-0000-0000-0000-000000000000:-1
2022-07-24T12:48:16.897962Z 2 [Note] [MY-000000] [WSREP] Server status change connected -> joiner
2022-07-24T12:48:16.897982Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-07-24T12:48:16.898341Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.6.60' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '1955' --mysqld-version '8.0.27-18.1' '' )
2022-07-24T12:48:17.442640Z 0 [Warning] [MY-000000] [WSREP-SST] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
2022-07-24T12:48:17.953420Z 2 [Note] [MY-000000] [WSREP] Prepared SST request: xtrabackup-v2|10.0.6.60:4444/xtrabackup_sst//1
2022-07-24T12:48:17.953555Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST
2022-07-24T12:48:17.953638Z 2 [Note] [MY-000000] [Galera] Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 38313acd-f77c-11ec-8b46-8fc0d93253f1
2022-07-24T12:48:17.953703Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 8152, STRv: 3
2022-07-24T12:48:17.953935Z 2 [Note] [MY-000000] [Galera] IST receiver addr using ssl://10.0.6.60:4568
2022-07-24T12:48:17.954089Z 2 [Note] [MY-000000] [Galera] IST receiver using ssl
2022-07-24T12:48:17.955215Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 0-8152, listening at: ssl://10.0.6.60:4568
2022-07-24T12:48:17.956735Z 0 [Note] [MY-000000] [Galera] Member 2.0 (totecs-cluster-node-2) requested state transfer from 'any'. Selected 0.0 (totecs-cluster-node-1)(SYNCED) as donor.
2022-07-24T12:48:17.956814Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY -> JOINER (TO: 8152)
2022-07-24T12:48:17.956921Z 2 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 0
2022-07-24T12:48:17.956957Z 2 [Note] [MY-000000] [Galera] Resetting GCache seqno map due to different histories.
2022-07-24T12:48:17.956986Z 2 [Note] [MY-000000] [Galera] GCache history reset: 38313acd-f77c-11ec-8b46-8fc0d93253f1:0 -> 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
2022-07-24T12:48:17.959032Z 0 [Warning] [MY-000000] [Galera] 0.0 (totecs-cluster-node-1): State transfer to 2.0 (totecs-cluster-node-2) failed: -111 (Connection refused)
2022-07-24T12:48:17.960088Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1214: Will never receive state. Need to abort.
2022-07-24T12:48:17.960126Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
2022-07-24T12:48:17.960195Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread
2022-07-24T12:48:17.960423Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend

On the donor node A, /var/log/mysql/error.log contains the following:

2022-07-24T12:48:16.886571Z 2 [Note] [MY-000000] [Galera]
View:
id: 38313acd-f77c-11ec-8b46-8fc0d93253f1:8152
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 1
members(3):
0: becb78dc-0b4b-11ed-a681-1fda10a70f55, totecs-cluster-node-1
1: c385cf3d-0b4e-11ed-b94c-f22715badf8e, totecs-cluster-node-0
2: e2cee49e-0b4e-11ed-b7b3-a2469236d0de, totecs-cluster-node-2
2022-07-24T12:48:16.886625Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-07-24T12:48:16.891260Z 2 [Note] [MY-000000] [Galera] Recording CC from group: 8152
2022-07-24T12:48:16.891297Z 2 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 8035
2022-07-24T12:48:16.891310Z 2 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 6382
2022-07-24T12:48:17.942724Z 0 [Note] [MY-000000] [Galera] Member 2.0 (totecs-cluster-node-2) requested state transfer from 'any'. Selected 0.0 (totecs-cluster-node-1)(SYNCED) as donor.
2022-07-24T12:48:17.944961Z 0 [Warning] [MY-000000] [Galera] 0.0 (totecs-cluster-node-1): State transfer to 2.0 (totecs-cluster-node-2) failed: -111 (Connection refused)

I then destroyed and redeployed the VM for node C and started a fresh init, but I get the same error.

It looks like the cluster is preventing node C from connecting. I also stopped node A and tried again with only node B running, and the same issue occurred.

Is error -111 an authentication issue, or is there some way to clear whatever is blocking node C so it can rejoin the cluster?

Or, am I missing something obvious?


Hi!
Did you check that all required ports are whitelisted and accessible to and from all nodes?
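For reference, here is a rough sketch of that check, run from each node in turn (the IPs are the ones from your post; PXC uses 3306 for MySQL clients, 4444 for SST, 4567 for group communication and 4568 for IST — `check_port` is just a hypothetical helper using bash's /dev/tcp, so it works even where nc is not installed):

```shell
# Hypothetical helper: succeeds if a TCP connection to $1:$2 opens within 1s.
check_port() {
  timeout 1 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

NODES="10.0.4.60 10.0.5.60 10.0.6.60"  # nodes A, B, C from the post
PORTS="3306 4444 4567 4568"            # mysql, SST, group comms, IST

for node in $NODES; do
  for port in $PORTS; do
    if check_port "$node" "$port"; then
      echo "$node:$port open"
    else
      echo "$node:$port BLOCKED"
    fi
  done
done
```

It matters that you run this from every node, not just one: during SST the donor connects back to the joiner's port 4444, so a firewall that only allows traffic in one direction would produce exactly the "Connection refused" you are seeing on the donor side.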

I wonder if you also checked network health and tried to send some packets between node C and the others. Are you only seeing these issues with node C? Did you also try to force an SST from node C by clearing out its datadir?
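In case it helps, a minimal sketch of that forced-SST step on node C, assuming the datadir shown in your joiner log (/var/lib/mysql) and that mysqld is stopped first; `force_sst` is a hypothetical helper, not a Percona tool:

```shell
# DATADIR matches the --datadir in the joiner log; adjust if yours differs.
DATADIR=${DATADIR:-/var/lib/mysql}

# Hypothetical helper: wipe the state files so the node requests a full SST.
# DANGER: only run this on the failed joiner, never on a healthy donor.
force_sst() {
  # mysqld must already be stopped, e.g.: systemctl stop mysql
  rm -f "$DATADIR/sst_in_progress"  # stale marker left by the interrupted SST
  rm -f "$DATADIR/grastate.dat"     # no grastate.dat => node asks for a full SST
  # afterwards: systemctl start mysql
}
```

A node without grastate.dat reports the all-zero local UUID (as your node C log already shows), so on restart it will request a full state snapshot from whichever donor the cluster selects.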
