Recover XtraDB cluster nodes that cannot start

I’m testing a 3-node Percona XtraDB Cluster running 8.4, and I managed to break two of the nodes. The one node that remained running can still start in bootstrap mode, but the other two will not start, and they fail the same way when they attempt to recover using the running node as the SST donor. How exactly do I recover the cluster in this scenario?
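For context, the usual full-recovery flow is to find the most advanced node, bootstrap it, and let the others rejoin via SST/IST. The key fields live in grastate.dat. The snippet below is only an illustration with made-up values (paths assume a default PXC install; the uuid/seqno are invented):

```shell
# Write a sample grastate.dat to show the fields that matter
# (all values here are made up for illustration):
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    4d457573-df6a-11ef-80cd-cb7da77c25c8
seqno:   42
safe_to_bootstrap: 0
EOF

# On each stopped node you would inspect the real file
# (/var/lib/mysql/grastate.dat); the node with the highest seqno
# is the bootstrap candidate:
grep -E 'seqno|safe_to_bootstrap' /tmp/grastate.dat
```

On the chosen node you would set `safe_to_bootstrap: 1` and start it in bootstrap mode (e.g. `systemctl start mysql@bootstrap.service` on systemd installs), then start the remaining nodes normally so they request state transfer.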

mysqld.log on donor

2025-02-04T06:02:03.680717Z 0 [ERROR] [MY-000000] [WSREP-SST] ------------ innobackup.backup.log (cont) ------------
        2025-02-03T22:02:03.477081-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/user_variables_b_186.sdi to <STDOUT>
        2025-02-03T22:02:03.477127-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/log_status_184.sdi to <STDOUT>
        2025-02-03T22:02:03.477182-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/replication_appl_180.sdi to <STDOUT>
        2025-02-03T22:02:03.477183-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/replication_appl_181.sdi to <STDOUT>
        2025-02-03T22:02:03.478271-08:00 8 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/replication_asyn_183.sdi to <STDOUT>
        2025-02-03T22:02:03.478306-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/status_by_accoun_187.sdi to <STDOUT>
        2025-02-03T22:02:03.478365-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/log_status_184.sdi to <STDOUT>
        2025-02-03T22:02:03.478366-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/user_variables_b_186.sdi to <STDOUT>
        2025-02-03T22:02:03.479090-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/status_by_accoun_187.sdi to <STDOUT>
        2025-02-03T22:02:03.479178-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/replication_asyn_183.sdi to <STDOUT>
        2025-02-03T22:02:03.480074-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/global_status_191.sdi to <STDOUT>
        2025-02-03T22:02:03.480229-08:00 8 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/status_by_user_190.sdi to <STDOUT>
        2025-02-03T22:02:03.480582-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/status_by_thread_189.sdi to <STDOUT>
        2025-02-03T22:02:03.480602-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/prepared_stateme_185.sdi to <STDOUT>
        2025-02-03T22:02:03.481726-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/status_by_user_190.sdi to <STDOUT>
        2025-02-03T22:02:03.481727-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/global_status_191.sdi to <STDOUT>
        2025-02-03T22:02:03.482936-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/replication_asyn_182.sdi to <STDOUT>
        2025-02-03T22:02:03.483004-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/prepared_stateme_185.sdi to <STDOUT>
        2025-02-03T22:02:03.483089-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/status_by_thread_189.sdi to <STDOUT>
        2025-02-03T22:02:03.483834-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/replication_asyn_182.sdi to <STDOUT>
        2025-02-03T22:02:03.484323-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/session_status_192.sdi to <STDOUT>
        2025-02-03T22:02:03.482937-08:00 8 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/status_by_host_188.sdi to <STDOUT>
        2025-02-03T22:02:03.484681-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/global_variables_194.sdi to <STDOUT>
        2025-02-03T22:02:03.484912-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/session_status_192.sdi to <STDOUT>
        2025-02-03T22:02:03.485108-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/session_variable_195.sdi to <STDOUT>
        2025-02-03T22:02:03.485160-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/global_variables_194.sdi to <STDOUT>
        2025-02-03T22:02:03.486088-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/variables_info_196.sdi to <STDOUT>
        2025-02-03T22:02:03.486137-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/variables_by_thr_193.sdi to <STDOUT>
        2025-02-03T22:02:03.486184-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/session_variable_195.sdi to <STDOUT>
        2025-02-03T22:02:03.486185-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/status_by_host_188.sdi to <STDOUT>
        2025-02-03T22:02:03.487203-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/binary_log_trans_199.sdi to <STDOUT>
        2025-02-03T22:02:03.487320-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/variables_by_thr_193.sdi to <STDOUT>
        2025-02-03T22:02:03.487334-08:00 8 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/tls_channel_stat_200.sdi to <STDOUT>
        2025-02-03T22:02:03.487413-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/variables_info_196.sdi to <STDOUT>
        2025-02-03T22:02:03.488473-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/malloc_stats_tot_201.sdi to <STDOUT>
        2025-02-03T22:02:03.488538-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/user_defined_fun_198.sdi to <STDOUT>
        2025-02-03T22:02:03.488572-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/binary_log_trans_199.sdi to <STDOUT>
        2025-02-03T22:02:03.488590-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/tls_channel_stat_200.sdi to <STDOUT>
        2025-02-03T22:02:03.489584-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/keyring_componen_203.sdi to <STDOUT>
        2025-02-03T22:02:03.489679-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/malloc_stats_tot_201.sdi to <STDOUT>
        2025-02-03T22:02:03.489782-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/user_defined_fun_198.sdi to <STDOUT>
        2025-02-03T22:02:03.489844-08:00 8 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/persisted_variab_197.sdi to <STDOUT>
        2025-02-03T22:02:03.490750-08:00 6 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/malloc_stats_202.sdi to <STDOUT>
        2025-02-03T22:02:03.490839-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/keyring_componen_203.sdi to <STDOUT>
        2025-02-03T22:02:03.490961-08:00 8 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/persisted_variab_197.sdi to <STDOUT>
        2025-02-03T22:02:03.491661-08:00 7 [Note] [MY-011825] [Xtrabackup] Streaming <STDOUT>
        2025-02-03T22:02:03.491682-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming file <STDOUT>
        2025-02-03T22:02:03.491741-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming performance_schema/pxc_cluster_view_204.sdi to <STDOUT>
        2025-02-03T22:02:03.492096-08:00 6 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/malloc_stats_202.sdi to <STDOUT>
        2025-02-03T22:02:03.493453-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming performance_schema/pxc_cluster_view_204.sdi to <STDOUT>
        2025-02-03T22:02:03.511915-08:00 0 [Note] [MY-011825] [Xtrabackup] Finished backing up non-InnoDB tables and files
        2025-02-03T22:02:03.511941-08:00 0 [Note] [MY-011825] [Xtrabackup] Executing FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
        2025-02-03T22:02:03.520893-08:00 0 [Note] [MY-011825] [Xtrabackup] Selecting LSN and binary log position from p_s.log_status
        2025-02-04T06:02:03Z UTC - mysqld got signal 11 ;
        Signal SIGSEGV (unknown siginfo_t::si_code) at address 0x2030
        Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
        BuildID[sha1]=
        Thread pointer: 0x0
        Attempting backtrace. You can use the following information to find out
        where mysqld died. If you see no messages after this, something went
        terribly wrong...
        stack_bottom = 0 thread_stack 0x100000
         #0 0x7f7a48e3e72f <unknown>
         #1 0x7f7a48f5cefd <unknown>
         #2 0x8e89f4 <unknown>
         #3 0x90c0f5 <unknown>
         #4 0x8e616a <unknown>
         #5 0x886e82 <unknown>
         #6 0x7f7a48e295cf <unknown>
         #7 0x7f7a48e2967f <unknown>
         #8 0x8b6234 <unknown>
         #9 0xffffffffffffffff <unknown>

        Please report a bug at https://jira.percona.com/projects/PXB

2025-02-04T06:02:03.680736Z 0 [ERROR] [MY-000000] [WSREP-SST] ------------ innobackup.backup.log (END) ------------
2025-02-04T06:02:03.680743Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 2098
2025-02-04T06:02:03.680772Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2025-02-04T06:02:03.680851Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:22
2025-02-04T06:02:03.696715Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.2.130.221:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '8.4.3-3.1'  --binlog 'binlog' --gtid '4d457573-df6a-11ef-80cd-cb7da77c25c8:42' : 22 (Invalid argument)
2025-02-04T06:02:03.699175Z 0 [Note] [MY-000000] [Galera] SST sending failed: -22
2025-02-04T06:02:03.699195Z 0 [Note] [MY-000000] [WSREP] Server status change donor -> joined
2025-02-04T06:02:03.699217Z 0 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-02-04T06:02:03.699254Z 0 [ERROR] [MY-000000] [WSREP] Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.2.130.221:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '8.4.3-3.1'  --binlog 'binlog' --gtid '4d457573-df6a-11ef-80cd-cb7da77c25c8:42' 
2025-02-04T06:02:03.701876Z 0 [Warning] [MY-000000] [Galera] 0.0 (dr-test-sqldb2): State transfer to 1.0 (dr-test-sqldb1) failed: Invalid argument
2025-02-04T06:02:03.701898Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 43)

mysqld.log on failed member nodes

2025-02-04T06:02:02.087121Z 0 [Note] [MY-000000] [Galera] (855b77c2-94fa, 'ssl://0.0.0.0:4567') turning message relay requesting off
2025-02-04T06:02:02.881401Z 0 [Note] [MY-000000] [Galera] Member 2.0 (dr-test-sqldb3) requested state transfer from '*any*', but it is impossible to select State Transfer donor: No donor candidates temporarily available in suitable state
2025-02-04T06:02:03.687452Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2025-02-04T06:02:03.687491Z 0 [ERROR] [MY-000000] [WSREP-SST] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
2025-02-04T06:02:03.687501Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 2384
2025-02-04T06:02:03.687538Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2025-02-04T06:02:03.687647Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:2
2025-02-04T06:02:03.702523Z 0 [Warning] [MY-000000] [Galera] 0.0 (dr-test-sqldb2): State transfer to 1.0 (dr-test-sqldb1) failed: Invalid argument
2025-02-04T06:02:03.702554Z 0 [ERROR] [MY-000000] [Galera] ../../../../percona-xtradb-cluster-galera/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1310: Will never receive state. Need to abort.

I tried following the crash-recovery documentation (“Crash recovery” in the Percona XtraDB Cluster docs) and running SET GLOBAL wsrep_provider_options='pc.bootstrap=true'; but that doesn’t seem to help. It still errors the exact same way.
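One thing worth noting: pc.bootstrap is meant for the case where the surviving component has lost quorum (wsrep_cluster_status reports non-Primary); it does nothing for a failing SST. A decision sketch, where the status value would come from SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status' on the running node (the value below is a placeholder):

```shell
# Substitute the value your running node actually reports for
# wsrep_cluster_status; "Primary" here is an assumed example:
status="Primary"
if [ "$status" = "Primary" ]; then
  decision="quorum intact; debug the SST/donor instead of pc.bootstrap"
else
  decision="no quorum; pc.bootstrap=true on the most advanced node"
fi
echo "$decision"
```

Since the donor here is already Primary and serving as donor, the SST crash itself is the problem to chase, which matches the xtrabackup findings below.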

It looks like there is some sort of issue with xtrabackup itself. I’m able to reproduce the exact same error by running it manually:

/usr/bin/pxc_extra/pxb-8.4/bin/xtrabackup \
  --socket=/var/lib/mysql/mysql.sock \
  --no-version-check=1 \
  --parallel=4 \
  --lock-ddl=1 \
  --backup=1 \
  --galera-info=1 \
  --stream=xbstream \
  --xtrabackup-plugin-dir=/usr/bin/pxc_extra/pxb-8.4/lib/plugin \
  -u root -p
...
2025-02-05T10:24:18.205753-08:00 7 [Note] [MY-011825] [Xtrabackup] Done: Streaming file <STDOUT>
XBSTCK01Extrabackup_backupfiles/db.opt2025-02-05T10:24:18.205759-08:00 9 [Note] [MY-011825] [Xtrabackup] Streaming <STDOUT>
2025-02-05T10:24:18.205925-08:00 9 [Note] [MY-011825] [Xtrabackup] Done: Streaming file <STDOUT>
XBSTCK01Etest_db/db.opt2025-02-05T10:24:18.268614-08:00 0 [Note] [MY-011825] [Xtrabackup] Finished backing up non-InnoDB tables and files
2025-02-05T10:24:18.268738-08:00 0 [Note] [MY-011825] [Xtrabackup] Executing FLUSH NO_WRITE_TO_BINLOG BINARY LOGS
2025-02-05T10:24:18.277818-08:00 0 [Note] [MY-011825] [Xtrabackup] Selecting LSN and binary log position from p_s.log_status
2025-02-05T18:24:18Z UTC - mysqld got signal 11 ;
Signal SIGSEGV (unknown siginfo_t::si_code) at address 0x0
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
 #0 0x7f3009c3e72f <unknown>
 #1 0x7f3009d5cefd <unknown>
 #2 0x8e89f4 <unknown>
 #3 0x90c0f5 <unknown>
 #4 0x8e616a <unknown>
 #5 0x886e82 <unknown>
 #6 0x7f3009c295cf <unknown>
 #7 0x7f3009c2967f <unknown>
 #8 0x8b6234 <unknown>
 #9 0xffffffffffffffff <unknown>

I rebuilt the whole cluster from scratch, re-ran the xtrabackup command above, and it completed successfully. I assume the earlier backup attempt failed because the cluster had gotten into some odd state where quorum could not be reached; I recall the logs showing a vote in progress with only one vote cast, by the healthy node. I’m still not sure how to repair the cluster if this happens again.
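If the cluster ends up in that state again after a full stop, the standard repair is to recover each node’s last committed position (from grastate.dat, or from `mysqld --wsrep-recover` when seqno is -1) and bootstrap from the most advanced one. A toy sketch with invented node names and seqnos:

```shell
# Each line is "node seqno"; in practice these would be collected from
# grastate.dat / --wsrep-recover on every node. Values are made up:
printf 'sqldb1 40\nsqldb2 43\nsqldb3 41\n' > /tmp/seqnos.txt

# Highest seqno wins: this node should be bootstrapped first.
sort -k2 -n /tmp/seqnos.txt | tail -1
```

Here the last line printed is the bootstrap candidate; the other nodes are then started normally and rejoin via SST/IST.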

A similar xtrabackup crash has been reported here:

https://perconadev.atlassian.net/browse/PXB-3377