Stalled on STT even when other nodes are down

Greetings,

We have a two-node cluster and need to do a dirty restart. On rejoining the cluster we had a network glitch and now the primary (and only) node is stuck on

FROM NODE1 (which is running)

2021-12-13T08:39:28.428180Z 0 [Warning] WSREP: Member 0.0 (NODE2) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2021-12-13T08:39:29.429392Z 0 [Warning] WSREP: Member 0.0 (NODE2) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2021-12-13T08:39:30.430606Z 0 [Warning] WSREP: Member 0.0 (NODE2) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable

NODE1 has this STATUS:

wsrep_local_state_comment | Donor/Desynced

It never ends or goes out of that status … even if the other node is shut down.

Is it possible to get out of that donor status so we can actually restart the process?

1 Like

Hi Rafael,

What version are you in ? Are both nodes same version?
and how big your data set is ?

On each node there should be a “innobackupex.log” file. On donor it is the backup log, and on joiner it’s the log and both are located on “datadir” by default.

Make sure that the backup from “donor” is not still running by checking the log. Also make sure that wsrep_cluster_status is “Primary” as in The Primary Component in Galera Cluster | Galera Cluster for MySQL

Last but not least, PXC clusters must be 3 nodes , or else any glitch or flip in communication for a node will take down the other node as well. If you still plan to have two database nodes , you should setup a 3rd node with arbitrator. Arbitrator resource usage must be located on a 3rd box and use minimal resources just for cluster communication and quorum.
You can read more on:
https://galeracluster.com/library/documentation/arbitrator.html

Regards

1 Like

Thanxs @CTutte

What version are you in ?
percona-xtradb-cluster-full-57

Are both nodes same version?
Yes.

and how big your data set is ?
To my understanding it has to sync the entire DB … so like 750Gb or so.

innobackup.log from DONOR:

211214 10:32:32 >> log scanned up to (14740248342320)
211214 10:32:33 >> log scanned up to (14740248351203)
211214 10:32:34 >> log scanned up to (14740248359240)
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
compress: write to the destination stream failed.
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[04] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[04] xtrabackup: Error: failed to copy datafile.

mysql.log from DONOR

2021-12-14T09:30:07.821070Z 0 [ERROR] WSREP: Process was aborted.
2021-12-14T09:30:07.821182Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.2.1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.33-36-57'  --binlog '/var/log/mysql/mysql-bin' --gtid 'f4b2cd0b-1721-11e4-9cfc-830b046aee12:12424599779' : 2 (No such file or directory)
2021-12-14T09:30:07.821295Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.2.1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.33-36-57'  --binlog '/var/log/mysql/mysql-bin' --gtid 'f4b2cd0b-1721-11e4-9cfc-830b046aee12:12424599779' 
2021-12-14T09:30:07.821336Z 0 [Warning] WSREP: Could not find peer: 73b39a0b-5bc8-11ec-b90b-eb514a870cac
2021-12-14T09:30:07.821402Z 0 [Warning] WSREP: 0.0 (17db3): State transfer to -1.-1 (left the group) failed: -2 (No such file or directory)
	2021-12-14T09:37:37.161125Z WSREP_SST: [INFO] Streaming with xbstream
	2021-12-14T09:37:47.231616Z WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.2.1 4444
2021/12/14 09:49:06 socat[3394852] E write(6, 0x55cf54eb47b0, 8192): Connection reset by peer
2021-12-14T10:12:38.415636Z 0 [ERROR] WSREP: Process was aborted.
2021-12-14T10:12:38.415777Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.2.1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.33-36-57'  --binlog '/var/log/mysql/mysql-bin' --gtid 'f4b2cd0b-1721-11e4-9cfc-830b046aee12:12425100275' : 2 (No such file or directory)
2021-12-14T10:12:38.415876Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.2.1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.33-36-57'  --binlog '/var/log/mysql/mysql-bin' --gtid 'f4b2cd0b-1721-11e4-9cfc-830b046aee12:12425100275' 
2021-12-14T10:12:38.416533Z 0 [Warning] WSREP: Could not find peer: 5d6d3dbe-5be7-11ec-8fe7-ba5747d51203
2021-12-14T10:12:38.416592Z 0 [Warning] WSREP: 1.0 (17db3): State transfer to -1.-1 (left the group) failed: -2 (No such file or directory)
	2021-12-14T10:14:01.662563Z WSREP_SST: [INFO] Streaming with xbstream
	2021-12-14T10:14:11.729361Z WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.2.1 4444
2021/12/14 10:32:34 socat[3399629] E write(6, 0x558d6c53f7b0, 8192): Connection reset by peer

mysql.log from JOINER

Log of wsrep recovery (--wsrep-recover):
2021-12-14T09:37:24.620190Z 0 [Warning] The syntax '--log_warnings/-W' is deprecated and will be removed in a future release. Please use '--log_error_verbosity' instead.
2021-12-14T09:37:24.620420Z 0 [Warning] options --log-slow-admin-statements, --log-queries-not-using-indexes and --log-slow-slave-statements have no effect if --slow-query-log is not set
2021-12-14T09:37:24.620427Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-12-14T09:37:31.336308Z 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
mysqld: Table 'mysql.plugin' doesn't exist
2021-12-14T09:37:31.337330Z 0 [ERROR] Can't open the mysql.plugin table. Please run mysql_upgrade to create it.
2021-12-14T09:37:31.337984Z 0 [Warning] WSREP: Read non-wsrep XID from storage engines.
2021-12-14T09:37:35.156842Z 0 [Warning] The syntax '--log_warnings/-W' is deprecated and will be removed in a future release. Please use '--log_error_verbosity' instead.
2021-12-14T09:37:35.157057Z 0 [Warning] options --log-slow-admin-statements, --log-queries-not-using-indexes and --log-slow-slave-statements have no effect if --slow-query-log is not set
2021-12-14T09:37:35.157064Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-12-14T09:37:35.683869Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
	2021-12-14T09:37:36.155851Z WSREP_SST: [INFO] Streaming with xbstream
	2021-12-14T09:37:36.167044Z WSREP_SST: [WARNING] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
	2021-12-14T09:37:37.311431Z WSREP_SST: [INFO] Streaming with xbstream
	2021-12-14T09:37:37.322334Z WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer. Removing
	2021-12-14T09:37:37.341404Z WSREP_SST: [INFO] Proceeding with SST.........
	2021-12-14T09:37:37.371811Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
	2021-12-14T09:49:06.052017Z WSREP_SST: [ERROR] Killing SST (3632138) with SIGKILL after stalling for 120 seconds
/usr/bin/wsrep_sst_xtrabackup-v2: line 195: 3632140 Killed                  socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio
     3632141                       | xbstream -x $xbstream_eopts
	2021-12-14T09:49:06.075142Z WSREP_SST: [ERROR] ******************* FATAL ERROR ********************** 
	2021-12-14T09:49:06.077741Z WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 137 137
	2021-12-14T09:49:06.080179Z WSREP_SST: [ERROR] ****************************************************** 
	2021-12-14T09:49:06.084031Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2021-12-14T09:49:06.110694Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.2.1' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '3631529' --mysqld-version '5.7.35-38-57'  --binlog '/var/log/mysql/mysql-bin' : 32 (Broken pipe)
2021-12-14T09:49:06.110774Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2021-12-14T09:49:06.110788Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2021-12-14T09:49:06.110882Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2021-12-14T09:49:06.110911Z 0 [ERROR] Aborting

make sure that wsrep_cluster_status is “Primary”
Yes … the donor returns “Primary” on
show status like 'wsrep_cluster_status';
wsrep_cluster_status Primary

As said, our donor is stalled on “wsrep_local_state_comment Donor/Desynced” state even after the SST has failed and it’s aborted. We cannot push it out of that state so we can try again or do anything else. Its operational R/W and providing service but we don’t know how to clean that status so we can attempt a new rejoin

P.D.
I have also realized about wsrep_cluster_size 2 … so primary it’s thinking there is another node in the cluster … but the other node is down.

P.D.2
We DO have an arbitrator node as well, yes … but it’s not a data node.

1 Like