SST dies after 120-second stall after transferring 300+ GB, on AWS only

sgales · May 11, 2022, 9:37pm

Hey folks, we’re running into a weird problem in one of our AWS DCs. SST transfers work for a while, then after 300+ GB of transfer they suddenly stall for 120+ seconds and time out. We are sort of working around this using ncat and scp, but without working SSTs that’s a lot of stuff to maintain. We are trying an SST between nodes, but we keep getting the error:
“[ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds”

Basic setup:

Docker container on ec2 instances
As far as we know, the security groups are set up so that the SST/Galera communication ports work without issue, especially between nodes in the cluster.
Using Percona versions 5.7.35-38-57 and 5.7.36-39-57; both fail.

We lowered net.ipv4.tcp_keepalive_time to 60, thinking this might help, and it seems to have: the transfer went from failing at ~300 GB to failing at ~800 GB. Sadly we have 1.7 TB of data to transfer, so we’re still falling short of the goal. This seems to happen only to our clusters in our us-east-2 DC in AWS. Co-los and other AWS regions seem relatively unaffected (though we don’t have databases as large in our other AWS regions).
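For a sense of what that sysctl controls, here is a rough sketch of the arithmetic. It assumes the standard Linux keepalive behaviour (an idle period, then a fixed number of probes at a fixed interval), and it only matters for sockets that have SO_KEEPALIVE enabled, which this thread does not confirm for the SST stream. The helper names are made up for illustration.

```shell
#!/bin/sh
# Worst-case seconds before an idle connection to an unresponsive peer is
# declared dead: idle time before the first probe, plus one interval per probe.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
keepalive_dead_after() {
    echo $(( $1 + $2 * $3 ))
}

# Read a live value from /proc when it is readable (Linux only); otherwise
# fall back to the value passed as $2 (the usual kernel defaults below).
read_keepalive() {
    f="/proc/sys/net/ipv4/$1"
    if [ -r "$f" ]; then cat "$f"; else echo "$2"; fi
}

idle=$(read_keepalive tcp_keepalive_time 7200)
intvl=$(read_keepalive tcp_keepalive_intvl 75)
probes=$(read_keepalive tcp_keepalive_probes 9)

echo "this host: time=$idle intvl=$intvl probes=$probes"
echo "dead peer detected after: $(keepalive_dead_after "$idle" "$intvl" "$probes")s"
echo "with time=60 and default intvl/probes: $(keepalive_dead_after 60 75 9)s"
```

Run it inside the container as well as on the EC2 host, since the two may not report the same values.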

Here’s the log from the receiver side (in a docker container):

2022-05-11T17:16:01.120928Z 0 [Note] WSREP: (052bf080, 'ssl://0.0.0.0:4567') turning message relay requesting off
	2022-05-11T19:17:45.796316Z WSREP_SST: [ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds
/usr/bin/wsrep_sst_xtrabackup-v2: line 195:  2321 Killed                  socat -u openssl-listen:4444,reuseaddr,cert=/var/lib/mysql/certs/cert.pem,key=/var/lib/mysql/certs/key.pem,cafile=/var/lib/mysql/certs/ca.pem,verify=1,retry=30 stdio
      2322                       | lz4 -d
      2323                       | xbstream -x $xbstream_eopts
	2022-05-11T19:17:45.801127Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
	2022-05-11T19:17:45.802282Z WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 137 137 137
	2022-05-11T19:17:45.803401Z WSREP_SST: [ERROR] ******************************************************
2022-05-11T19:17:45.804955Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2022-05-11T19:17:45.810291Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.66.68.8:4444' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1708' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' : 32 (Broken pipe)
2022-05-11T19:17:45.810332Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2022-05-11T19:17:45.810343Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2022-05-11T19:17:45.810386Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)

The sender side is pretty uninteresting; it just shows that the transfer timed out and failed:

xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pip

and

2022-05-11T19:17:46.861872Z WSREP_SST: [ERROR] ******************************************************
2022-05-11T19:17:46.863480Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2022-05-11T19:17:46.872298Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.1:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos' : 22 (Invalid argument)
2022-05-11T19:17:46.872363Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.1:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos'

(I’ve replaced the IP and GTID values with placeholder strings; the real values are valid.)

sgales · June 23, 2022, 8:03pm

Just in case anyone runs into this: we solved it by setting the sst-idle-timeout variable in our my.cnf to 3600 seconds. Our guess is that restoring, unpacking, or transferring certain tables took a bit longer than usual, so the SST hit its idle timeout and the connection was killed before the SST thread could finish. Not sure if that’s how the SST process actually works, but bumping the timeout caused previously failing SSTs to pass.
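A minimal sketch of what that change might look like. The post only says "in our my.cnf", so the [sst] group used here is an assumption. The fragment is written to a scratch file rather than to the real /etc/my.cnf so it can be reviewed before being merged by hand.

```shell
#!/bin/sh
# Sketch only: write the proposed fragment to a scratch file, not /etc/my.cnf.
cnf=$(mktemp)

cat > "$cnf" <<'EOF'
[sst]
# Assumption: the option is read from the [sst] group.
# 3600 is the value reported in this thread; tune it to your dataset.
sst-idle-timeout=3600
EOF

echo "Proposed fragment (saved in $cnf):"
cat "$cnf"
```

The value has to be present on the joiner, since that is the side that logged the "Killing SST ... after stalling for 120 seconds" error.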

Frak · May 24, 2023, 10:12am

Hi,
I think we’ve experienced the same issue.
I’ve put the value in my.cnf and bounced the instance.
How can I check whether this value is actually in effect?
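One way to check, sketched under the assumption that the option lives in the [sst] group and is read by the SST script rather than by mysqld (in which case it would not be expected to appear in SHOW VARIABLES). The sst_option helper is hypothetical and only reads a single file; it does not follow !include directives.

```shell
#!/bin/sh
# Print the value of option $2 from the [sst] group of config file $1.
# Prints nothing if the option is absent from that group.
sst_option() {
    awk -v want="$2" '
        # A group header switches [sst] tracking on or off.
        /^[ \t]*\[/ { in_sst = ($0 ~ /^[ \t]*\[sst\][ \t]*$/); next }
        # Skip comment lines.
        /^[ \t]*[#;]/ { next }
        in_sst {
            eq = index($0, "=")
            if (eq == 0) next
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == want) found = val
        }
        END { if (found != "") print found }
    ' "$1"
}

# Demo against a scratch file; on a real node, pass /etc/my.cnf instead.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[mysqld]
wsrep_sst_method=xtrabackup-v2

[sst]
# raised from the default after stalled SSTs
sst-idle-timeout = 3600
EOF

echo "sst-idle-timeout in [sst]: $(sst_option "$cnf" sst-idle-timeout)"
```

If my_print_defaults is installed, `my_print_defaults --defaults-file=/etc/my.cnf sst` should list the same group as the MySQL option-file parser sees it. The practical test is still a full SST: the joiner should no longer be killed after a 120-second stall.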

lalit.choudhary · June 28, 2023, 7:44am

This issue is fixed in versions 5.7.38-31.59 and 8.0.28-19:
https://jira.percona.com/browse/PXC-3951
