SST dies after 120-second stall after transferring 300+ GB, on AWS only

Hey folks, we're running into a weird problem in one of our AWS DCs: SST transfers work for a while, then after 300+ GB of transfer they suddenly stall for 120+ seconds and time out. We are sort of working around this using ncat and scp, but without working SSTs that's a lot of stuff to maintain. Whenever we try an SST between nodes, we keep getting this error:
“[ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds”

Basic setup:

  • Docker containers on EC2 instances
  • As far as we know, the security groups are set up so that the SST/Galera communication ports work without issue, especially between nodes in the cluster.
  • Using Percona versions 5.7.35-38-57 and 5.7.36-39-57; both fail.

We lowered net.ipv4.tcp_keepalive_time to 60, thinking this might help, and it seems to have taken the SST from ~300 GB transferred to ~800 GB before failing. Sadly we have 1.7 TB of data to transfer, so we're still falling short of the goal. This seems to happen only to our clusters in our us-east-2 DC in AWS. Co-los and other AWS regions seem relatively unaffected (though our databases in the other AWS regions are not as large).
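
For anyone wanting to try the same keepalive change, here is a minimal sketch of how it could be persisted in sysctl.conf style. The value 60 is the one from this post; where it has to be applied (the host or the container's network namespace) depends on the Docker networking setup, so treat the placement as an assumption.

```ini
# Sketch only: send the first TCP keepalive probe after 60 seconds of idle
# time instead of the Linux default of 7200 seconds.
# Assumption: applied wherever the SST sockets actually live (host or container).
net.ipv4.tcp_keepalive_time = 60
```

Note that this only delayed the failure for us; it did not fix it.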

Here’s the log from the receiver side (in a docker container):

2022-05-11T17:16:01.120928Z 0 [Note] WSREP: (052bf080, 'ssl://') turning message relay requesting off
	2022-05-11T19:17:45.796316Z WSREP_SST: [ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds
/usr/bin/wsrep_sst_xtrabackup-v2: line 195:  2321 Killed                  socat -u openssl-listen:4444,reuseaddr,cert=/var/lib/mysql/certs/cert.pem,key=/var/lib/mysql/certs/key.pem,cafile=/var/lib/mysql/certs/ca.pem,verify=1,retry=30 stdio
      2322                       | lz4 -d
      2323                       | xbstream -x $xbstream_eopts
	2022-05-11T19:17:45.801127Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
	2022-05-11T19:17:45.802282Z WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 137 137 137
	2022-05-11T19:17:45.803401Z WSREP_SST: [ERROR] ******************************************************
2022-05-11T19:17:45.804955Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2022-05-11T19:17:45.810291Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1708' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' : 32 (Broken pipe)
2022-05-11T19:17:45.810332Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2022-05-11T19:17:45.810343Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2022-05-11T19:17:45.810386Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)

The sender side is pretty uninteresting; it just shows that the transfer timed out and failed:

xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pip


  2022-05-11T19:17:46.861872Z WSREP_SST: [ERROR] ******************************************************
        2022-05-11T19:17:46.863480Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2022-05-11T19:17:46.872298Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos' : 22 (Invalid argument)
2022-05-11T19:17:46.872363Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos'

(I've replaced the IP and GTID values with placeholder strings; the real values were valid.)


Just in case anyone runs into this: we solved it by setting the sst-idle-timeout variable in our my.cnf to 3600 seconds. Our guess is that certain tables took longer to transfer, unpack, or restore, so the joiner saw no incoming data for more than the 120 seconds reported in the log, and the SST script killed the transfer as stalled. Not sure if that's how the SST process actually works, but bumping the value made previously failing SSTs pass.
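
A minimal sketch of the change, assuming the option goes in the [sst] group of my.cnf (where the xtrabackup SST script reads its settings, as far as I know); check the group name against the docs for your exact PXC version.

```ini
# Sketch only. Assumption: sst-idle-timeout is read from the [sst] group.
[sst]
# Seconds the joiner tolerates receiving no data before the SST is killed.
# The log above shows it was 120 before the change; 3600 is what worked for us.
sst-idle-timeout=3600
```

Since it was the receiver (joiner) log that showed the kill, the setting presumably needs to be in place on the joining node before the SST starts.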
