SST dies after a 120-second stall after transferring 300+ GB, on AWS only

Hey folks, we're running into a weird problem in one of our AWS DCs: SST transfers work for a while, then after 300+ GB of transfer they suddenly stall for 120+ seconds and time out. We are sort of working around this using ncat and scp, but without working SSTs that's a lot of stuff to maintain. Whenever we try an SST between nodes, we keep getting this error:
"[ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds"

Basic setup:

  • Docker containers on EC2 instances
  • As far as we know, the security groups are set up so that the SST/Galera communication ports work without issue, especially between nodes in the cluster.
  • Using Percona versions 5.7.35-38-57 and 5.7.36-39-57; both fail.

We lowered net.ipv4.tcp_keepalive_time to 60, thinking this might help, and it seems to have: the transfer now gets to ~800 GB instead of ~300 GB before failing. Sadly, we have 1.7 TB of data to transfer, so we're still falling short of the goal. This seems to happen only to our clusters in our us-east-2 DC in AWS. Our co-los and other AWS regions seem relatively unaffected (though the databases in our other AWS regions are not as large).
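For anyone checking the same setting, here is a rough Python sketch of what that sysctl change means in practice. It reads the three Linux keepalive sysctls and computes how long an idle connection takes to be declared dead: the idle time before the first probe plus interval times probe count. The helper names are made up for illustration, and the fallback values are the stock Linux defaults (7200 / 75 / 9), used only if /proc cannot be read. Keepalive applies only to sockets that enable SO_KEEPALIVE; whether the socat process in the SST pipeline does so is an assumption that nothing in this post confirms.

```python
from pathlib import Path

# Stock Linux defaults, used only as a fallback when /proc is not readable.
LINUX_DEFAULTS = {
    "tcp_keepalive_time": 7200,   # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,    # seconds between unanswered probes
    "tcp_keepalive_probes": 9,    # unanswered probes before the drop
}


def read_keepalive(proc_dir: str = "/proc/sys/net/ipv4") -> dict:
    """Read the keepalive sysctls, falling back to the Linux defaults."""
    values = {}
    for name, default in LINUX_DEFAULTS.items():
        try:
            values[name] = int((Path(proc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = default
    return values


def dead_peer_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Idle time until a silent peer is declared dead on a keepalive socket."""
    return time_s + intvl_s * probes


if __name__ == "__main__":
    current = read_keepalive()
    print("current settings:", current)
    print(
        "idle seconds until a dead peer is detected:",
        dead_peer_detection_seconds(
            current["tcp_keepalive_time"],
            current["tcp_keepalive_intvl"],
            current["tcp_keepalive_probes"],
        ),
    )
```

With tcp_keepalive_time=60 and the other two values left at their defaults, the first probe goes out after 60 idle seconds and a silent peer is dropped after 60 + 75 * 9 = 735 seconds; with the default 7200 it would be 7875 seconds. Run it inside the container as well as on the host, since the values seen there may differ.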

Here’s the log from the receiver side (in a docker container):

2022-05-11T17:16:01.120928Z 0 [Note] WSREP: (052bf080, 'ssl://0.0.0.0:4567') turning message relay requesting off
2022-05-11T19:17:45.796316Z WSREP_SST: [ERROR] Killing SST (2319) with SIGKILL after stalling for 120 seconds
/usr/bin/wsrep_sst_xtrabackup-v2: line 195:  2321 Killed                  socat -u openssl-listen:4444,reuseaddr,cert=/var/lib/mysql/certs/cert.pem,key=/var/lib/mysql/certs/key.pem,cafile=/var/lib/mysql/certs/ca.pem,verify=1,retry=30 stdio
      2322                       | lz4 -d
      2323                       | xbstream -x $xbstream_eopts
2022-05-11T19:17:45.801127Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2022-05-11T19:17:45.802282Z WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 137 137 137
2022-05-11T19:17:45.803401Z WSREP_SST: [ERROR] ******************************************************
2022-05-11T19:17:45.804955Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2022-05-11T19:17:45.810291Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.66.68.8:4444' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1708' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' : 32 (Broken pipe)
2022-05-11T19:17:45.810332Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2022-05-11T19:17:45.810343Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2022-05-11T19:17:45.810386Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)
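As a reading aid for the numbers in these logs, here is a small Python sketch that decodes them. It assumes the usual shell convention that an exit code above 128 means "killed by signal (code - 128)", and it treats the 32 and 22 statuses as errno values, which is how the mysqld lines render them ("Broken pipe", "Invalid argument"). The function names are illustrative only and are not part of any Percona tooling.

```python
import errno
import os
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code; >128 means killed by a signal."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


def describe_errno(code: int) -> str:
    """Map an errno number to its symbolic name and system message."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # "exit codes: 137 137 137" from the joiner log
    for code in (137, 137, 137):
        print(code, "->", describe_exit_code(code))
    # Script statuses reported by mysqld on the joiner (32) and donor (22)
    for code in (32, 22):
        print(code, "->", describe_errno(code))
```

On Linux, 137 decodes to SIGKILL (128 + 9), which matches the "Killing SST (2319) with SIGKILL" line: all three stages of the socat | lz4 | xbstream pipeline were killed by the SST script's stall handling instead of failing on their own. 32 and 22 decode to EPIPE and EINVAL respectively.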

From the sender it's pretty uninteresting; it just shows that the transfer timed out and failed:

xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pip

and

2022-05-11T19:17:46.861872Z WSREP_SST: [ERROR] ******************************************************
2022-05-11T19:17:46.863480Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2022-05-11T19:17:46.872298Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.1:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos' : 22 (Invalid argument)
2022-05-11T19:17:46.872363Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.1.1:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/data/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.35-38-57'  --binlog '/var/lib/mysql/binlog/mysql-bin' --gtid 'gtid:pos'

(I've replaced the IP and GTID values with placeholder strings; the real values are valid.)
