Randomly IST fail

Hello,

first, thank you for your excellent products like xtrabackup, server and cluster (and so on…)

Very well, we have 5-node cluster, I graceful stopped 1st node ad the others still writing, and when I start the stopped node sometimes I have this message:

2015-05-20 12:45:38 29587 [Note] WSREP: New cluster view: global state: f7dc1bbb-f8b8-11e4-8172-3f0c0cf99383:165132, view# 59: Primary, number of nodes: 5, my index: 1, protocol version 3
2015-05-20 12:45:38 29587 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-05-20 12:45:38 29587 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role ‘joiner’ --address ‘10.1.3.221’ --auth ‘sstuser:despegar#mysql’ --datadir ‘/mysql/data/’ --defaults-file ‘/etc/my.cnf’ --parent ‘29587’ ‘’ ’
.WSREP_SST: [INFO] Streaming with xbstream (20150520 12:45:39.638)
WSREP_SST: [INFO] Using socat as streamer (20150520 12:45:39.644)
2015-05-20 12:45:39 29587 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.1.3.221:4444/xtrabackup_sst//1
WSREP_SST: [INFO] Evaluating timeout -s9 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150520 12:45:39.731)
2015/05/20 12:45:39 socat[29858] E bind(3, {AF=2 0.0.0.0:4444}, 16): Address already in use
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 0 (20150520 12:45:39.748)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20150520 12:45:39.754)
2015-05-20 12:45:39 29587 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role ‘joiner’ --address ‘10.1.3.221’ --auth ‘sstuser:despegar#mysql’ --datadir ‘/mysql/data/’ --defaults-file ‘/etc/my.cnf’ --parent ‘29587’ ‘’ : 32 (Broken pipe)
2015-05-20 12:45:39 29587 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-05-20 12:45:39 29587 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2015-05-20 12:45:39 29587 [ERROR] Aborting

2015-05-20 12:45:39 29587 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-05-20 12:45:39 29587 [Note] WSREP: Service thread queue flushed.
2015-05-20 12:45:39 29587 [Note] WSREP: Assign initial position for certification: 165132, protocol version: 3
2015-05-20 12:45:39 29587 [Note] WSREP: Service thread queue flushed.
2015-05-20 12:45:39 29587 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.1.3.221:4568
2015-05-20 12:45:39 29587 [Note] WSREP: Member 1.0 (xtradb-cloudia-01) requested state transfer from ‘any’. Selected 4.0 (xtradb-cloudia-04)(SYNCED) as donor.
2015-05-20 12:45:39 29587 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 165132)
2015-05-20 12:45:39 29587 [Note] WSREP: Requesting state transfer: success, donor: 4
State transfer in progress, setting sleep higher
.xbstream: Can’t create/write to file ‘./xtrabackup_galera_info’ (Errcode: 2 - No such file or directory)
xbstream: failed to create file.

Any idea?
Need any other data?
Thank you so much

Interest thing, when I retry to start this node, PXC do a SST.
And they’re not a lot of writes in order to do that.

It’s seems to be socat hang or take port 4444 after successful IST and when you try to stop/start MySQL again, port is taken.
A workaround is “killall -9 socat”
But I don’t like it.
This is a Galera or Percona bug?
Thanks.

Post your my.cnf, and if you still have it, post contents of innobackup.* logs on donor and joiner node

Here is my.cnf

[mysqld_safe]
pid-file = /home/despegar/mysql/mysql.pid

[mysqld]
user = despegar
port = 33033
socket = /home/despegar/mysql/mysql.sock
back_log = 50
datadir = /home/despegar/mysql/data
log-error = /home/despegar/mysql/mysql.err

max_connections = 10000
max_connect_errors = 99999
max_allowed_packet = 16M
skip-host-cache
skip-name-resolve

explicit_defaults_for_timestamp = 1
performance_schema = off

transaction_isolation = READ-COMMITTED
log-bin = /home/despegar/mysql/binlog/mysql-bin
slow_query_log = 1
log_output = TABLE
long_query_time = 5
event_scheduler = ON

*** INNODB Specific options ***

innodb_buffer_pool_size = 3G
innodb_data_file_path = ibdata1:256M:autoextend
innodb_thread_concurrency = 16
innodb_log_file_size = 256M
innodb_log_files_in_group = 3
innodb_file_per_table = ON
innodb_undo_tablespaces = 3
innodb_undo_directory = /home/despegar/mysql/undo

*** Galera Cluster Settings ***

binlog_format = ROW
default-storage-engine = innodb
innodb_autoinc_lock_mode = 2
log_slave_updates = ON
query_cache_size = 0
query_cache_type = 0

wsrep Provider Settings

wsrep_provider = /usr/lib64/galera3/libgalera_smm.so
wsrep_provider_options = “gcache.size=128M;gcache.page_size=128M”
wsrep_cluster_address = “gcomm://xtradblab00,xtradblab01,xtradblab02,xtradblab03”
wsrep_cluster_name = “xtradb-lab”
wsrep_node_address = “xtradblab00”
wsrep_node_name = “xtradblab00”
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = “sstuser:despegar#mysql”
wsrep_node_incoming_address = “10.70.128.203:33033”
wsrep_sst_receive_address = “10.70.128.203”
wsrep_slave_threads = 8

innobackup.log not been created just for this reason (port in use)