SST Failure on XtraDB Cluster 5.6.28

I had a 3-node XtraDB Cluster running, configured based on the Percona 3-node tutorial. All nodes were running and synchronising correctly.

The node 2 host restarted unexpectedly on Friday, and node 2 then failed to rejoin the cluster. The joiner and donor logged the following:

JOINER


2016-10-28 14:45:09 7693 [Note] WSREP: save pc into disk
2016-10-28 14:45:09 7693 [Note] WSREP: gcomm: connected
2016-10-28 14:45:09 7693 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2016-10-28 14:45:09 7693 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2016-10-28 14:45:09 7693 [Note] WSREP: Opened channel 'cluster_name_removed'
2016-10-28 14:45:09 7693 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2016-10-28 14:45:09 7693 [Note] WSREP: Waiting for SST to complete.
2016-10-28 14:45:09 7693 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c
2016-10-28 14:45:09 7693 [Note] WSREP: STATE EXCHANGE: sent state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c
2016-10-28 14:45:09 7693 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 0 ()
2016-10-28 14:45:09 7693 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 1 ()
2016-10-28 14:45:09 7693 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 2 ()
2016-10-28 14:45:09 7693 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 20,
members = 2/3 (joined/total),
act_id = 165175,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = e63dad80-1822-11e6-a809-dfaf2ba1bb52
2016-10-28 14:45:09 7693 [Note] WSREP: Flow-control interval: [28, 28]
2016-10-28 14:45:09 7693 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 165175)
2016-10-28 14:45:09 7693 [Note] WSREP: State transfer required: 
Group state: e63dad80-1822-11e6-a809-dfaf2ba1bb52:165175
Local state: 00000000-0000-0000-0000-000000000000:-1
2016-10-28 14:45:09 7693 [Note] WSREP: New cluster view: global state: e63dad80-1822-11e6-a809-dfaf2ba1bb52:165175, view# 21: Primary, number of nodes: 3, my index: 0, protocol version 3
2016-10-28 14:45:09 7693 [Warning] WSREP: Gap in state sequence. Need state transfer.
2016-10-28 14:45:09 7693 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.1.1' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '7693' '' '
WSREP_SST: [INFO] Streaming with xbstream (20161028 14:45:09.529)
WSREP_SST: [INFO] Using socat as streamer (20161028 14:45:09.530)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20161028 14:45:09.535)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20161028 14:45:09.556)
2016-10-28 14:45:09 7693 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.1.1.1:4444/xtrabackup_sst//1
2016-10-28 14:45:09 7693 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-28 14:45:09 7693 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-10-28 14:45:09 7693 [Note] WSREP: Service thread queue flushed.
2016-10-28 14:45:09 7693 [Note] WSREP: Assign initial position for certification: 165175, protocol version: 3
2016-10-28 14:45:09 7693 [Note] WSREP: Service thread queue flushed.
2016-10-28 14:45:09 7693 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (e63dad80-1822-11e6-a809-dfaf2ba1bb52): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():489. IST will be unavailable.
2016-10-28 14:45:09 7693 [Note] WSREP: Member 0.0 () requested state transfer from '*any*'. Selected 1.0 ()(SYNCED) as donor.
2016-10-28 14:45:09 7693 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 165175)
2016-10-28 14:45:09 7693 [Note] WSREP: Requesting state transfer: success, donor: 1
2016-10-28 14:45:09 7693 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -32 (Broken pipe)
2016-10-28 14:45:09 7693 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
2016-10-28 14:45:09 7693 [Note] WSREP: gcomm: terminating thread
2016-10-28 14:45:09 7693 [Note] WSREP: gcomm: joining thread
2016-10-28 14:45:09 7693 [Note] WSREP: gcomm: closing backend
2016-10-28 14:45:09 7693 [Note] WSREP: view(view_id(NON_PRIM,1f7a5586,21) memb {
1f7a5586,0
} joined {
} left {
} partitioned {
4da282d1,0
b385bdf3,0
})
2016-10-28 14:45:09 7693 [Note] WSREP: view((empty))
2016-10-28 14:45:09 7693 [Note] WSREP: gcomm: closed
2016-10-28 14:45:09 7693 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted
161028 14:45:09 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

DONOR


2016-10-28 14:45:08 3080 [Note] WSREP: (b385bdf3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2016-10-28 14:45:09 3080 [Note] WSREP: declaring 1f7a5586 at tcp://10.1.1.2:4567 stable
2016-10-28 14:45:09 3080 [Note] WSREP: declaring 4da282d1 at tcp://10.1.1.3:4567 stable
2016-10-28 14:45:09 3080 [Note] WSREP: Node 4da282d1 state prim
2016-10-28 14:45:09 3080 [Note] WSREP: view(view_id(PRIM,1f7a5586,21) memb {
1f7a5586,0
4da282d1,0
b385bdf3,0
} joined {
} left {
} partitioned {
})
2016-10-28 14:45:09 3080 [Note] WSREP: save pc into disk
2016-10-28 14:45:09 3080 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: sent state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 0 ()
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 1 ()
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: got state msg: 1fc6d5a4-9d1d-11e6-af21-c22d66a9359c from 2 ()
2016-10-28 14:45:09 3080 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 20,
members = 2/3 (joined/total),
act_id = 165175,
last_appl. = 165141,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = e63dad80-1822-11e6-a809-dfaf2ba1bb52
2016-10-28 14:45:09 3080 [Note] WSREP: Flow-control interval: [28, 28]
2016-10-28 14:45:09 3080 [Note] WSREP: New cluster view: global state: e63dad80-1822-11e6-a809-dfaf2ba1bb52:165175, view# 21: Primary, number of nodes: 3, my index: 2, protocol version 3
2016-10-28 14:45:09 3080 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-28 14:45:09 3080 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-10-28 14:45:09 3080 [Note] WSREP: Service thread queue flushed.
2016-10-28 14:45:09 3080 [Note] WSREP: Assign initial position for certification: 165175, protocol version: 3
2016-10-28 14:45:09 3080 [Note] WSREP: Service thread queue flushed.
2016-10-28 14:45:09 3080 [Note] WSREP: Member 0.0 () requested state transfer from '*any*'. Selected 1.0 ()(SYNCED) as donor.
2016-10-28 14:45:09 3080 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -32 (Broken pipe)
2016-10-28 14:45:09 3080 [Note] WSREP: Member 1.0 () synced with group.
2016-10-28 14:45:09 3080 [Note] WSREP: declaring 4da282d1 at tcp://10.1.1.3:4567 stable
2016-10-28 14:45:09 3080 [Note] WSREP: forgetting 1f7a5586 (tcp://10.1.1.2:4567)
2016-10-28 14:45:09 3080 [Note] WSREP: Node 4da282d1 state prim
2016-10-28 14:45:09 3080 [Note] WSREP: view(view_id(PRIM,4da282d1,22) memb {
4da282d1,0
b385bdf3,0
} joined {
} left {
} partitioned {
1f7a5586,0
})
2016-10-28 14:45:09 3080 [Note] WSREP: save pc into disk
2016-10-28 14:45:09 3080 [Note] WSREP: forgetting 1f7a5586 (tcp://10.1.1.2:4567)
2016-10-28 14:45:09 3080 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: sent state msg: 200bc7c5-9d1d-11e6-a78e-4e185c10db21
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: got state msg: 200bc7c5-9d1d-11e6-a78e-4e185c10db21 from 0 ()
2016-10-28 14:45:09 3080 [Note] WSREP: STATE EXCHANGE: got state msg: 200bc7c5-9d1d-11e6-a78e-4e185c10db21 from 1 ()
2016-10-28 14:45:09 3080 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 21,
members = 2/2 (joined/total),
act_id = 165175,
last_appl. = 165141,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = e63dad80-1822-11e6-a809-dfaf2ba1bb52
2016-10-28 14:45:09 3080 [Note] WSREP: Flow-control interval: [23, 23]
2016-10-28 14:45:09 3080 [Note] WSREP: New cluster view: global state: e63dad80-1822-11e6-a809-dfaf2ba1bb52:165175, view# 22: Primary, number of nodes: 2, my index: 1, protocol version 3
2016-10-28 14:45:09 3080 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-28 14:45:09 3080 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-10-28 14:45:09 3080 [Note] WSREP: Service thread queue flushed.
2016-10-28 14:45:09 3080 [Note] WSREP: Assign initial position for certification: 165175, protocol version: 3
2016-10-28 14:45:09 3080 [Note] WSREP: Service thread queue flushed.

I couldn't find this scenario described anywhere via Google. There doesn't seem to be an obvious root-cause error in either the donor or the joiner log; the only failure I can see is the state transfer failing with -32 (Broken pipe).
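
To pick the warnings and errors out of long logs like the two above, a minimal sketch in Python could filter on the severity tags. The function name `problem_lines` is hypothetical, and the sketch only assumes the `[Warning]` / `[ERROR]` tag format visible in the pasted output:

```python
import re

# Matches the severity tags seen in the logs above, e.g. "[Warning]" or "[ERROR]".
SEVERITY = re.compile(r"\[(Warning|ERROR)\]")

def problem_lines(log_text):
    """Return only the log lines tagged as a warning or an error."""
    return [line for line in log_text.splitlines() if SEVERITY.search(line)]
```

Running it over the joiner log would leave only the lines tagged `[Warning]` or `[ERROR]`, which makes it easier to compare what each node reported.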

The firewall is completely open between the two boxes, as they're on a private network and use the Ubuntu defaults.

The sstuser was created and updated in my.cnf as per the Percona instructions for creating a 3-node cluster.

I don’t know what I can try.

Can anybody help?

Not sure if you got this working, but it looks like a network connectivity issue.

Here is a related blog post about it:
https://www.percona.com/blog/2014/12/30/diagnosing-sst-errors-with-percona-xtradb-cluster-for-mysql/
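
If you want to rule connectivity in or out, here is a minimal sketch in Python of a TCP reachability check between the nodes. It is my own illustration, not something from the blog post, and the function names are hypothetical. Ports 4444 (SST) and 4567 (group communication) appear in your logs; 4568 is the Galera default for IST and is an assumption about your setup.

```python
import socket

# 4444 and 4567 are taken from the logs in this thread;
# 4568 (default IST port) is an assumption about the configuration.
GALERA_PORTS = (4444, 4567, 4568)

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def check_node(host, ports=GALERA_PORTS):
    """Map each port to True/False depending on whether it accepts connections."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `check_node("10.1.1.1")` run from the donor would test the joiner address shown in the SST request. Keep in mind that, going by the `socat -u TCP-LISTEN:4444` line in the joiner log, port 4444 is only listening while the joiner's SST script is running, so an "unreachable" result for it at other times does not prove a firewall problem.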