3 member cluster, 1 node keeps having to rejoin and IST

We have been running a 3 node cluster in production for about 2 months now, upgraded from 5.5 to 5.6 latest and Galera 3x with out issues. We use the third node (call it C) as an async master to a 5.6 slave running in our offsite location over an ssh tunnel. This node (C) does not serve any traffic. Also, we send all writes to Node A, and use Node A and B for reads.

The problem that we are seeing is that the third node, in this case C always seems to disconnect from the cluster, and rejoins performing an IST. I have no idea why this is happening but it seems also to corrupt the async slave when it happens.

Here is the full transaction of the cluster member C, leaving and joining the cluster. We have reinstalled this member on new hardware, renamed it, no matter what it always disconnects. The other members are humming along just fine. It seems to be limited to the third joining node.

ERROR LOG - attachments don’t work 2014-05-18 14:31:43 3167 [Note] WSREP: (c008a8e1-de1c-11e3-913f-d6d7b3e7ba76, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://XXX.XXX:4567
2014-05-18 14:31:43 3167 [Note] WSREP: (c008a8e1-de1c-11e3-913f-d6d7b3e7ba76, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2014-05-18 14:31:43 3167 [Warning] WSREP: last inactive check more than PT1.5S ago (PT5.55984S), skipping check
2014-05-18 14:31:44 3167 [Note] WSREP: view(view_id(NON_PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,376) memb {
c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
} joined {
} left {
} partitioned {
1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
d62ce8fe-dad9-11e3-a835-d29c74758152,0
})
2014-05-18 14:31:44 3167 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-05-18 14:31:44 3167 [Note] WSREP: Flow-control interval: [500, 500]
2014-05-18 14:31:44 3167 [Note] WSREP: Received NON-PRIMARY.
2014-05-18 14:31:44 3167 [Note] WSREP: Shifting SYNCED → OPEN (TO: 58520762)
2014-05-18 14:31:44 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-05-18 14:31:44 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-18 14:31:44 3167 [Note] WSREP: view(view_id(NON_PRIM,c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,377) memb {
c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
} joined {
} left {
} partitioned {
1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
d62ce8fe-dad9-11e3-a835-d29c74758152,0
})
2014-05-18 14:31:44 3167 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-05-18 14:31:44 3167 [Note] WSREP: Flow-control interval: [500, 500]
2014-05-18 14:31:44 3167 [Note] WSREP: Received NON-PRIMARY.
2014-05-18 14:31:44 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-05-18 14:31:44 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-18 14:31:45 3167 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
2014-05-18 14:31:45 3167 [Note] WSREP: declaring d62ce8fe-dad9-11e3-a835-d29c74758152 stable
2014-05-18 14:31:46 3167 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
2014-05-18 14:31:46 3167 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,378) memb {
1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
d62ce8fe-dad9-11e3-a835-d29c74758152,0
} joined {
} left {
} partitioned {
})
2014-05-18 14:31:46 3167 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: sent state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683
2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 0 (pxc-01)
2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 1 (pxc-04)
2014-05-18 14:31:46 3167 [Note] WSREP: STATE EXCHANGE: got state msg: 23c6bd5c-de99-11e3-9cd7-bb91f8f93683 from 2 (pxc-02)
2014-05-18 14:31:46 3167 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 295,
members = 2/3 (joined/total),
act_id = 58520767,
last_appl. = 58520707,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
2014-05-18 14:31:46 3167 [Note] WSREP: Flow-control interval: [500, 500]
2014-05-18 14:31:46 3167 [Note] WSREP: Shifting OPEN → PRIMARY (TO: 58520767)
2014-05-18 14:31:46 3167 [Note] WSREP: State transfer required:
Group state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767
Local state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520762
2014-05-18 14:31:46 3167 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767, view# 296: Primary, number of nodes: 3, my index: 1, protocol version 2
2014-05-18 14:31:46 3167 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-18 14:31:48 3167 [Note] WSREP: You have configured ‘xtrabackup’ state snapshot transfer method which cannot be performed on a running server. Wsrep provider won’t be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2014-05-18 14:31:48 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-18 14:31:48 3167 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-18 14:31:48 3167 [Note] WSREP: Service thread queue flushed.
2014-05-18 14:31:48 3167 [Note] WSREP: Assign initial position for certification: 58520767, protocol version: 3
2014-05-18 14:31:48 3167 [Note] WSREP: Service thread queue flushed.
2014-05-18 14:31:48 3167 [Warning] WSREP: Releasing seqno 58520762 before 58520763 was assigned.
2014-05-18 14:31:48 3167 [Note] WSREP: Prepared IST receiver, listening at: tcp://XXX.XXX:4568
2014-05-18 14:31:48 3167 [Note] WSREP: Member 1.0 (pxc-04) requested state transfer from ‘pxc-02’. Selected 2.0 (pxc-02)(SYNCED) as donor.
2014-05-18 14:31:48 3167 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 58520784)
2014-05-18 14:31:48 3167 [Note] WSREP: Requesting state transfer: success, donor: 2
2014-05-18 14:31:48 3167 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 0 bytes
2014-05-18 14:31:48 3167 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/17 locked buffers
2014-05-18 14:31:48 3167 [Note] WSREP: Receiving IST: 5 writesets, seqnos 58520762-58520767
2014-05-18 14:31:48 3167 [Note] WSREP: IST received: 95890958-8d33-11e3-aea7-ff8c6255f30b:58520767
2014-05-18 14:31:48 3167 [Note] WSREP: 1.0 (pxc-04): State transfer from 2.0 (pxc-02) complete.
2014-05-18 14:31:48 3167 [Note] WSREP: Shifting JOINER → JOINED (TO: 58520784)
2014-05-18 14:31:48 3167 [Note] WSREP: 2.0 (pxc-02): State transfer to 1.0 (pxc-04) complete.
2014-05-18 14:31:48 3167 [Note] WSREP: Member 1.0 (pxc-04) synced with group.
2014-05-18 14:31:48 3167 [Note] WSREP: Shifting JOINED → SYNCED (TO: 58520784)
2014-05-18 14:31:48 3167 [Note] WSREP: Member 2.0 (pxc-02) synced with group.
2014-05-18 14:31:48 3167 [Note] WSREP: Synchronized with group, ready for connections
2014-05-18 14:31:48 3167 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

This is my.cnf on member C
[mysql]

port = 3306
socket = /var/run/mysqld/mysqld.sock

[client]

user = redacted
password = redacted

[mysqld_safe]

[mysqld]
bind-address = 192.168.xxx.xxx
port = 3306

CLUSTER

server_id = redacted
binlog_format = ROW
log_bin = mysql-bin
wsrep_cluster_address = gcomm://192.168.xxx.xxx,192.168.xxx.xxx

wsrep_cluster_address = gcomm:// # do this for the first one only

wsrep_sst_donor = pxc-02
wsrep_node_address = 192.168.xxx.xxx
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = “gcache.size=3G; gcs.fc_limit=500; gcs.fc_master_slave=YES; gcs.fc_factor=1.0”
wsrep_slave_threads = 16
wsrep_cluster_name = REDACTED
wsrep_sst_method = xtrabackup
wsrep_sst_auth = redacted:redacted
wsrep_node_name = pxc-04
log_slave_updates
innodb_locks_unsafe_for_binlog = 1
innodb_autoinc_lock_mode = 2

GENERAL

user = mysql
default-storage-engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysql.pid

MyISAM

key-buffer-size = 32M
myisam-recover = FORCE,BACKUP

SAFETY

max-allowed-packet = 16M
max-connect-errors = 1000000
skip-name-resolve
sql-mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE
sysdate-is-now = 1
#innodb = FORCE

DATA STORAGE

datadir = /data/mysql/

BINARY LOGGING

log-bin = /data/mysql/mysql-bin
expire-logs-days = 10
sync-binlog = 0
gtid_mode = on #Global Transaction ID for async slaves
enforce_gtid_consistency = 1

CACHES AND LIMITS

tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 1000
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 4096
table-open-cache = 4096

INNODB

innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1
innodb-buffer-pool-size = 26G

LOGGING

log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log

This is a corresponding connection from the donor side of the transaction.
2014-05-18 19:36:45 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.X.X:4567
2014-05-18 19:36:46 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, ‘tcp://0.0.0.0:4567’) reconnecting to c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.X.X:4567), attempt 0
2014-05-18 19:36:48 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2014-05-18 19:36:48 2897 [Note] WSREP: evs::proto(d62ce8fe-dad9-11e3-a835-d29c74758152, GATHER, view_id(REG,1469ea44-daff-11e3-97a5-8bd469d0b8fc,394)) suspecting node: c008a8e1-de1c-11e3-913f-d6d7b3e7ba76
2014-05-18 19:36:48 2897 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
2014-05-18 19:36:48 2897 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
2014-05-18 19:36:48 2897 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,395) memb {
1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
d62ce8fe-dad9-11e3-a835-d29c74758152,0
} joined {
} left {
} partitioned {
c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
})
2014-05-18 19:36:48 2897 [Note] WSREP: forgetting c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
2014-05-18 19:36:48 2897 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: sent state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98
2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98 from 0 (pxc-01)
2014-05-18 19:36:48 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c0c6d62a-dec3-11e3-8037-5f9383edae98 from 1 (pxc-02)
2014-05-18 19:36:48 2897 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 307,
members = 2/2 (joined/total),
act_id = 58625222,
last_appl. = 58625196,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
2014-05-18 19:36:48 2897 [Note] WSREP: Flow-control interval: [23, 23]
2014-05-18 19:36:48 2897 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625222, view# 308: Primary, number of nodes: 2, my index: 1, protocol version 2
2014-05-18 19:36:48 2897 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-18 19:36:48 2897 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-18 19:36:48 2897 [Note] WSREP: Assign initial position for certification: 58625222, protocol version: 3
2014-05-18 19:36:48 2897 [Note] WSREP: Service thread queue flushed.
2014-05-18 19:36:50 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
2014-05-18 19:36:51 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
2014-05-18 19:36:52 2897 [Warning] WSREP: discarding established (time wait) c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
2014-05-18 19:36:53 2897 [Note] WSREP: cleaning up c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 (tcp://192.168.31.58:4567)
2014-05-18 19:36:54 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.31.58:4567
2014-05-18 19:36:54 2897 [Note] WSREP: (d62ce8fe-dad9-11e3-a835-d29c74758152, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2014-05-18 19:36:55 2897 [Note] WSREP: declaring 1469ea44-daff-11e3-97a5-8bd469d0b8fc stable
2014-05-18 19:36:55 2897 [Note] WSREP: declaring c008a8e1-de1c-11e3-913f-d6d7b3e7ba76 stable
2014-05-18 19:36:55 2897 [Note] WSREP: Node 1469ea44-daff-11e3-97a5-8bd469d0b8fc state prim
2014-05-18 19:36:55 2897 [Note] WSREP: view(view_id(PRIM,1469ea44-daff-11e3-97a5-8bd469d0b8fc,396) memb {
1469ea44-daff-11e3-97a5-8bd469d0b8fc,0
c008a8e1-de1c-11e3-913f-d6d7b3e7ba76,0
d62ce8fe-dad9-11e3-a835-d29c74758152,0
} joined {
} left {
} partitioned {
})
2014-05-18 19:36:55 2897 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: sent state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405
2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 0 (pxc-01)
2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 1 (pxc-04)
2014-05-18 19:36:55 2897 [Note] WSREP: STATE EXCHANGE: got state msg: c4e9cfd1-dec3-11e3-a389-3b18183d7405 from 2 (pxc-02)
2014-05-18 19:36:55 2897 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 308,
members = 2/3 (joined/total),
act_id = 58625267,
last_appl. = 58625196,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = 95890958-8d33-11e3-aea7-ff8c6255f30b
2014-05-18 19:36:55 2897 [Note] WSREP: Flow-control interval: [28, 28]
2014-05-18 19:36:55 2897 [Note] WSREP: New cluster view: global state: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625267, view# 309: Primary, number of nodes: 3, my index: 2, protocol version 2
2014-05-18 19:36:55 2897 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-18 19:36:55 2897 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-18 19:36:55 2897 [Note] WSREP: Assign initial position for certification: 58625267, protocol version: 3
2014-05-18 19:36:55 2897 [Note] WSREP: Service thread queue flushed.
2014-05-18 19:36:57 2897 [Note] WSREP: Node 1.0 (pxc-04) requested state transfer from ‘pxc-02’. Selected 2.0 (pxc-02)(SYNCED) as donor.
2014-05-18 19:36:57 2897 [Note] WSREP: Shifting SYNCED → DONOR/DESYNCED (TO: 58625283)
2014-05-18 19:36:57 2897 [Note] WSREP: IST request: 95890958-8d33-11e3-aea7-ff8c6255f30b:58625215-58625267|tcp://192.168.X.X:4568
2014-05-18 19:36:57 2897 [Note] WSREP: async IST sender starting to serve tcp://192.168.X.X:4568 sending 58625216-58625267
2014-05-18 19:36:57 2897 [Note] WSREP: async IST sender served
2014-05-18 19:36:57 2897 [Note] WSREP: 2.0 (pxc-02): State transfer to 1.0 (pxc-04) complete.
2014-05-18 19:36:57 2897 [Note] WSREP: Shifting DONOR/DESYNCED → JOINED (TO: 58625283)
2014-05-18 19:36:57 2897 [Note] WSREP: Member 2 (pxc-02) synced with group.
2014-05-18 19:36:57 2897 [Note] WSREP: Shifting JOINED → SYNCED (TO: 58625283)
2014-05-18 19:36:57 2897 [Note] WSREP: Synchronized with group, ready for connections
2014-05-18 19:36:57 2897 [Note] WSREP: 1.0 (pxc-04): State transfer from 2.0 (pxc-02) complete.
2014-05-18 19:36:57 2897 [Note] WSREP: Member 1 (pxc-04) synced with group.

The slave end of failing:
Last_IO_Errno: 1593
Last_IO_Error: The slave I/O thread stops because a fatal error is encountered when it tried to SELECT @master_binlog_checksum. Error: Unknown command