Replication + Galera = Timeout?

I am having an issue with Galera that I’m attempting to track down. I have set up a three-node cluster, with the ‘first’ node receiving replication data from an existing, remote SQL server.

Things run okay for a few days, and then, seemingly at random, the other nodes apparently lose their connection to the ‘first’ node and MySQL crashes completely (as in, it’s no longer running):

120726 16:10:35 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.163:4567
120726 16:10:36 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) reconnecting to df4e387f-d4e2-11e1-0800-2e6080299165 (tcp://172.30.0.163:4567), attempt 0
120726 16:10:37 [Note] WSREP: remote endpoint tcp://172.30.0.163:4567 changed identity df4e387f-d4e2-11e1-0800-2e6080299165 → 1bc3616b-d777-11e1-0800-4a369f7eed28
120726 16:10:37 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
120726 16:11:08 [Note] WSREP: evs::proto(0b066c90-d4fb-11e1-0800-86a2cf43aaf6, GATHER, view_id(REG,0b066c90-d4fb-11e1-0800-86a2cf43aaf6,11)) suspecting node: df4e387f-d4e2-11e1-0800-2e6080299165
120726 16:11:08 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.163:4567

… and a little later:

120726 16:11:38 [Warning] WSREP: Failed to report last committed 16744356, -107 (Transport endpoint is not connected)
120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.
120726 16:11:38 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
120726 16:11:38 [Note] WSREP: Flow-control interval: [12, 23]
120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
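
For anyone trying to follow the view changes in a longer log, here is a rough Python sketch that pulls the "New cluster view" lines out of an error log and reports the status and node count at each point. The regular expression is only an assumption based on the line format shown in the excerpt above (it may not match other Galera versions), and the function name parse_views is made up for illustration.

```python
import re
from datetime import datetime

# Pattern modelled on the "New cluster view" lines in the excerpt above.
# This is an assumption about the format, not an official log grammar.
VIEW_RE = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) \[Note\] WSREP: New cluster view: "
    r"global state: ([0-9a-f-]+):(-?\d+), view# (-?\d+): ([\w-]+), "
    r"number of nodes: (\d+), my index: (-?\d+)"
)


def parse_views(lines):
    """Return one dict per 'New cluster view' line, in file order."""
    views = []
    for line in lines:
        match = VIEW_RE.match(line.strip())
        if not match:
            continue  # not a cluster view line
        ts, _uuid, seqno, view_id, status, nodes, _idx = match.groups()
        views.append({
            # Timestamps in this log are YYMMDD HH:MM:SS.
            "time": datetime.strptime(ts, "%y%m%d %H:%M:%S"),
            "seqno": int(seqno),
            "view_id": int(view_id),
            "status": status,
            "nodes": int(nodes),
        })
    return views


# Two lines copied from the excerpt above.
SAMPLE = [
    "120726 16:11:38 [Note] WSREP: New cluster view: global state: "
    "97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, "
    "number of nodes: 1, my index: 0, protocol version 2",
    "120726 16:11:38 [Note] WSREP: New cluster view: global state: "
    "97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, "
    "number of nodes: 2, my index: 0, protocol version 2",
]

if __name__ == "__main__":
    for view in parse_views(SAMPLE):
        print(view["time"], view["status"], "nodes:", view["nodes"])
```

Feeding it a whole log file (for example, the lines of an opened file object) should give a compact history of when the node dropped to non-Primary and how many members it could see.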

On the node that hard-crashes:

120726 16:10:34 mysqld_safe Number of processes running now: 0
120726 16:10:34 mysqld_safe mysqld restarted
120726 16:10:35 [Note] Flashcache bypass: disabled
120726 16:10:35 [Note] Flashcache setup error is : ioctl failed

120726 16:10:35 [Warning] You need to use --log-bin to make --log-slave-updates work.
120726 16:10:35 [Note] WSREP: Read nil XID from storage engines, skipping position init
120726 16:10:35 [Note] WSREP: wsrep_load(): loading provider library ‘/usr/lib64/libgalera_smm.so’
120726 16:10:36 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
120726 16:10:36 [Note] WSREP: Found saved state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:-1
120726 16:10:36 [Note] WSREP: Reusing existing ‘/var/lib/mysql//galera.cache’.
120726 16:10:36 [Note] WSREP: Passing config to GCS: base_host = 172.30.0.163; evs.consensus_timeout = PT1M; evs.inactive_check_period = PT10S; evs.inactive_timeout = PT1M; evs.keepalive_period = PT3S; evs.send_window = 1024; evs.suspect_timeout = PT30S; evs.user_send_window = 512; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 1G; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120726 16:10:36 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
120726 16:10:36 [Note] WSREP: wsrep_sst_grab()
120726 16:10:36 [Note] WSREP: Start replication
120726 16:10:36 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:4567
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 1
120726 16:10:36 [Note] WSREP: EVS version 0
120726 16:10:36 [Note] WSREP: PC version 0
120726 16:10:36 [Note] WSREP: gcomm: connecting to group ‘galeraprimary’, peer ‘galera1.torreycommerce.net:4567
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.154:4567
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
120726 16:11:06 [Note] WSREP: view((empty))
120726 16:11:06 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():148
120726 16:11:06 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
120726 16:11:06 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel ‘galeraprimary’ at ‘gcomm://galera1.torreycommerce.net:4567’: -110 (Connection timed out)
120726 16:11:06 [ERROR] WSREP: gcs connect failed: Connection timed out
120726 16:11:06 [ERROR] WSREP: wsrep::connect() failed: 6
120726 16:11:06 [ERROR] Aborting

120726 16:11:06 [Note] WSREP: Service disconnected.
120726 16:11:07 [Note] WSREP: Some threads may fail to exit.
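
Since the excerpts come from different nodes, it can help to interleave them by timestamp. In the lines quoted above, mysqld_safe on the crashing node reports a restart at 16:10:34, one second before the other node first lists that peer as nonlive at 16:10:35, although that ordering is only meaningful if the clocks on the nodes are in sync. Below is a minimal Python sketch of such a merge. The node labels "survivor" and "crashed" and the function names are made up for illustration, and lines that do not start with a timestamp (such as the "at gcomm/src/pc.cpp" continuation) are simply skipped.

```python
import re
from datetime import datetime

# Lines in these logs start with a YYMMDD HH:MM:SS timestamp.
TS_RE = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) (.*)$")


def tag_lines(node, lines):
    """Yield (timestamp, node, message) for each timestamped line."""
    for line in lines:
        match = TS_RE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
            yield when, node, match.group(2)


def merge_logs(logs):
    """Merge {node_label: lines} into one list ordered by timestamp."""
    events = []
    for node, lines in logs.items():
        events.extend(tag_lines(node, lines))
    # Python's sort is stable, so lines with equal timestamps
    # keep their original relative order.
    events.sort(key=lambda event: event[0])
    return events


# A few lines copied from the excerpts above; the labels are hypothetical.
SAMPLE_LOGS = {
    "survivor": [
        "120726 16:11:38 [Warning] WSREP: Failed to report last committed "
        "16744356, -107 (Transport endpoint is not connected)",
        "120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.",
    ],
    "crashed": [
        "120726 16:10:34 mysqld_safe Number of processes running now: 0",
        "120726 16:10:34 mysqld_safe mysqld restarted",
        "120726 16:11:06 [ERROR] WSREP: gcs connect failed: Connection timed out",
    ],
}

if __name__ == "__main__":
    for when, node, message in merge_logs(SAMPLE_LOGS):
        print(when.strftime("%H:%M:%S"), f"[{node}]", message)
```

Reading the merged output top to bottom makes it easier to see which node logged a problem first.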

Any ideas what is going on here?

No one has any ideas? This is still an issue for me: random timeouts that aren’t (or rather, can’t be) network related.

I just did some reading, and it’s quite possible that there is still a bug that occurs when log_bin is enabled. I’ll try disabling it and see whether that fixes the disconnects. If it does, should I file a bug report with the Codership team or with Percona?