I am having an issue with Galera that I’m trying to track down. I have set up a three-node cluster, with the ‘first’ node receiving replication data from an existing remote SQL server.
Things run okay for a few days, and then, seemingly at random, the other nodes appear to lose their connection to the ‘first’ node and MySQL crashes completely (as in, it’s no longer running):
120726 16:10:35 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.163:4567
120726 16:10:36 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) reconnecting to df4e387f-d4e2-11e1-0800-2e6080299165 (tcp://172.30.0.163:4567), attempt 0
120726 16:10:37 [Note] WSREP: remote endpoint tcp://172.30.0.163:4567 changed identity df4e387f-d4e2-11e1-0800-2e6080299165 → 1bc3616b-d777-11e1-0800-4a369f7eed28
120726 16:10:37 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
120726 16:11:08 [Note] WSREP: evs::proto(0b066c90-d4fb-11e1-0800-86a2cf43aaf6, GATHER, view_id(REG,0b066c90-d4fb-11e1-0800-86a2cf43aaf6,11)) suspecting node: df4e387f-d4e2-11e1-0800-2e6080299165
120726 16:11:08 [Note] WSREP: (0b066c90-d4fb-11e1-0800-86a2cf43aaf6, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.163:4567
… and a little later:
120726 16:11:38 [Warning] WSREP: Failed to report last committed 16744356, -107 (Transport endpoint is not connected)
120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.
120726 16:11:38 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
120726 16:11:38 [Note] WSREP: Flow-control interval: [12, 23]
120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120726 16:11:38 [Note] WSREP: New cluster view: global state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:16744400, view# -1: non-Primary, number of nodes: 2, my index: 0, protocol version 2
120726 16:11:38 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
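To follow the membership changes without scrolling through everything, the ‘New COMPONENT’ lines can be pulled out of the error log with a short script. This is only a minimal sketch in Python, assuming the line format shown in the excerpt above. The names `COMPONENT_RE` and `component_changes` are made up for illustration and are not part of Galera or MySQL, and treating `primary = yes` as the primary-component marker is an assumption, since only `primary = no` appears here.

```python
import re

# Assumed line format, copied from the excerpt above:
# 120726 16:11:38 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 2
COMPONENT_RE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) \[Note\] WSREP: New COMPONENT: "
    r"primary = (?P<primary>\w+), bootstrap = (?P<bootstrap>\w+), "
    r"my_idx = (?P<my_idx>-?\d+), memb_num = (?P<memb_num>\d+)"
)


def component_changes(lines):
    """Yield (timestamp, is_primary, member_count) for each component change."""
    for line in lines:
        match = COMPONENT_RE.match(line.strip())
        if match:
            timestamp = f"{match['date']} {match['time']}"
            # Assumption: any value other than "yes" means non-primary.
            is_primary = match["primary"] == "yes"
            yield timestamp, is_primary, int(match["memb_num"])


# Two lines taken from the log above; only the second one matches.
sample = [
    "120726 16:11:38 [Note] WSREP: Received NON-PRIMARY.",
    "120726 16:11:38 [Note] WSREP: New COMPONENT: primary = no, "
    "bootstrap = no, my_idx = 0, memb_num = 2",
]

for timestamp, is_primary, members in component_changes(sample):
    print(timestamp, "primary" if is_primary else "non-primary", members)
```

To run it against a real log, pass an open file object instead of `sample`, for example `component_changes(open(path))`, where `path` is wherever your MySQL error log lives.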
On the node that hard-crashes:
120726 16:10:34 mysqld_safe Number of processes running now: 0
120726 16:10:34 mysqld_safe mysqld restarted
120726 16:10:35 [Note] Flashcache bypass: disabled
120726 16:10:35 [Note] Flashcache setup error is : ioctl failed
120726 16:10:35 [Warning] You need to use --log-bin to make --log-slave-updates work.
120726 16:10:35 [Note] WSREP: Read nil XID from storage engines, skipping position init
120726 16:10:35 [Note] WSREP: wsrep_load(): loading provider library ‘/usr/lib64/libgalera_smm.so’
120726 16:10:36 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
120726 16:10:36 [Note] WSREP: Found saved state: 97e9eadb-d1fa-11e1-0800-d054b2ba0044:-1
120726 16:10:36 [Note] WSREP: Reusing existing ‘/var/lib/mysql//galera.cache’.
120726 16:10:36 [Note] WSREP: Passing config to GCS: base_host = 172.30.0.163; evs.consensus_timeout = PT1M; evs.inactive_check_period = PT10S; evs.inactive_timeout = PT1M; evs.keepalive_period = PT3S; evs.send_window = 1024; evs.suspect_timeout = PT30S; evs.user_send_window = 512; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 1G; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120726 16:10:36 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
120726 16:10:36 [Note] WSREP: wsrep_sst_grab()
120726 16:10:36 [Note] WSREP: Start replication
120726 16:10:36 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:4567
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 1
120726 16:10:36 [Note] WSREP: EVS version 0
120726 16:10:36 [Note] WSREP: PC version 0
120726 16:10:36 [Note] WSREP: gcomm: connecting to group ‘galeraprimary’, peer ‘galera1.torreycommerce.net:4567’
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://172.30.0.154:4567
120726 16:10:36 [Note] WSREP: (1bc3616b-d777-11e1-0800-4a369f7eed28, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
120726 16:11:06 [Note] WSREP: view((empty))
120726 16:11:06 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():148
120726 16:11:06 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
120726 16:11:06 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel ‘galeraprimary’ at ‘gcomm://galera1.torreycommerce.net:4567’: -110 (Connection timed out)
120726 16:11:06 [ERROR] WSREP: gcs connect failed: Connection timed out
120726 16:11:06 [ERROR] WSREP: wsrep::connect() failed: 6
120726 16:11:06 [ERROR] Aborting
120726 16:11:06 [Note] WSREP: Service disconnected.
120726 16:11:07 [Note] WSREP: Some threads may fail to exit.
Any ideas what is going on here?