It has now happened twice.
Nodes in the cluster become inaccessible via MySQL, hours apart from each other, which leaves the whole cluster inaccessible. We're running Percona XtraDB Cluster 5.6.
When I checked the state of each node (node 2 is no longer in service), I found the following; the checks I ran are sketched just after the list:
- node 1: mysql service still running
- node 3: mysql service no longer running
- node 4: mysql service no longer running
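For reference, the per-node check was roughly the following (the exact service command may differ depending on the init scripts in use):

# did mysqld survive on this node?
service mysql status
ps aux | grep mysqld

# where mysqld was still up, check the Galera view of the cluster
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"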
The following are the logs from each of the nodes.
Node1
2014-05-15 02:51:44 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112080
2014-05-15 02:52:29 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112110
Node3
2014-05-13 01:41:36 29139 [Warning] WSREP: SYNC message from member 0 in non-primary configuration. Ignored.
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: sent state msg: 71ba447f-da7a-11e3-8952-f2887c244e09
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 0 (node1)
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 1 (node3)
2014-05-13 01:41:36 29139 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 84,
members = 2/2 (joined/total),
act_id = 1451246,
last_appl. = 1451221,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:36 29139 [Note] WSREP: Flow-control interval: [23, 23]
2014-05-13 01:41:36 29139 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451246, view# 85: Primary, number of nodes: 2, my index: 1, protocol version 2
2014-05-13 01:41:36 29139 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:36 29139 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:36 29139 [Note] WSREP: Assign initial position for certification: 1451246, protocol version: 3
2014-05-13 01:41:36 29139 [Note] WSREP: Member 0 (node1) synced with group.
2014-05-13 01:41:36 29139 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:41 29139 [Note] WSREP: cleaning up 7ecbfb50-d356-11e3-bc64-ea7ee5869f45 (tcp://192.168.0.63:4567)
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:43:38 29139 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.52864S), skipping check
2014-05-13 03:50:40 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
Killed
140513 04:09:47 mysqld_safe Number of processes running now: 0
140513 04:09:47 mysqld_safe WSREP: not restarting wsrep node automatically
140513 04:09:47 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
Node4
2014-05-13 01:41:27 32604 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-05-13 01:41:27 32604 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 0 (node4)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 1 (node1)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 2 (node3)
2014-05-13 01:41:27 32604 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 83,
members = 2/3 (joined/total),
act_id = 1451231,
last_appl. = 1451210,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:27 32604 [Note] WSREP: Flow-control interval: [28, 28]
2014-05-13 01:41:27 32604 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1451231)
2014-05-13 01:41:27 32604 [Note] WSREP: State transfer required:
Group state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231
Local state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451222
2014-05-13 01:41:27 32604 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231, view# 84: Primary, number of nodes: 3, my index: 0, protocol version 2
2014-05-13 01:41:27 32604 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-13 01:41:29 32604 [Note] WSREP: killing local connection: 166283
2014-05-13 01:41:29 32604 [ERROR] WSREP: Node IP address not obtained from bind_address, trying alternate methods
2014-05-13 01:41:29 32604 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2014-05-13 01:41:29 32604 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:29 32604 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:29 32604 [Note] WSREP: Assign initial position for certification: 1451231, protocol version: 3
2014-05-13 01:41:29 32604 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:29 32604 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.63:4568
2014-05-13 01:41:29 32604 [Note] WSREP: Node 0.0 (node4) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2014-05-13 01:41:29 32604 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1451238)
2014-05-13 01:41:29 32604 [Note] WSREP: Requesting state transfer: success, donor: 1
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 85194386 bytes
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/7 locked buffers
2014-05-13 01:41:29 32604 [Note] WSREP: Receiving IST: 9 writesets, seqnos 1451222-1451231
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/ubuntu-precise-64bit/target/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)' failed.
08:41:32 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=325
max_threads=502
thread_count=9
connection_count=9
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 208471 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f01c0000990
stack_bottom = 7f023c2b9a50 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8ea20e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68d104]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f028290fcb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f0281d69445]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f0281d6cbab]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7f0281d6210e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7f0281d621b2]
/usr/sbin/mysqld[0x5d81ab]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x152)[0x7f02621b43b2]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM8recv_ISTEPv+0x294)[0x7f02621c1e04]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x332)[0x7f02621b6682]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f02621c8bbe]
/usr/sbin/mysqld[0x5d83d1]
/usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c0f5d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f0282907e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0281e254bd]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 5
Status: NOT_KILLED
140513 01:41:32 mysqld_safe Number of processes running now: 0
140513 01:41:32 mysqld_safe WSREP: not restarting wsrep node automatically
140513 01:41:32 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
From what I understand of the logs, it seems Node4 crashed while trying to sync with the group, and its crash report suggests this could be a bug. Node3 then crashed a few hours after syncing with Node1. Could the fact that only two nodes were left running have caused this?
The MySQL service on Node1 was still running, but it kept trying to reconnect to Node4 and could not be accessed. Why? Why wasn't it able to keep serving on its own?
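For what it's worth, the plan for the next occurrence is to first confirm whether the surviving node has dropped out of the Primary component, and only then decide whether to force it back manually; as far as I understand, pc.bootstrap tells a node to form a new Primary component on its own, so it should only be used once we're sure the other nodes are really gone:

# on the node where mysqld is still running (Node1 in our case)
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"   # 'non-Primary' would mean quorum was lost
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"

# only if the other nodes are confirmed down and this node has the most advanced data
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'"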
Here’s the cluster address configuration for each of the nodes.
wsrep_cluster_address = "gcomm://192.168.0.60,192.168.0.62,192.168.0.63"
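For completeness, the rest of the wsrep section is along these lines on each node (the node address differs per node and the cluster name here is just a placeholder; the provider path and the rsync SST method match what shows up in the Node4 log and stack trace):

# excerpt from my.cnf (illustrative; only wsrep_cluster_address above is quoted verbatim)
wsrep_provider        = /usr/lib/libgalera_smm.so
wsrep_cluster_address = "gcomm://192.168.0.60,192.168.0.62,192.168.0.63"
wsrep_cluster_name    = my_cluster
wsrep_node_address    = 192.168.0.60   # .62 / .63 on the other nodes
wsrep_sst_method      = rsync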
Any suggestions on how we can prevent this from happening again?
Thanks!