All nodes in the cluster become inaccessible

It has now happened twice.

Nodes in the cluster become inaccessible (via MySQL) hours apart from each other, which leaves the whole cluster inaccessible. We’re using Percona XtraDB Cluster 5.6.

When I checked the state of each node, I found the following (node 2 is no longer in service); the commands I used are shown after the list:

  • node 1: mysql service still running
  • node 3: mysql service no longer running
  • node 4: mysql service no longer running
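
For reference, this is roughly how I check each node when this happens (assuming the mysql client is installed locally on every box; the exact service name may differ):

service mysql status
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"        # Primary or non-Primary
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"   # Synced, Donor, Joining, ...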

The following are the logs from each of the nodes.

Node1


2014-05-15 02:51:44 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112080
2014-05-15 02:52:29 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112110

Node3

2014-05-13 01:41:36 29139 [Warning] WSREP: SYNC message from member 0 in non-primary configuration. Ignored.
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: sent state msg: 71ba447f-da7a-11e3-8952-f2887c244e09
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 0 (node1)
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 1 (node3)
2014-05-13 01:41:36 29139 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 84,
members = 2/2 (joined/total),
act_id = 1451246,
last_appl. = 1451221,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:36 29139 [Note] WSREP: Flow-control interval: [23, 23]
2014-05-13 01:41:36 29139 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451246, view# 85: Primary, number of nodes: 2, my index: 1, protocol version 2
2014-05-13 01:41:36 29139 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:36 29139 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:36 29139 [Note] WSREP: Assign initial position for certification: 1451246, protocol version: 3
2014-05-13 01:41:36 29139 [Note] WSREP: Member 0 (node1) synced with group.
2014-05-13 01:41:36 29139 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:41 29139 [Note] WSREP: cleaning up 7ecbfb50-d356-11e3-bc64-ea7ee5869f45 (tcp://192.168.0.63:4567)
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:43:38 29139 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.52864S), skipping check
2014-05-13 03:50:40 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
Killed
140513 04:09:47 mysqld_safe Number of processes running now: 0
140513 04:09:47 mysqld_safe WSREP: not restarting wsrep node automatically
140513 04:09:47 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Node4


2014-05-13 01:41:27 32604 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-05-13 01:41:27 32604 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 0 (node4)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 1 (node1)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 2 (node3)
2014-05-13 01:41:27 32604 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 83,
members = 2/3 (joined/total),
act_id = 1451231,
last_appl. = 1451210,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:27 32604 [Note] WSREP: Flow-control interval: [28, 28]
2014-05-13 01:41:27 32604 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1451231)
2014-05-13 01:41:27 32604 [Note] WSREP: State transfer required:
Group state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231
Local state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451222
2014-05-13 01:41:27 32604 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231, view# 84: Primary, number of nodes: 3, my index: 0, protocol version 2
2014-05-13 01:41:27 32604 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-13 01:41:29 32604 [Note] WSREP: killing local connection: 166283
2014-05-13 01:41:29 32604 [ERROR] WSREP: Node IP address not obtained from bind_address, trying alternate methods
2014-05-13 01:41:29 32604 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2014-05-13 01:41:29 32604 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:29 32604 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:29 32604 [Note] WSREP: Assign initial position for certification: 1451231, protocol version: 3
2014-05-13 01:41:29 32604 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:29 32604 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.63:4568
2014-05-13 01:41:29 32604 [Note] WSREP: Node 0.0 (node4) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2014-05-13 01:41:29 32604 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1451238)
2014-05-13 01:41:29 32604 [Note] WSREP: Requesting state transfer: success, donor: 1
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 85194386 bytes
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/7 locked buffers
2014-05-13 01:41:29 32604 [Note] WSREP: Receiving IST: 9 writesets, seqnos 1451222-1451231
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/ubuntu-precise-64bit/target/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)' failed.
08:41:32 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=325
max_threads=502
thread_count=9
connection_count=9
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 208471 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f01c0000990

stack_bottom = 7f023c2b9a50 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8ea20e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68d104]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f028290fcb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f0281d69445]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f0281d6cbab]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7f0281d6210e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7f0281d621b2]
/usr/sbin/mysqld[0x5d81ab]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x152)[0x7f02621b43b2]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM8recv_ISTEPv+0x294)[0x7f02621c1e04]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x332)[0x7f02621b6682]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f02621c8bbe]
/usr/sbin/mysqld[0x5d83d1]
/usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c0f5d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f0282907e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0281e254bd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 5
Status: NOT_KILLED

140513 01:41:32 mysqld_safe Number of processes running now: 0
140513 01:41:32 mysqld_safe WSREP: not restarting wsrep node automatically
140513 01:41:32 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

From what I understand from the logs, it seems Node4 crashed while trying to sync with the group, and the assertion failure suggests it could possibly be a bug. Node3 then crashed a few hours after syncing with Node1. Could the fact that only two nodes were left running have caused this?

The MySQL service on Node1 was still running, but it kept trying to reconnect to Node4 and could not be accessed. Why? Why wasn’t it able to stand on its own?

Here’s the cluster address configuration for each of the nodes.

wsrep_cluster_address = "gcomm://192.168.0.60,192.168.0.62,192.168.0.63"
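
For what it’s worth, this is how I confirm the value actually in effect on each running node, and which peers that node currently sees (a quick check, assuming local access to the mysql client):

mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_cluster_address';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses';"   # peers this node currently sees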

Any suggestions on how we can prevent this from happening again?

Thanks!

Encountered another crash on Node4 just now:

2014-05-16 05:31:11 21958 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-16 05:31:11 21958 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1340
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1134
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 25
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1173
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1343
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1101
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1268
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1070
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 983
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1041
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1385
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1238
2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1215
2014-05-16 05:31:13 21958 [ERROR] WSREP: Node IP address not obtained from bind_address, trying alternate methods
2014-05-16 05:31:14 21958 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2014-05-16 05:31:14 21958 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-16 05:31:14 21958 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-16 05:31:14 21958 [Note] WSREP: Assign initial position for certification: 1614379, protocol version: 3
2014-05-16 05:31:14 21958 [Note] WSREP: Service thread queue flushed.
2014-05-16 05:31:14 21958 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.63:4568
2014-05-16 05:31:14 21958 [Note] WSREP: Node 1.0 (node4) requested state transfer from '*any*'. Selected 0.0 (node3)(SYNCED) as donor.
2014-05-16 05:31:14 21958 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1614386)
2014-05-16 05:31:14 21958 [Note] WSREP: Requesting state transfer: success, donor: 0
2014-05-16 05:31:14 21958 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 109701967 bytes
2014-05-16 05:31:14 21958 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/29 locked buffers
2014-05-16 05:31:14 21958 [Note] WSREP: Receiving IST: 46 writesets, seqnos 1614333-1614379
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/ubuntu-precise-64bit/target/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)' failed.
12:31:14 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=36
max_threads=502
thread_count=9
connection_count=9
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 208471 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f6ca8000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f6d54631a50 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8ea20e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68d104]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f6d82382cb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f6d817dc445]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f6d817dfbab]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7f6d817d510e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7f6d817d51b2]
/usr/sbin/mysqld[0x5d81ab]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x152)[0x7f6d61bcf3b2]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM8recv_ISTEPv+0x294)[0x7f6d61bdce04]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x332)[0x7f6d61bd1682]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f6d61be3bbe]
/usr/sbin/mysqld[0x5d83d1]
/usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c0f5d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f6d8237ae9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6d818984bd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 8
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
140516 05:31:15 mysqld_safe Number of processes running now: 0
140516 05:31:15 mysqld_safe WSREP: not restarting wsrep node automatically
140516 05:31:15 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

The line “/usr/sbin/mysqld: Sort aborted: Query execution was interrupted” indicates that there is a problem with one query. Perhaps see “MySQL Error: Sort aborted” on Stack Overflow.
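
To catch which query that is, something along these lines usually helps (not specific to PXC):

mysql -e "SHOW FULL PROCESSLIST;"                          # look for long-running queries doing a filesort
mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';"   # or enable the slow query log and check it afterwards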

Sojaducana, it seems you’ve hit this bug:
https://bugs.launchpad.net/galera/+bug/1284803
It was already fixed in Galera 3.5:
https://launchpad.net/percona-xtradb-cluster/+milestone/galera-3.5
So please upgrade the PXC packages.
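
A rolling upgrade, one node at a time, should be enough. On Ubuntu it would look roughly like this, assuming the standard Percona apt packages (percona-xtradb-cluster-56 and percona-xtradb-cluster-galera-3.x); adjust to your repository setup and take a backup first:

apt-get update
apt-get install --only-upgrade percona-xtradb-cluster-56 percona-xtradb-cluster-galera-3.x
service mysql restart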

Hi przemek,

How can I find out which specific Galera package version our system is running?

Edit:

Never mind, I found the info I needed with:

dpkg -s percona-xtradb-cluster-galera-3.x
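
The Galera version in use can also be confirmed from within MySQL:

mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';"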

Hi guys,

I haven’t encountered the same issues described above (as far as the logs are concerned) for some time now, since the packages were updated.

But earlier today the cluster went down again: all of the nodes were reported as inaccessible by our proxy. Upon checking each of the nodes, I found that the MySQL services were still running.

From what little I understand of the logs, it seems that the nodes suddenly couldn’t see each other.

Please help in understanding what exactly is happening here.

Node 1

Node 3 (Part 1)

Node 3 (Part 2)

Node 4

Did you actually check the network connectivity between the nodes once this happened? Are those nodes separate machines, or maybe VMs, possibly sharing the same host machine or storage?
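
One quick way to test this the next time it happens is to probe the Galera ports from each node to the others, for example (IPs taken from the wsrep_cluster_address above):

ping -c 3 192.168.0.62
nc -zv 192.168.0.62 4567   # group communication
nc -zv 192.168.0.62 4568   # IST
nc -zv 192.168.0.62 4444   # SST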