
All nodes in the cluster become inaccessible

sojaducana
It has now happened twice.

Nodes in the cluster become inaccessible (via MySQL) hours apart from each other, which leaves the whole cluster inaccessible. We're using Percona XtraDB Cluster 5.6.

When I checked the state of each node, I found that (node 2 is no longer in service):
- node 1 still has the mysql service running
- node 3 mysql service no longer running
- node 4 mysql service no longer running

The following are the logs from each of the nodes.

Node1
2014-05-15 02:51:44 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112080
2014-05-15 02:52:29 25544 [Note] WSREP: (da81411f-d34b-11e3-b420-c65dd499bd5c, 'tcp://0.0.0.0:4567') reconnecting to eb65461e-d34b-11e3-b8e3-16c13b906ff5 (tcp://192.168.0.62:4567), attempt 112110



Node3
2014-05-13 01:41:36 29139 [Warning] WSREP: SYNC message from member 0 in non-primary configuration. Ignored.
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: sent state msg: 71ba447f-da7a-11e3-8952-f2887c244e09
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 0 (node1)
2014-05-13 01:41:36 29139 [Note] WSREP: STATE EXCHANGE: got state msg: 71ba447f-da7a-11e3-8952-f2887c244e09 from 1 (node3)
2014-05-13 01:41:36 29139 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 84,
    members    = 2/2 (joined/total),
    act_id     = 1451246,
    last_appl. = 1451221,
    protocols  = 0/5/2 (gcs/repl/appl),
    group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:36 29139 [Note] WSREP: Flow-control interval: [23, 23]
2014-05-13 01:41:36 29139 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451246, view# 85: Primary, number of nodes: 2, my index: 1, protocol version 2
2014-05-13 01:41:36 29139 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:36 29139 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:36 29139 [Note] WSREP: Assign initial position for certification: 1451246, protocol version: 3
2014-05-13 01:41:36 29139 [Note] WSREP: Member 0 (node1) synced with group.
2014-05-13 01:41:36 29139 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:41 29139 [Note] WSREP:  cleaning up 7ecbfb50-d356-11e3-bc64-ea7ee5869f45 (tcp://192.168.0.63:4567)
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:41:50 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
2014-05-13 03:43:38 29139 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.52864S), skipping check
2014-05-13 03:50:40 29139 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
Killed
140513 04:09:47 mysqld_safe Number of processes running now: 0
140513 04:09:47 mysqld_safe WSREP: not restarting wsrep node automatically
140513 04:09:47 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended


Node4

2014-05-13 01:41:27 32604 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2014-05-13 01:41:27 32604 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 0 (node4)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 1 (node1)
2014-05-13 01:41:27 32604 [Note] WSREP: STATE EXCHANGE: got state msg: 5f6aa149-da7a-11e3-9e55-47df500e7415 from 2 (node3)
2014-05-13 01:41:27 32604 [Note] WSREP: Quorum results:
     version    = 3,
     component  = PRIMARY,
     conf_id    = 83,
     members    = 2/3 (joined/total),
     act_id     = 1451231,
     last_appl. = 1451210,
     protocols  = 0/5/2 (gcs/repl/appl),
     group UUID = da819e4a-d34b-11e3-bb14-4a770834cff0
2014-05-13 01:41:27 32604 [Note] WSREP: Flow-control interval: [28, 28]
2014-05-13 01:41:27 32604 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1451231)
2014-05-13 01:41:27 32604 [Note] WSREP: State transfer required:
     Group state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231
     Local state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451222
2014-05-13 01:41:27 32604 [Note] WSREP: New cluster view: global state: da819e4a-d34b-11e3-bb14-4a770834cff0:1451231, view# 84: Primary, number of nodes: 3, my index: 0, protocol version 2
2014-05-13 01:41:27 32604 [Warning] WSREP: Gap in state sequence. Need state transfer.
2014-05-13 01:41:29 32604 [Note] WSREP: killing local connection: 166283
2014-05-13 01:41:29 32604 [ERROR] WSREP: Node IP address not obtained from bind_address, trying alternate methods
2014-05-13 01:41:29 32604 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2014-05-13 01:41:29 32604 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-13 01:41:29 32604 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-05-13 01:41:29 32604 [Note] WSREP: Assign initial position for certification: 1451231, protocol version: 3
2014-05-13 01:41:29 32604 [Note] WSREP: Service thread queue flushed.
2014-05-13 01:41:29 32604 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.63:4568
2014-05-13 01:41:29 32604 [Note] WSREP: Node 0.0 (node4) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
2014-05-13 01:41:29 32604 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1451238)
2014-05-13 01:41:29 32604 [Note] WSREP: Requesting state transfer: success, donor: 1
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 85194386 bytes
2014-05-13 01:41:29 32604 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/7 locked buffers
2014-05-13 01:41:29 32604 [Note] WSREP: Receiving IST: 9 writesets, seqnos 1451222-1451231
mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/ubuntu-precise-64bit/target/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)' failed.
08:41:32 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=325
max_threads=502
thread_count=9
connection_count=9
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 208471 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f01c0000990

stack_bottom = 7f023c2b9a50 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8ea20e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68d104]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f028290fcb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f0281d69445]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f0281d6cbab]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7f0281d6210e]
/lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7f0281d621b2]
/usr/sbin/mysqld[0x5d81ab]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x152)[0x7f02621b43b2]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM8recv_ISTEPv+0x294)[0x7f02621c1e04]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x332)[0x7f02621b6682]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f02621c8bbe]
/usr/sbin/mysqld[0x5d83d1]
/usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c0f5d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f0282907e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0281e254bd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 5
Status: NOT_KILLED

140513 01:41:32 mysqld_safe Number of processes running now: 0
140513 01:41:32 mysqld_safe WSREP: not restarting wsrep node automatically
140513 01:41:32 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended



From what I understand of the logs, it seems Node4 crashed while trying to sync with the group, and the trace indicates it could be a bug. Node3 then crashed a few hours after syncing with Node1. Could the fact that only two nodes remained running have caused this?

The MySQL service on Node1 was still running, but it kept trying to reconnect to Node4 and could not be accessed. Why? Why wasn't it able to stand on its own?
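
I suspect Node1 dropped to a non-Primary component after losing quorum, but I'm not sure. For reference, this is how I've been checking the cluster state on the node that stayed up, and what I understand could force a lone surviving node back to Primary once the other nodes are confirmed dead (the mysql calls assume a local client login; please correct me if this is the wrong approach):

# is the node still part of a Primary component, and how many nodes does it see?
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"

# if it reports non-Primary and the other nodes are really down,
# force this single node back to Primary (on one node only)
mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"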


Here's the cluster address configuration for each of the nodes.

wsrep_cluster_address          = "gcomm://192.168.0.60,192.168.0.62,192.168.0.63"



Any suggestions on how we can prevent this from happening again?
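
One preventive change I'm also wondering about: the Node4 log warns that the configured 'rsync' SST cannot be performed on a running server. Would switching to a non-blocking SST method (xtrabackup-v2, if our version supports it) be worth doing? A rough my.cnf sketch, where the sstuser credentials are placeholders I'd still have to create and grant:

[mysqld]
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth   = "sstuser:sstpassword"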

Thanks!

Comments

  • sojaducana
    Encountered another crash on Node4 just now:
    2014-05-16 05:31:11 21958 [Warning] WSREP: Gap in state sequence. Need state transfer.
    2014-05-16 05:31:11 21958 [ERROR] /usr/sbin/mysqld: Sort aborted: Query execution was interrupted
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1340
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1134
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 25
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1173
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1343
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1101
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1268
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1070
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 983
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1041
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1385
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1238
    2014-05-16 05:31:13 21958 [Note] WSREP: killing local connection: 1215
    2014-05-16 05:31:13 21958 [ERROR] WSREP: Node IP address not obtained from bind_address, trying alternate methods
    2014-05-16 05:31:14 21958 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
    2014-05-16 05:31:14 21958 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    2014-05-16 05:31:14 21958 [Note] WSREP: REPL Protocols: 5 (3, 1)
    2014-05-16 05:31:14 21958 [Note] WSREP: Assign initial position for certification: 1614379, protocol version: 3
    2014-05-16 05:31:14 21958 [Note] WSREP: Service thread queue flushed.
    2014-05-16 05:31:14 21958 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.63:4568
    2014-05-16 05:31:14 21958 [Note] WSREP: Node 1.0 (node4) requested state transfer from '*any*'. Selected 0.0 (node3)(SYNCED) as donor.
    2014-05-16 05:31:14 21958 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1614386)
    2014-05-16 05:31:14 21958 [Note] WSREP: Requesting state transfer: success, donor: 0
    2014-05-16 05:31:14 21958 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 109701967 bytes
    2014-05-16 05:31:14 21958 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 0/29 locked buffers
    2014-05-16 05:31:14 21958 [Note] WSREP: Receiving IST: 46 writesets, seqnos 1614333-1614379
    mysqld: /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/ubuntu-precise-64bit/target/Percona-XtraDB-Cluster-5.6.15/sql/wsrep_applier.cc:321: wsrep_cb_status_t wsrep_commit_cb(void*, uint32_t, const wsrep_trx_meta_t*, wsrep_bool_t*, bool): Assertion `meta->gtid.seqno == wsrep_thd_trx_seqno(thd)' failed.
    12:31:14 UTC - mysqld got signal 6 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    We will try our best to scrape up some info that will hopefully help
    diagnose the problem, but since we have already crashed, 
    something is definitely wrong and this may fail.
    Please help us make Percona XtraDB Cluster better by reporting any
    bugs at https://bugs.launchpad.net/percona-xtradb-cluster
    
    key_buffer_size=8388608
    read_buffer_size=131072
    max_used_connections=36
    max_threads=502
    thread_count=9
    connection_count=9
    It is possible that mysqld could use up to 
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 208471 K  bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.
    
    Thread pointer: 0x7f6ca8000990
    Attempting backtrace. You can use the following information to find out
    where mysqld died. If you see no messages after this, something went
    terribly wrong...
    stack_bottom = 7f6d54631a50 thread_stack 0x40000
    /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8ea20e]
    /usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68d104]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f6d82382cb0]
    /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f6d817dc445]
    /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f6d817dfbab]
    /lib/x86_64-linux-gnu/libc.so.6(+0x2f10e)[0x7f6d817d510e]
    /lib/x86_64-linux-gnu/libc.so.6(+0x2f1b2)[0x7f6d817d51b2]
    /usr/sbin/mysqld[0x5d81ab]
    /usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x152)[0x7f6d61bcf3b2]
    /usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM8recv_ISTEPv+0x294)[0x7f6d61bdce04]
    /usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x332)[0x7f6d61bd1682]
    /usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f6d61be3bbe]
    /usr/sbin/mysqld[0x5d83d1]
    /usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c0f5d]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f6d8237ae9a]
    /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6d818984bd]
    
    Trying to get some variables.
    Some pointers may be invalid and cause the dump to abort.
    Query (0): is an invalid pointer
    Connection ID (thread ID): 8
    Status: NOT_KILLED
    
    You may download the Percona XtraDB Cluster operations manual by visiting
    http://www.percona.com/software/percona-xtradb-cluster/. You may find information
    in the manual which will help you identify the cause of the crash.
    140516 05:31:15 mysqld_safe Number of processes running now: 0
    140516 05:31:15 mysqld_safe WSREP: not restarting wsrep node automatically
    140516 05:31:15 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
    
  • emakon
    The line "/usr/sbin/mysqld: Sort aborted: Query execution was interrupted" indicates that there is a problem with one of the queries. Perhaps see here: http://stackoverflow.com/questions/5165112/mysql-error-sort-aborted
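    One way to narrow down which query is being interrupted is to watch the running statements and temporarily enable the slow query log while the problem occurs, for example (assuming a local client login):
    mysql -e "SHOW FULL PROCESSLIST;"
    mysql -e "SET GLOBAL slow_query_log = ON;"
    mysql -e "SET GLOBAL long_query_time = 1;"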
  • przemek (Percona Support Engineer)
    Sojaducana, it seems you've hit this bug:
    https://bugs.launchpad.net/galera/+bug/1284803
    It was already fixed in Galera 3.5:
    https://launchpad.net/percona-xtradb-cluster/+milestone/galera-3.5
    So please upgrade the PXC packages.
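    On Ubuntu/Debian that would be roughly the following (exact package names may differ on your release, and the upgrade should be done one node at a time so the cluster keeps quorum):
    apt-get update
    apt-get install --only-upgrade percona-xtradb-cluster-server-5.6 percona-xtradb-cluster-galera-3.x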
  • sojaducana
    Hi przemek,

    How can I find out which Galera package version our system is running?

    Edit:

    Never mind, I found the info I need with:
    dpkg -s percona-xtradb-cluster-galera-3.x
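    The running provider version can also be read from the server itself:
    mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';"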
    
  • sojaducana
    Hi guys,

    Since the packages were updated, I haven't encountered the issues above for some time now (as far as the logs are concerned).

    But earlier today the cluster went down again: all of the nodes were reported as inaccessible by our proxy. Upon checking each of the nodes, I found that the MySQL services were still running.

    From what little I understand of the logs, it seems that the nodes suddenly couldn't see each other.

    Please help in understanding what exactly is happening here.

    Node 1
    [code]
  • sojaducana
    Node 3 (Part 1)
  • sojaducana
    Node 3 (part 2)

    [code]
  • sojaducana
    Node 4

    [code]
  • przemek (Percona Support Engineer)
    Did you actually check the network connectivity between the nodes once this happened? Are those nodes separate physical machines, or maybe VMs, possibly sharing the same host machine or storage?
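    A quick way to verify this the next time it happens is to test the Galera ports between the nodes from each side, e.g. (IPs taken from your wsrep_cluster_address; 4567 is group communication, 4568 is IST, 4444 is the default SST port):
    ping -c 3 192.168.0.62
    nc -z -v -w 2 192.168.0.62 4567
    nc -z -v -w 2 192.168.0.62 4568
    nc -z -v -w 2 192.168.0.62 4444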