Percona XtraDB Cluster Node Crashing Once a Day

elijahpaul · June 4, 2013, 8:59pm

I have a 3 node Percona cluster setup on EC2 instances.

Node 1 of the cluster (orginaly the primary node) is now failing once a day, and I can’t figure out from the error log what is causing it.

Log exert:

130604 7:32:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: sent state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 0 (node2)
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 2 (node1)
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 1 (node3)
130604 7:32:18 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 359,
members = 3/3 (joined/total),
act_id = 433339,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
130604 7:32:18 [Note] WSREP: Flow-control interval: [28, 28]
130604 7:32:18 [Note] WSREP: Restored state OPEN -> SYNCED (433339)
130604 7:32:18 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433339, view# 360: Primary, number of nodes: 3, my index: 2, protocol version 2
130604 7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 7:32:18 [Note] WSREP: Assign initial position for certification: 433339, protocol version: 2
130604 7:32:18 [Note] WSREP: Synchronized with group, ready for connections
130604 7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 7:36:57 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 7:37:04 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:09 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:09 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,477) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:09 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:09 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 433353)
130604 8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130604 8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:09 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,478) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:09 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130604 8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:10 [Note] WSREP: declaring 0ccb04a4-c970-11e2-0800-838856078bd7 stable
130604 8:07:10 [Note] WSREP: declaring 11f53dbb-c971-11e2-0800-987abecafd36 stable
130604 8:07:10 [Note] WSREP: Node 0ccb04a4-c970-11e2-0800-838856078bd7 state prim
130604 8:07:10 [Note] WSREP: view(view_id(PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
})
130604 8:07:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: sent state msg: c19d7175-cced-11e2-0800-74f173924529
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 0 (node2)
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 1 (node3)
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 2 (node1)
130604 8:07:10 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 360,
members = 2/3 (joined/total),
act_id = 433354,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
130604 8:07:10 [Note] WSREP: Flow-control interval: [28, 28]
130604 8:07:10 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 433354)
130604 8:07:10 [Note] WSREP: State transfer required:
Group state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354
Local state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353
130604 8:07:10 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354, view# 361: Primary, number of nodes: 3, my index: 2, protocol version 2
130604 8:07:10 [Warning] WSREP: Gap in state sequence. Need state transfer.
130604 8:07:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:16 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
130604 8:07:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:16 [Note] WSREP: Assign initial position for certification: 433354, protocol version: 2
130604 8:07:16 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.58.249.142:4568
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.3:4567 tcp://192.168.0.2:4567
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:20 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130604 8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 0ccb04a4-c970-11e2-0800-838856078bd7
130604 8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 11f53dbb-c971-11e2-0800-987abecafd36
130604 8:07:23 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:23 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 433354)
130604 8:07:23 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_state_request():950: Requesting state transfer while in NON-PRIMARY. Ignoring.
130604 8:07:23 [ERROR] WSREP: Requesting state transfer failed: -125(Operation canceled)
130604 8:07:23 [ERROR] WSREP: State transfer request failed unrecoverably: 125 (Operation canceled). Most likely it is due to inability to communicate with the cluster primary$
130604 8:07:23 [Note] WSREP: Closing send monitor...
130604 8:07:23 [Note] WSREP: Closed send monitor.
130604 8:07:23 [Note] WSREP: gcomm: terminating thread
130604 8:07:23 [Note] WSREP: gcomm: joining thread
130604 8:07:23 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,480) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:23 [Note] WSREP: gcomm: closing backend
130604 8:07:23 [Note] WSREP: view((empty))
130604 8:07:23 [Note] WSREP: gcomm: closed
130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:23 [Note] WSREP: Received self-leave message.
130604 8:07:23 [Note] WSREP: Flow-control interval: [0, 0]
130604 8:07:23 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130604 8:07:23 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 433354)
130604 8:07:23 [Note] WSREP: RECV thread exiting 0: Success
130604 8:07:23 [Note] WSREP: recv_thread() joined.
130604 8:07:23 [Note] WSREP: Closing slave action queue.
130604 8:07:23 [Note] WSREP: /usr/sbin/mysqld: Terminated.

If i restart the node, it re-joins the cluster fine and runs without errors until the next crash at approximately the same amount of time lapse from the restart as the last crash (hope that makes sense!).

Any help or suggestions as to why this is occurring appreciated.

Eli

percona.jayj · June 5, 2013, 10:39am

It’d be good to see what is in the log before this (if anything). It looks to me like the node lost connection tot he cluster, tried to reconnect, couldn’t get an IST, then shutdown as a result. Doesn’t seem like a “crash” per-se.

elijahpaul · June 6, 2013, 7:35am

Yep. You’re right. ‘Crash’ is not really the right word. I should really re-title it to ‘Cluster Node Losing Connection Once a Day’.

I’ve updated my original post with some of the log, before the node loses its connection.

Hope that helps.

Topic		Replies	Views
Percona xtradb cluster primary node failure Percona XtraDB Cluster 5.x percona	1	181	January 10, 2025
Percona Xtradb Cluster Master node down Percona XtraDB Cluster 5.x	1	213	January 22, 2025
cluster crashes on Node Crash Percona XtraDB Cluster 5.x	0	507	October 19, 2012
All Node is crash!!!! Percona XtraDB Cluster 5.x	0	1136	February 23, 2014
Have an issue with cluster 2 Nodes keep dropping offline every day & rejoin issues. Percona XtraDB Cluster 5.x	2	1696	June 19, 2015

Percona XtraDB Cluster Node Crashing Once a Day

Related topics