I have a 3 node Percona cluster setup on EC2 instances.
Node 1 of the cluster (orginaly the primary node) is now failing once a day, and I can’t figure out from the error log what is causing it.
Log exert:
130604 7:32:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: sent state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 0 (node2)
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 2 (node1)
130604 7:32:18 [Note] WSREP: STATE EXCHANGE: got state msg: e2e2cfa1-cce8-11e2-0800-e0694f428b46 from 1 (node3)
130604 7:32:18 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 359,
members = 3/3 (joined/total),
act_id = 433339,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
130604 7:32:18 [Note] WSREP: Flow-control interval: [28, 28]
130604 7:32:18 [Note] WSREP: Restored state OPEN -> SYNCED (433339)
130604 7:32:18 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433339, view# 360: Primary, number of nodes: 3, my index: 2, protocol version 2
130604 7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 7:32:18 [Note] WSREP: Assign initial position for certification: 433339, protocol version: 2
130604 7:32:18 [Note] WSREP: Synchronized with group, ready for connections
130604 7:32:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 7:36:57 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 7:37:04 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:09 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:09 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,477) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:09 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:09 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 433353)
130604 8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130604 8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:09 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,478) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:09 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:09 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:09 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:09 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130604 8:07:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:10 [Note] WSREP: declaring 0ccb04a4-c970-11e2-0800-838856078bd7 stable
130604 8:07:10 [Note] WSREP: declaring 11f53dbb-c971-11e2-0800-987abecafd36 stable
130604 8:07:10 [Note] WSREP: Node 0ccb04a4-c970-11e2-0800-838856078bd7 state prim
130604 8:07:10 [Note] WSREP: view(view_id(PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
})
130604 8:07:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: sent state msg: c19d7175-cced-11e2-0800-74f173924529
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 0 (node2)
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 1 (node3)
130604 8:07:10 [Note] WSREP: STATE EXCHANGE: got state msg: c19d7175-cced-11e2-0800-74f173924529 from 2 (node1)
130604 8:07:10 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 360,
members = 2/3 (joined/total),
act_id = 433354,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e38d9bfc-7d4e-11e2-0800-9446319d18cf
130604 8:07:10 [Note] WSREP: Flow-control interval: [28, 28]
130604 8:07:10 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 433354)
130604 8:07:10 [Note] WSREP: State transfer required:
Group state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354
Local state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433353
130604 8:07:10 [Note] WSREP: New cluster view: global state: e38d9bfc-7d4e-11e2-0800-9446319d18cf:433354, view# 361: Primary, number of nodes: 3, my index: 2, protocol version 2
130604 8:07:10 [Warning] WSREP: Gap in state sequence. Need state transfer.
130604 8:07:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:16 [Note] WSREP: You have configured 'rsync' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
130604 8:07:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 8:07:16 [Note] WSREP: Assign initial position for certification: 433354, protocol version: 2
130604 8:07:16 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.58.249.142:4568
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.3:4567 tcp://192.168.0.2:4567
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:17 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:20 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:20 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 11f53dbb-c971-11e2-0800-987abecafd36 (tcp://192.168.0.3:4567), attempt 0
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') reconnecting to 0ccb04a4-c970-11e2-0800-838856078bd7 (tcp://192.168.0.2:4567), attempt 0
130604 8:07:21 [Note] WSREP: (d44ec52a-cca5-11e2-0800-6a48c84c66c6, 'tcp://0.0.0.0:4567') turning message relay requesting off
130604 8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 0ccb04a4-c970-11e2-0800-838856078bd7
130604 8:07:22 [Note] WSREP: evs::proto(d44ec52a-cca5-11e2-0800-6a48c84c66c6, OPERATIONAL, view_id(REG,0ccb04a4-c970-11e2-0800-838856078bd7,479)) suspecting node: 11f53dbb-c971-11e2-0800-987abecafd36
130604 8:07:23 [Note] WSREP: view(view_id(NON_PRIM,0ccb04a4-c970-11e2-0800-838856078bd7,479) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:23 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 433354)
130604 8:07:23 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_state_request():950: Requesting state transfer while in NON-PRIMARY. Ignoring.
130604 8:07:23 [ERROR] WSREP: Requesting state transfer failed: -125(Operation canceled)
130604 8:07:23 [ERROR] WSREP: State transfer request failed unrecoverably: 125 (Operation canceled). Most likely it is due to inability to communicate with the cluster primary$
130604 8:07:23 [Note] WSREP: Closing send monitor...
130604 8:07:23 [Note] WSREP: Closed send monitor.
130604 8:07:23 [Note] WSREP: gcomm: terminating thread
130604 8:07:23 [Note] WSREP: gcomm: joining thread
130604 8:07:23 [Note] WSREP: view(view_id(NON_PRIM,d44ec52a-cca5-11e2-0800-6a48c84c66c6,480) memb {
d44ec52a-cca5-11e2-0800-6a48c84c66c6,
} joined {
} left {
} partitioned {
0ccb04a4-c970-11e2-0800-838856078bd7,
11f53dbb-c971-11e2-0800-987abecafd36,
})
130604 8:07:23 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130604 8:07:23 [Note] WSREP: gcomm: closing backend
130604 8:07:23 [Note] WSREP: view((empty))
130604 8:07:23 [Note] WSREP: gcomm: closed
130604 8:07:23 [Note] WSREP: Flow-control interval: [16, 16]
130604 8:07:23 [Note] WSREP: Received NON-PRIMARY.
130604 8:07:23 [Note] WSREP: Received self-leave message.
130604 8:07:23 [Note] WSREP: Flow-control interval: [0, 0]
130604 8:07:23 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130604 8:07:23 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 433354)
130604 8:07:23 [Note] WSREP: RECV thread exiting 0: Success
130604 8:07:23 [Note] WSREP: recv_thread() joined.
130604 8:07:23 [Note] WSREP: Closing slave action queue.
130604 8:07:23 [Note] WSREP: /usr/sbin/mysqld: Terminated.
If i restart the node, it re-joins the cluster fine and runs without errors until the next crash at approximately the same amount of time lapse from the restart as the last crash (hope that makes sense!).
Any help or suggestions as to why this is occurring appreciated.
Eli