I have a 3-node cluster that has been running for more than a year without major problems. However, about a month ago the cluster began to crash, and I could not find any reason.
Node 2 crashes, then a few minutes later another node (1 or 3) crashes too, and with just one node online the cluster becomes inactive.
Before node 2 crashes, it logs the message “[ERROR] /usr/sbin/mysqld: Table ‘./mysql/proc’ is marked as crashed and should be repaired”. However, there were some cases where it just crashed without any error message. There is no problem with the table mysql/proc: after restarting the node I ran CHECK TABLE proc; and no error was reported.
The last time node 2 crashed, I got the following messages from the log:
2016-02-16 09:07:29 11404 [ERROR] Can't create thread to handle request (errno= 11)
2016-02-16 09:07:43 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.60651S), skipping check
2016-02-16 09:07:47 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.80871S), skipping check
2016-02-16 09:07:50 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.75565S), skipping check
Killed
160216 09:08:00 mysqld_safe Number of processes running now: 0
160216 09:08:00 mysqld_safe WSREP: not restarting wsrep node automatically
160216 09:08:00 mysqld_safe mysqld from pid file /var/lib/mysql/mysql.pid ended
These messages do not tell me much, so I would like to know whether there is some technique to debug the server and identify the cause of the problem.
Below is the log file of node 1, which crashed a few minutes after this last node 2 crash.
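As a starting point for narrowing this down, here is a minimal sketch (assuming Python 3 and the log format shown in this post) that collects the delays reported by the "last inactive check" warnings and the lines that immediately precede each bare "Killed" line. The function name summarize_log and the context parameter are hypothetical, chosen only for illustration. A bare "Killed" with no mysqld shutdown message usually means the process received SIGKILL, and errno 11 on Linux is EAGAIN ("Resource temporarily unavailable"), so this may point to memory or process/thread limits on the host, but that is an assumption the kernel log would have to confirm.

```python
import re

# Matches Galera warnings such as:
#   "... [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.60651S), skipping check"
# and captures the observed delay in seconds (the value in parentheses).
INACTIVE_RE = re.compile(r"last inactive check more than PT[\d.]+S ago \(PT([\d.]+)S\)")


def summarize_log(lines, context=3):
    """Scan error-log lines.

    Returns (delays, kills):
      delays - list of observed inactive-check delays, in seconds
      kills  - for each bare "Killed" line, the `context` lines logged just before it
    """
    delays = []
    kills = []
    recent = []  # lines seen so far, used to recover context before "Killed"
    for raw in lines:
        line = raw.rstrip("\n")
        match = INACTIVE_RE.search(line)
        if match:
            delays.append(float(match.group(1)))
        if line.strip() == "Killed":
            kills.append(list(recent[-context:]))
        recent.append(line)
    return delays, kills


if __name__ == "__main__":
    # Lines copied from the node 2 excerpt in this post.
    sample = [
        "2016-02-16 09:07:43 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.60651S), skipping check",
        "2016-02-16 09:07:47 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.80871S), skipping check",
        "2016-02-16 09:07:50 11404 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.75565S), skipping check",
        "Killed",
    ]
    delays, kills = summarize_log(sample)
    print("max delay (s):", max(delays))
    for before in kills:
        print("lines before Killed:")
        for entry in before:
            print("  ", entry)
```

Running the same function over the full error log of each node (for example by passing an open file object) would show whether the delays grow before every crash and what was logged last, which can then be compared against the system logs for the same timestamps.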
2016-02-16 09:11:04 4065 [Note] WSREP: save pc into disk
2016-02-16 09:11:04 4065 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2016-02-16 09:11:04 4065 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 3477dc6f-d48d-11e5-85cf-5add629b6366
2016-02-16 09:11:04 4065 [Note] WSREP: STATE EXCHANGE: sent state msg: 3477dc6f-d48d-11e5-85cf-5add629b6366
2016-02-16 09:11:04 4065 [Note] WSREP: STATE EXCHANGE: got state msg: 3477dc6f-d48d-11e5-85cf-5add629b6366 from 0 (dbnode1)
2016-02-16 09:11:04 4065 [Note] WSREP: STATE EXCHANGE: got state msg: 3477dc6f-d48d-11e5-85cf-5add629b6366 from 1 (dbnode3)
2016-02-16 09:11:04 4065 [Warning] WSREP: Quorum: No node with complete state:
Version : 3
Flags : 0x3
Protocols : 0 / 6 / 3
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 1895c426-d48d-11e5-b4ce-06e66b576507
Prim seqno : 47
First seqno : 500685788
Last seqno : 500808605
Prim JOINED : 2
State UUID : 3477dc6f-d48d-11e5-85cf-5add629b6366
Group UUID : 01e3cc97-8eca-11e4-8ac5-5625a2097df9
Name : 'dbnode1'
Incoming addr: 'x.x.x.7:3306'
Version : 3
Flags : 0x2
Protocols : 0 / 6 / 3
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : 1895c426-d48d-11e5-b4ce-06e66b576507
Prim seqno : 47
First seqno : 500808520
Last seqno : 500808605
Prim JOINED : 2
State UUID : 3477dc6f-d48d-11e5-85cf-5add629b6366
Group UUID : 01e3cc97-8eca-11e4-8ac5-5625a2097df9
Name : 'dbnode3'
Incoming addr: 'x.x.x.4:3306'
2016-02-16 09:11:04 4065 [Note] WSREP: Full re-merge of primary 1895c426-d48d-11e5-b4ce-06e66b576507 found: 2 of 2.
2016-02-16 09:11:04 4065 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 47,
members = 2/2 (joined/total),
act_id = 500808605,
last_appl. = 500808527,
protocols = 0/6/3 (gcs/repl/appl),
group UUID = 01e3cc97-8eca-11e4-8ac5-5625a2097df9
2016-02-16 09:11:04 4065 [Note] WSREP: Flow-control interval: [23, 23]
2016-02-16 09:11:04 4065 [Note] WSREP: Restored state OPEN -> SYNCED (500808605)
2016-02-16 09:11:04 4065 [Note] WSREP: New cluster view: global state: 01e3cc97-8eca-11e4-8ac5-5625a2097df9:500808605, view# 48: Primary, number of nodes: 2, my index: 0, protocol version 3
2016-02-16 09:11:04 4065 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-02-16 09:11:04 4065 [Note] WSREP: REPL Protocols: 6 (3, 2)
2016-02-16 09:11:04 4065 [Note] WSREP: Service thread queue flushed.
2016-02-16 09:11:04 4065 [Note] WSREP: Assign initial position for certification: 500808605, protocol version: 3
2016-02-16 09:11:04 4065 [Note] WSREP: Service thread queue flushed.
2016-02-16 09:11:04 4065 [Note] WSREP: Synchronized with group, ready for connections
2016-02-16 09:11:04 4065 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-02-16 09:11:05 4065 [Warning] WSREP: last inactive check more than PT1.5S ago (PT4.28429S), skipping check
2016-02-16 09:11:09 4065 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.81872S), skipping check
Killed
160216 09:11:13 mysqld_safe Number of processes running now: 0
160216 09:11:13 mysqld_safe WSREP: not restarting wsrep node automatically
160216 09:11:13 mysqld_safe mysqld from pid file /var/lib/mysql/mysql.pid ended <<+++++++++++++++++++++++++++++++ DATABASE CRASH +++++++++++++++
160216 10:16:51 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/ << ============ MANUAL RESTART ==============
Thanks