I just had a look at the MySQL error log.
There were a lot of these before the cluster crashed and I switched to bootstrap mode:
130926 20:43:18 [Warning] Too many connections
---- I have since substantially improved front-end caching, so the databases should now receive less than half the traffic.
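To see whether the caching change actually reduces these warnings, something like the following could count them per hour from the error log. This is only a rough sketch: the function name is made up, the regular expression assumes the old-style "YYMMDD H:MM:SS [Level]" prefix seen in the excerpts here, and the second warning line in the sample is invented for illustration.

```python
import re
from collections import Counter

# Old-style MySQL error-log prefix, e.g. "130926 20:43:18 [Warning] ..."
LINE_RE = re.compile(r"^(\d{6})\s+(\d{1,2}):(\d{2}):(\d{2})\s+\[(\w+)\]\s+(.*)$")


def count_too_many_connections(lines):
    """Count 'Too many connections' warnings per (date, hour) bucket."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines (quorum blocks etc.) have no prefix
        date, hour, _minute, _second, level, message = m.groups()
        if level == "Warning" and message.startswith("Too many connections"):
            counts[(date, int(hour))] += 1
    return counts


sample = [
    "130926 20:43:18 [Warning] Too many connections",
    "130926 20:43:19 [Warning] Too many connections",  # invented for illustration
    "130928 7:07:43 [Note] WSREP: declaring 240fee80-1969-11e3-b3f3-0b9e4ca22bc1 stable",
]
print(dict(count_too_many_connections(sample)))
```

In practice the lines would come from the real log file (for example `open(path)`), and the per-hour counts before and after the caching change could be compared.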
On Saturday morning the cluster became unstable and I got a lot of these:
130928 7:07:40 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
…
130928 7:07:40 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.0.0.5:4567
130928 7:07:41 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) reconnecting to 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567), attempt 0
130928 7:07:42 [Note] WSREP: evs::proto(4864b4ca-1969-11e3-91f6-efdb7950b797, GATHER, view_id(REG,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,260)) suspecting node: 36f066e7-1969-11e3-b3e4-bfffdefe859f
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: declaring 240fee80-1969-11e3-b3f3-0b9e4ca22bc1 stable
130928 7:07:43 [Note] WSREP: Node 240fee80-1969-11e3-b3f3-0b9e4ca22bc1 state prim
130928 7:07:43 [Note] WSREP: view(view_id(PRIM,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,261) memb {
240fee80-1969-11e3-b3f3-0b9e4ca22bc1,
4864b4ca-1969-11e3-91f6-efdb7950b797,
} joined {
} left {
} partitioned {
36f066e7-1969-11e3-b3e4-bfffdefe859f,
})
130928 7:07:43 [Note] WSREP: forgetting 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:43 [Note] WSREP: deleting entry tcp://10.0.0.5:4567
130928 7:07:43 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
130928 7:07:43 [Note] WSREP: forgetting 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://197.242.148.230:4567)
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130928 7:07:43 [Note] WSREP: deleting entry tcp://197.242.148.230:4567
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
130928 7:07:43 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: sent state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: got state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b from 0 (###5)
130928 7:07:43 [Note] WSREP: STATE EXCHANGE: got state msg: e838de4f-27fb-11e3-8c27-7ea8d538500b from 1 (###7)
130928 7:07:43 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 219,
members = 2/2 (joined/total),
act_id = 5954217,
last_appl. = 5953977,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 73cc2dce-0189-11e3-be43-e2d7eeaef85e
130928 7:07:43 [Note] WSREP: Flow-control interval: [23, 23]
130928 7:07:43 [Note] WSREP: New cluster view: global state: 73cc2dce-0189-11e3-be43-e2d7eeaef85e:5954217, view# 220: Primary, number of nodes: 2, my index: 1, protocol version 2
130928 7:07:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130928 7:07:43 [Note] WSREP: Assign initial position for certification: 5954217, protocol version: 2
130928 7:07:45 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:45 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:46 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:46 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:48 [Warning] WSREP: discarding established (time wait) 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:48 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:49 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
130928 7:07:49 [Note] WSREP: cleaning up 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://10.0.0.5:4567)
130928 7:07:49 [Note] WSREP: cleaning up 36f066e7-1969-11e3-b3e4-bfffdefe859f (tcp://197.242.148.230:4567)
130928 7:07:49 [Warning] WSREP: evs::proto(4864b4ca-1969-11e3-91f6-efdb7950b797, GATHER, view_id(REG,240fee80-1969-11e3-b3f3-0b9e4ca22bc1,261)) source 36f066e7-1969-11e3-b3e4-bfffdefe859f is not supposed to be representative
130928 7:07:50 [Note] WSREP: (4864b4ca-1969-11e3-91f6-efdb7950b797, ‘tcp://0.0.0.0:4567’) address ‘tcp://10.0.0.8:4567’ pointing to uuid 4864b4ca-1969-11e3-91f6-efdb7950b797 is blacklisted, skipping
…etc…
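As I read the excerpt above, node 36f066e7 (10.0.0.5) was partitioned away and the remaining two nodes stayed primary ("primary = yes, memb_num = 2"). That matches the usual majority rule: a component keeps primary status only if it holds more than half of the previous primary component. A minimal sketch of that arithmetic, assuming all nodes carry equal weight (the function name is hypothetical, not a Galera API):

```python
def has_quorum(members_now, members_before):
    """True if the surviving component is a strict majority of the
    previous primary component (equal node weights assumed)."""
    return 2 * members_now > members_before


# 2 of 3 nodes remain, as in view 261 above: still primary.
print(has_quorum(2, 3))
# A lone node out of 3 would lose quorum.
print(has_quorum(1, 3))
# An even 1/1 split of a 2-node cluster leaves neither side with quorum.
print(has_quorum(1, 2))
```

This also means that once the cluster is down to two nodes, losing one more leaves the survivor non-primary.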
---- and a bit later this:
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: sent state msg: cd389623-27fd-11e3-8df6-6327f157c0f2
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 0 (###5)
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 1 (###4)
130928 7:21:17 [Note] WSREP: STATE EXCHANGE: got state msg: cd389623-27fd-11e3-8df6-6327f157c0f2 from 2 (###7)
130928 7:21:17 [Warning] WSREP: Quorum: No node with complete state:
Version : 2
Flags : 3
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###5’
Incoming addr: ‘10.0.0.6:3306’
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###4’
Incoming addr: ‘10.0.0.5:3306’
Version : 2
Flags : 2
Protocols : 0 / 4 / 2
State : NON-PRIMARY
Prim state : SYNCED
Prim UUID : b9d06527-27fd-11e3-a1dd-36ebc5d2f591
Prim seqno : 223
Last seqno : 5954716
Prim JOINED : 3
State UUID : cd389623-27fd-11e3-8df6-6327f157c0f2
Group UUID : 73cc2dce-0189-11e3-be43-e2d7eeaef85e
Name : ‘###7’
Incoming addr: ‘10.0.0.8:3306’
130928 7:21:17 [Note] WSREP: Full re-merge of primary b9d06527-27fd-11e3-a1dd-36ebc5d2f591 found: 3 of 3.
130928 7:21:17 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 223,
members = 3/3 (joined/total),
act_id = 5954716,
---- and eventually the cluster froze. I had to restart one node in bootstrap mode, and that is how it is still running.
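When picking a node to bootstrap from, the usual advice is to choose the one with the highest last seqno. A rough sketch for pulling that out of the per-node state blocks in the "No node with complete state" warning is below. It assumes the input holds only the "Key : value" lines of those blocks (no timestamped lines) and that each block starts with "Version"; the function name is made up and the sample is abbreviated from the log above.

```python
def parse_node_states(lines):
    """Split 'Key : value' lines into one dict per node block.
    A new block is assumed to start at each 'Version' line."""
    nodes, current = [], None
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        value = value.strip().strip("‘’'")  # drop the quotes around names
        if key == "Version":
            current = {}
            nodes.append(current)
        if current is not None:
            current[key] = value
    return nodes


sample = [
    "Version : 2", "Last seqno : 5954716", "Name : ‘###5’",
    "Version : 2", "Last seqno : 5954716", "Name : ‘###4’",
    "Version : 2", "Last seqno : 5954716", "Name : ‘###7’",
]
seqnos = {n["Name"]: int(n["Last seqno"]) for n in parse_node_states(sample)}
print(seqnos)
```

In the 7:21:17 excerpt all three nodes report the same last seqno, so at that point any of them looked equally up to date.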