XtraDB Cluster crashing problem

We converted an existing single-point-of-failure MySQL server to a Percona XtraDB Cluster last Wednesday and have been having issues since.

Description:
Under normal load everything works completely fine.

On Saturday, during the nightly maintenance jobs, the nodes became unresponsive, heavily loaded, and/or failed outright.
Running unclustered, a single node easily handles all incoming requests and sits at about 0.1 load most of the time.
The nightly maintenance does spike CPU and I/O load from roughly 3:00 AM to 3:40 AM.
However, the load does not appear excessive enough to bring down both nodes of the cluster.
I am struggling to understand why a single unclustered MySQL server handles it fine, while the clustered setup crashes both nodes completely.
I can completely understand the cluster using more resources and taking longer to finish its tasks, but not crashing.

Question: When XtraDB Cluster starts generating "gcache.page" files, does that mean it isn't keeping up with demand?
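From what I can tell from the Galera documentation, and this is my reading rather than a confirmed answer: gcache.page files are created when writesets overflow the in-memory gcache ring buffer (gcache.size, 128M by default), so they show a replication burst being buffered to disk rather than proving a node has fallen behind. To measure whether a node really can't keep up, I believe these standard wsrep status counters are the ones to watch during the maintenance window:

-- Run on each node while the nightly jobs are active.
-- wsrep_local_recv_queue: writesets received but not yet applied;
--   a persistently large value means the node cannot apply fast enough.
-- wsrep_flow_control_paused: fraction of time replication was paused
--   by flow control since the last FLUSH STATUS (near 1.0 = stalled).
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue';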

Information:
The cluster holds around 250 GB of actual MySQL data.

Each of the two nodes has dual quad-core Xeon E5520 CPUs, 16 GB RAM, a 74 GB RAID1 SAS array for the OS, and a 600 GB RAID10 15K SAS array just for MySQL data.

Both nodes run Ubuntu 12.04 with Percona XtraDB Cluster 5.5.28-23.7-369.precise.

Yes, I know I should add a third node; it has been ordered but hasn't arrived yet. And yes, I should have waited for it to arrive before converting the existing MySQL server to a Percona cluster.

Config:

# Generated by Percona Configuration Wizard version REL5-20120208

# Configuration name mysqlc03 generated at 2012-12-27 14:04:51

[mysql]

# CLIENT #

port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
wsrep_urls=gcomm://192.168.252.107:4567,gcomm://

[mysqld]

# GENERAL #

user = mysql
default_storage_engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid_file = /var/run/mysqld/mysqld.pid

# MyISAM #

key_buffer_size = 32M
myisam_recover = FORCE,BACKUP

# SAFETY #

max_allowed_packet = 16M
max_connect_errors = 1000000
skip_name_resolve
#sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
#sysdate_is_now = 1
#innodb = FORCE
#innodb_strict_mode = 1

# DATA STORAGE #

datadir = /var/lib/mysql/

# BINARY LOGGING #

log_bin = /var/lib/mysql/mysql-bin
expire_logs_days = 14
sync_binlog = 1

# CACHES AND LIMITS #

tmp_table_size = 32M
max_heap_table_size = 32M
query_cache_type = 0
query_cache_size = 0
max_connections = 3000
thread_cache_size = 100
open_files_limit = 65535
table_definition_cache = 1024
table_open_cache = 2048

# INNODB #

innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 14G # total system memory - ~2G for system processes if system is dedicated to mysql

# LOGGING #

log_error = /var/log/mysql/mysql-error.log
log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log

# XTRADB CLUSTERING #

#wsrep_cluster_address=gcomm:// # on first node
#wsrep_cluster_address=gcomm://192.168.252.107,192.168.252.109, # on each additional node
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options = "gmcast.listen_addr=tcp://192.168.252.108;"
wsrep_sst_receive_address=192.168.252.108
wsrep_node_incoming_address=192.168.252.108
wsrep_node_name=node2 # give each node a unique name
wsrep_slave_threads=16 # set to the number of cpu cores
wsrep_sst_method=xtrabackup
wsrep_sst_auth=
wsrep_cluster_name=mysqlc03
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
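If the disk spill itself turns out to matter, my understanding is the in-memory writeset cache can be enlarged through the provider options. gcache.size is a documented Galera parameter, but the 1G value below is only a guess for this workload:

wsrep_provider_options = "gmcast.listen_addr=tcp://192.168.252.108; gcache.size=1G"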

Nightly procedure log:
130104 3:01:52 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 162693128 bytes
130104 3:01:59 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
… <Every day between 3:00 AM and roughly 3:40 AM it generates these gcache.page files during the nightly stored-procedure jobs>
130106 3:29:33 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000011 of size 372497251 bytes
130106 3:29:45 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011

Sunday crash log:
Node1:
130106 3:29:33 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000011 of size 372497251 bytes
130106 3:29:45 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011
130106 3:40:54 [Warning] Too many connections
130106 3:40:56 [Warning] Too many connections

130106 4:06:16 [Warning] Too many connections
130106 4:06:17 [Note] WSREP: Read nil XID from storage engines, skipping position init
130106 4:06:17 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
130106 4:06:17 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
130106 4:06:17 [Note] WSREP: Found saved state: 510f28eb-5582-11e2-0800-414351a60eb6:-1
130106 4:06:17 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
<Node 1 appears to start a second instance of mysqld for some reason>
130106 4:06:18 InnoDB: Initializing buffer pool, size = 6.0G
130106 4:06:18 [Warning] Too many connections
130106 4:06:18 InnoDB: Completed initialization of buffer pool
InnoDB: Unable to lock ./ibdata1, error: 11
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.
130106 4:06:18 InnoDB: Retrying to lock the first data file
130106 4:06:18 [Warning] Too many connections
<It generates repeated "unable to lock ./ibdata1" errors at the same time as "Too many connections" warnings from the first mysqld instance.
However, it is unresponsive to queries.>
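MySQL reserves one extra connection for a SUPER user, so next time this happens I should still be able to log in and see what the piled-up connections are waiting on. These are standard commands, nothing specific to this setup:

SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW FULL PROCESSLIST;  -- look for queries stuck in a wsrep pre-commit/flow-control state
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';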

Node2:
130106 3:43:40 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011
130106 3:43:41 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000012 of size 312412590 bytes
130106 3:58:15 [Warning] Too many connections
130106 3:58:17 [Warning] Too many connections

130106 4:11:10 [Warning] Too many connections
130106 4:11:11 [Warning] Too many connections
130106 4:11:11 [Note] /usr/sbin/mysqld: Normal shutdown
<At this point I shut down this node, as it is unresponsive to all queries and just dumping "Too many connections" to the logs.
The load is incoming insert writes from a web application, on the order of 15 inserts per second.
The maintenance procedures purge old entries and generate a few reports.>
<I killed both mysqld processes, restarted both cluster nodes, and resynced them by deleting
/var/lib/mysql/grastate.dat on node2 to trigger a full SST, roughly as sketched below.>
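Roughly the procedure I used, with Ubuntu's init scripts and the paths from the config above:

# on node2, with node1 already running as the donor
service mysql stop
rm /var/lib/mysql/grastate.dat   # discarding the saved state forces a full SST on rejoin
service mysql start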

Monday crash log:
Node1:
130107 3:04:58 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 419565861 bytes
130107 3:05:18 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000

130107 3:43:12 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000005 of size 564372059 bytes
130107 3:43:37 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000005
130107 3:45:07 [Note] WSREP: (19068069-57ef-11e2-0800-1a8ff159f030, 'tcp://192.168.252.107:4567') turning message relay requesting on, nonlive peers: tcp://192.168.252.108:4567
130107 3:45:08 [Note] WSREP: (19068069-57ef-11e2-0800-1a8ff159f030, 'tcp://192.168.252.107:4567') reconnecting to 724621c5-57ea-11e2-0800-3eac77a6d452 (tcp://192.168.252.108:4567), attempt 0
130107 3:45:09 [Note] WSREP: evs::proto(19068069-57ef-11e2-0800-1a8ff159f030, OPERATIONAL, view_id(REG,19068069-57ef-11e2-0800-1a8ff159f030,2)) suspecting node: 724621c5-57ea-11e2-0800-3eac77a6d452

Node2:
130107 3:04:58 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 419565861 bytes
130107 3:11:42 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000

130107 3:43:12 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000005 of size 564372059 bytes
Killed
130107 03:45:07 mysqld_safe Number of processes running now: 0
130107 03:45:07 mysqld_safe WSREP: not restarting wsrep node automatically
130107 03:45:07 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Reading through the Percona XtraDB Cluster limitations page again, I see:

"DELETE operation is unsupported on tables without primary key. Also rows in tables without primary key may appear in different order on different nodes. As a result SELECT…LIMIT… may return slightly different sets."

I went and looked, and quite a few large tables have no keys or indexes at all. Do you think this could actually cause the behavior I was seeing?
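For reference, this is roughly the query I used to spot InnoDB tables without a primary key. It only touches standard information_schema tables; adjust the excluded schemas as needed:

SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema = t.table_schema
      AND c.table_name = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.engine = 'InnoDB'
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
  AND c.constraint_name IS NULL;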