XtraDB Cluster crashing problem

We converted an existing single-point-of-failure MySQL server to a Percona XtraDB Cluster last Wednesday and have been having issues ever since.

During normal load everything works completely fine.

On Saturday, during the nightly maintenance jobs, the nodes started to become unresponsive, heavily loaded, and/or crashed.
Running unclustered, the same node easily handles all incoming requests and sits at about 0.1 load average most of the time.
The nightly maintenance does spike CPU and I/O load from 3:00 AM to roughly 3:40 AM.
However, the load does not appear excessive enough to bring down both nodes of the cluster.
I am struggling to understand why a single unclustered MySQL server handles it fine, but a clustered MySQL crashes two nodes completely.
I can completely understand it using more resources and taking longer to do its tasks, but not crashing.

Question: When XtraDB Cluster starts generating "gcache.page" files, does that mean it isn't keeping up with demand?
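My rough understanding (please correct me if I'm wrong) is that Galera keeps recent write-sets in a fixed-size ring buffer, the gcache (default 128M), and spills to gcache.page.* files when that overflows. If the answer is simply "make the ring buffer bigger", I assume the change would look something like this in my.cnf — the 512M value is just a guess on my part, not a tested recommendation:

```ini
# Sketch only, not tested: enlarge the Galera write-set cache so the
# nightly spike stays in the ring buffer instead of spilling to
# gcache.page files. 512M is an illustrative value.
wsrep_provider_options = "gcache.size=512M; gmcast.listen_addr=tcp://;"
```

Is that the right knob, or is the paging itself harmless?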

The cluster has around 250GB of actual MySQL data.

Each of the two nodes is a dual quad-core Xeon E5520 with 16GB of RAM, 74GB of RAID1 SAS for the OS, and 600GB of RAID10 15K SAS just for the MySQL data.

Both nodes run Ubuntu 12.04 with Percona XtraDB Cluster 5.5.28-23.7-369.precise.

Yes, I know I should add a third node; it has been ordered but hasn't arrived yet. And yes, I should have waited until the third node arrived before converting the existing MySQL server to a Percona cluster.


# Generated by Percona Configuration Wizard version REL5-20120208
# Configuration name mysqlc03 generated at 2012-12-27 14:04:51

[mysql]

port = 3306
socket = /var/run/mysqld/mysqld.sock




[mysqld]

user = mysql
default_storage_engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid_file = /var/run/mysqld/mysqld.pid


key_buffer_size = 32M
myisam_recover = FORCE,BACKUP


max_allowed_packet = 16M
max_connect_errors = 1000000
#sysdate_is_now = 1
#innodb = FORCE
#innodb_strict_mode = 1


datadir = /var/lib/mysql/


log_bin = /var/lib/mysql/mysql-bin
expire_logs_days = 14
sync_binlog = 1


tmp_table_size = 32M
max_heap_table_size = 32M
query_cache_type = 0
query_cache_size = 0
max_connections = 3000
thread_cache_size = 100
open_files_limit = 65535
table_definition_cache = 1024
table_open_cache = 2048


innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 14G # total system memory - ~2G for system processes if system is dedicated to mysql


log_error = /var/log/mysql/mysql-error.log
log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log


#wsrep_cluster_address=gcomm:// # on first node
#wsrep_cluster_address=gcomm://,, # on each additional node
wsrep_provider_options = "gmcast.listen_addr=tcp://;"
wsrep_node_name=node2 # give each node a unique name
wsrep_slave_threads=16 # set to the number of cpu cores

Nightly Procedure log:
130104 3:01:52 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 162693128 bytes
130104 3:01:59 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
… <every day between 3:00 AM and ~3:40 AM it generates these gcache.page files during the nightly stored procedure jobs>
130106 3:29:33 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000011 of size 372497251 bytes
130106 3:29:45 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011

Sunday crash log:
130106 3:29:33 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000011 of size 372497251 bytes
130106 3:29:45 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011
130106 3:40:54 [Warning] Too many connections
130106 3:40:56 [Warning] Too many connections

130106 4:06:16 [Warning] Too many connections
130106 4:06:17 [Note] WSREP: Read nil XID from storage engines, skipping position init
130106 4:06:17 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
130106 4:06:17 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
130106 4:06:17 [Note] WSREP: Found saved state: 510f28eb-5582-11e2-0800-414351a60eb6:-1
130106 4:06:17 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
<Node 1 appears to start a second instance of mysqld for some reason>
130106 4:06:18 InnoDB: Initializing buffer pool, size = 6.0G
130106 4:06:18 [Warning] Too many connections
130106 4:06:18 InnoDB: Completed initialization of buffer pool
InnoDB: Unable to lock ./ibdata1, error: 11
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.
130106 4:06:18 InnoDB: Retrying to lock the first data file
130106 4:06:18 [Warning] Too many connections
<It generates repeated "unable to lock ./ibdata1" errors at the same time as "Too many connections" warnings from the first mysqld instance;
however, it is unresponsive to queries.>

130106 3:43:40 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000011
130106 3:43:41 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000012 of size 312412590 bytes
130106 3:58:15 [Warning] Too many connections
130106 3:58:17 [Warning] Too many connections

130106 4:11:10 [Warning] Too many connections
130106 4:11:11 [Warning] Too many connections
130106 4:11:11 [Note] /usr/sbin/mysqld: Normal shutdown
<At this point I shut this node down, as it is unresponsive to all queries and just dumps "Too many connections" to the logs.
The load is incoming insert writes from a web application, on the order of 15 inserts per second.
The maintenance procedures purge old entries and generate a few reports.>
<I killed both mysql processes, restarted both cluster nodes, and resynced them by deleting
/var/lib/mysql/grastate.dat on node2 to trigger an SST.>

Monday crash log:
130107 3:04:58 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 419565861 bytes
130107 3:05:18 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000

130107 3:43:12 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000005 of size 564372059 bytes
130107 3:43:37 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000005
130107 3:45:07 [Note] WSREP: (19068069-57ef-11e2-0800-1a8ff159f030, 'tcp://') turning message relay requesting on, nonlive peers: tcp://
130107 3:45:08 [Note] WSREP: (19068069-57ef-11e2-0800-1a8ff159f030, 'tcp://') reconnecting to 724621c5-57ea-11e2-0800-3eac77a6d452 (tcp://, attempt 0
130107 3:45:09 [Note] WSREP: evs::proto(19068069-57ef-11e2-0800-1a8ff159f030, OPERATIONAL, view_id(REG,19068069-57ef-11e2-0800-1a8ff159f030,2)) suspecting node: 724621c5-57ea-11e2-0800-3eac77a6d452

130107 3:04:58 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 419565861 bytes
130107 3:11:42 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000

130107 3:43:12 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000005 of size 564372059 bytes
130107 03:45:07 mysqld_safe Number of processes running now: 0
130107 03:45:07 mysqld_safe WSREP: not restarting wsrep node automatically
130107 03:45:07 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Reading through the Percona XtraDB Cluster Limitations page again, I see:

“DELETE operation is unsupported on tables without primary key. Also rows in tables without primary key may appear in different order on different nodes. As a result SELECT…LIMIT… may return slightly different sets.”

I went and looked, and quite a few large tables have no primary keys or indexes at all. Do you think this could actually cause the behavior I was seeing?
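For reference, this is roughly the query I used to list the tables with no primary key, via information_schema (the excluded schemas are just the system ones; adjust as needed):

```sql
-- List base tables that have no PRIMARY KEY constraint
-- (illustrative query; adjust the excluded schemas as needed).
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON  c.table_schema    = t.table_schema
       AND c.table_name      = t.table_name
       AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
```

If row-level replication really needs a primary key to apply DELETEs, I can add surrogate keys to these tables before the next maintenance window.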