Percona Cluster node goes down.

Hi there,

I have a PXC setup in an Amazon VPC. All three nodes are in the same region, but one of them is in a different availability zone. At some point, one node fails without any meaningful output in the logs:


2014-04-03 01:30:38 8514 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.68236S), skipping check
2014-04-03 01:30:40 8514 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.53889S), skipping check
140403 03:30:39 mysqld_safe Number of processes running now: 0
140403 03:30:39 mysqld_safe WSREP: not restarting wsrep node automatically
140403 03:30:39 mysqld_safe mysqld from pid file /var/lib/mysql/ip-10-1-7-180.pid ended

The node that fails is the one in the other availability zone.

This is my my.cnf file:


[mysqld]
datadir=/var/lib/mysql
user=mysql
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.1.7.180,10.1.8.159,10.1.8.16
binlog_format=ROW
default_storage_engine=InnoDB
innodb_locks_unsafe_for_binlog=1
innodb_buffer_pool_size = 5632M
innodb_log_buffer_size = 4M
max_connect_errors = 10000
key_buffer_size = 2048M
max_allowed_packet = 50M
table_open_cache = 1024
sort_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 80M
myisam_sort_buffer_size = 64M
thread_cache_size = 32
query_cache_size = 32M
innodb_thread_concurrency = 8
innodb_flush_method=O_DIRECT
innodb_log_file_size=1G
innodb_autoinc_lock_mode=2
wsrep_node_address=10.1.7.180
wsrep_sst_method=xtrabackup
wsrep_cluster_name=my_centos_cluster
wsrep_sst_auth="sstuser:s3cret"
max_connections = 4000
[mysql]
prompt=\\u@\\h [\\d]>\\_

My question is: how can I investigate the root cause of the failure? I also have a second question: what happens if an UPDATE query arrives at a node that is in the “Joining: receiving State Transfer” state?

Thank you in advance.

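A message like “140403 03:30:39 mysqld_safe Number of processes running now: 0” with nothing logged by mysqld before it means your mysqld process was killed from outside, most likely by the kernel OOM killer. Check the system log (for example /var/log/messages or the output of dmesg) for OOM killer entries around that time.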
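To make the system log check concrete, here is a minimal Python sketch that filters log lines for typical OOM killer messages mentioning mysqld. The function name find_oom_events, the regular expression, and the sample lines are my own illustration and are not taken from the poster's system; the exact kernel wording varies between kernel versions, so treat the pattern as a starting point.

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer acts.
# Exact wording differs between kernel versions, so this is a loose match.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory|Killed process \d+", re.IGNORECASE
)


def find_oom_events(lines, process="mysqld"):
    """Return the log lines that look like OOM killer activity and mention `process`."""
    hits = []
    for line in lines:
        if OOM_PATTERN.search(line) and process in line:
            hits.append(line.rstrip("\n"))
    return hits


# Made-up sample lines for illustration only (hypothetical host and PID).
SAMPLE_LOG = [
    "Apr  3 03:30:38 examplehost kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0",
    "Apr  3 03:30:38 examplehost kernel: Out of memory: Kill process 1234 (mysqld) score 905 or sacrifice child",
    "Apr  3 03:30:38 examplehost kernel: Killed process 1234 (mysqld) total-vm:8000000kB, anon-rss:7000000kB, file-rss:0kB",
    "Apr  3 03:31:01 examplehost CROND[2345]: (root) CMD (run-parts /etc/cron.hourly)",
]

if __name__ == "__main__":
    for event in find_oom_events(SAMPLE_LOG):
        print(event)
```

On a real node you would pass the lines of the system log instead of SAMPLE_LOG, for example by opening /var/log/messages (the path is an assumption and depends on the distribution). If nothing matches around the time of the crash, the OOM killer is probably not the cause.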
A joining node will refuse to accept connections until it has synchronized with the cluster.
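If you want your application or load balancer to avoid sending queries to such a node, one option is to look at the wsrep status variables first. The sketch below assumes you have already fetched the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%' into a dict by whatever client you use; the function name node_is_ready and the dict layout are hypothetical, and the check is deliberately conservative.

```python
def node_is_ready(status):
    """Decide whether a node should receive queries.

    `status` is assumed to be a dict built from the rows of
    SHOW GLOBAL STATUS LIKE 'wsrep_%' (variable name -> value).
    """
    # wsrep_ready is ON only when the node can accept queries.
    ready = status.get("wsrep_ready", "OFF").upper() == "ON"
    # Only a fully synced node passes; joiners and donors are excluded.
    synced = status.get("wsrep_local_state_comment", "") == "Synced"
    return ready and synced


# Example status dicts (illustrative values).
JOINING_NODE = {
    "wsrep_ready": "OFF",
    "wsrep_local_state_comment": "Joining: receiving State Transfer",
}
SYNCED_NODE = {
    "wsrep_ready": "ON",
    "wsrep_local_state_comment": "Synced",
}

if __name__ == "__main__":
    print("joining node ready:", node_is_ready(JOINING_NODE))
    print("synced node ready:", node_is_ready(SYNCED_NODE))
```

Requiring the "Synced" state also excludes a donor node, which is stricter than necessary for some SST methods; relax that condition if you want donors to keep serving traffic.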