We deployed a three-node cluster using PXC last year for one of our clients. It has been working nicely, but we have had some random crashes affecting the entire cluster. It happened again yesterday: the complete cluster shut down without warning.
The MySQL logs from the first node are attached (galera_logs_1.txt).
There seem to be communication problems between the nodes, as suggested by the first line of the log: turning message relay requesting on, nonlive peers.
I’m not sure what the root cause could be: is this network related or load related (load average, SQL traffic, number of requests)? Are there parameters to adjust?
After that, I tried bootstrapping the cluster but hit another shutdown that I don’t understand. I bootstrapped the first node, then restarted the second, which initiated an SST.
After the second node was up and running (WSREP state Synced), I restarted the third node, and the two other nodes stopped immediately.
The messages from the error log are attached (galera_logs_2.txt).
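To make the restart sequence above more concrete, here is a minimal sketch of a check that could be run against a node before starting the next one. The status names (wsrep_cluster_status, wsrep_local_state_comment, wsrep_ready, wsrep_cluster_size) are standard Galera status variables; the helper function itself and the sample values are hypothetical and only illustrate the idea.

```python
# Hypothetical helper: the variable names are standard Galera wsrep status
# variables, but this function and the sample values are illustrative only.

def safe_to_start_next_node(status, expected_size):
    """Return True when a running node looks healthy enough for another node to join.

    `status` maps variable names to string values, e.g. the rows returned by
    SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
        and int(status.get("wsrep_cluster_size", "0")) == expected_size
    )


# Assumed state of the second node after its SST has finished,
# just before the third node is restarted.
node2 = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_cluster_size": "2",
}
print(safe_to_start_next_node(node2, expected_size=2))
```

A check like this only confirms what the running nodes report about themselves; it says nothing about the state stored on the node that is about to join.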
It’s not the first time I have had to reset a PXC cluster like this, but I don’t understand why the last node caused this situation.
Am I missing something?
For information, we are running Debian with the following packages:
ii percona-xtradb-cluster-56 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster with Galera
ii percona-xtradb-cluster-client-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3 3.14-1.wheezy amd64 Metapackage for latest version of galera3.
ii percona-xtradb-cluster-galera-3.x 3.14-1.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database server binaries
I guess upgrading these versions is a must here.
Configuration file:
[mysqld]
# Cluster configuration
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_forced_binlog_format = ROW
wsrep_cluster_address = gcomm://10.16.0.92,10.16.0.93,10.16.0.94
wsrep_slave_threads = 64
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = XXXX:XXXX
wsrep_cluster_name = galera
wsrep_node_name = client
wsrep_node_address = 10.16.0.92
wsrep_causal_reads = OFF
wsrep_provider_options = "gcache.size = 50G; gcs.fc_limit = 64"
wsrep_retry_autocommit = 1
wsrep_debug = 0
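Since wsrep_provider_options packs several Galera settings into one string, a small sketch like the following may help when comparing the value between nodes. The parse_provider_options function is a hypothetical helper, not part of PXC or Galera; it only assumes the "name = value; name = value" layout used in the configuration above.

```python
def parse_provider_options(raw):
    """Split a wsrep_provider_options string into a {name: value} dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';' or empty segments
        name, _, value = item.partition("=")
        options[name.strip()] = value.strip()
    return options


# The value taken from the configuration file above.
opts = parse_provider_options("gcache.size = 50G; gcs.fc_limit = 64")
print(opts)
```

Values are kept as strings, so a size such as 50G is not converted to bytes.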
Thanks for any information about this case.
galera_logs_1.txt (10.5 KB)
galera_logs_2.txt (4.67 KB)