Three-node cluster shutting down randomly

Arelag1 · June 8, 2017, 6:21am

Hello,

We deployed a three-node cluster using PXC last year for one of our clients. It has been working well, but we have had a few random crashes that affect the entire cluster. It happened again yesterday: the whole cluster shut down without warning.
The MySQL logs from the first node are attached (galera_logs_1.txt).

There seem to be communication problems between the nodes, as the first line of the log suggests: turning message relay requesting on, nonlive peers.
I’m not sure what the root cause is: could it be network related or load related (load average, SQL traffic, number of requests)? Are there any parameters to adjust?

After that, I tried bootstrapping the cluster but got another shutdown that I don’t understand: I bootstrapped the first node, then restarted the second, which initiated an SST.
Once the second node was up and running (WSREP state Synced), I restarted the third node, and the other two nodes stopped immediately.
The error log messages are attached (galera_logs_2.txt).
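
The restart sequence above can be gated on a node's wsrep status before the next node is started. This is a minimal Python sketch, assuming the rows from SHOW GLOBAL STATUS LIKE 'wsrep_%' have already been fetched into a dict. The helper name `ready_for_next_node` and its `expected_size` parameter are hypothetical and not part of PXC.

```python
def ready_for_next_node(status, expected_size):
    """Return True if this node looks healthy enough to start another node.

    `status` maps wsrep status variable names to their string values, as
    returned by SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
        and int(status.get("wsrep_cluster_size", "0")) == expected_size
    )


# Example values for a second node that has finished its SST in a
# two-member cluster (illustrative values, not taken from the attached logs).
status = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_cluster_size": "2",
}
print(ready_for_next_node(status, expected_size=2))
```

A check like this only confirms the state already described here (the second node was Synced). It does not by itself explain why the two running nodes stopped when the third one joined.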

It’s not the first time I have had to reset a PXC cluster like this, but I don’t understand why the last node caused this situation.
Am I missing something?

For reference, we are using Debian with the following packages:

ii percona-xtradb-cluster-56 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster with Galera
ii percona-xtradb-cluster-client-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3 3.14-1.wheezy amd64 Metapackage for latest version of galera3.
ii percona-xtradb-cluster-galera-3.x 3.14-1.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database server binaries

I guess upgrading these versions is a must here.

Configuration file:

[mysqld]

# Cluster configuration
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_forced_binlog_format = ROW
wsrep_cluster_address = gcomm://10.16.0.92,10.16.0.93,10.16.0.94
wsrep_slave_threads = 64
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = XXXX:XXXX
wsrep_cluster_name = galera
wsrep_node_name = client
wsrep_node_address = 10.16.0.92
wsrep_causal_reads = OFF
wsrep_provider_options = "gcache.size = 50G; gcs.fc_limit = 64"

wsrep_retry_autocommit = 1
wsrep_debug = 0

Thanks for any information on this case.

galera_logs_1.txt (10.5 KB)

galera_logs_2.txt (4.67 KB)

jrivera · June 8, 2017, 6:44am

2017-06-08 10:32:07 64450 [ERROR] WSREP: Certification failed for TO isolated action: source: 70435979-4b6e-11e7-86e1-ba8a94aea198 version: 3 local: 1 state: CERTIFYI35, d: -1, ts: 9335174392156268)

Something is making your nodes inconsistent, but it’s not clear from your error logs what it is. You can set wsrep_debug=1 on both nodes, and maybe we can see more information.
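
To compare the error logs across nodes, it can help to pull out only the WSREP error lines and the source UUID they mention. This is a rough Python sketch, assuming the log lines follow the same "date time pid [LEVEL] WSREP: message" layout as the line quoted above. The function name `wsrep_errors` is made up for illustration.

```python
import re

# Matches the error-log prefix seen in the quoted line, e.g.
# "2017-06-08 10:32:07 64450 [ERROR] WSREP: ..."
WSREP_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ "
    r"\[(?P<level>\w+)\] WSREP: (?P<msg>.*)$"
)
# A "source: <uuid>" field, as in the certification failure message.
SOURCE_UUID = re.compile(r"source: (?P<uuid>[0-9a-f-]{36})")


def wsrep_errors(lines):
    """Yield (timestamp, message, source_uuid_or_None) for WSREP ERROR lines."""
    for line in lines:
        m = WSREP_LINE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue
        src = SOURCE_UUID.search(m.group("msg"))
        yield m.group("ts"), m.group("msg"), (src.group("uuid") if src else None)


# The line quoted above, shortened after "local: 1".
sample = [
    "2017-06-08 10:32:07 64450 [ERROR] WSREP: Certification failed for "
    "TO isolated action: source: 70435979-4b6e-11e7-86e1-ba8a94aea198 "
    "version: 3 local: 1",
]
for ts, msg, uuid in wsrep_errors(sample):
    print(ts, uuid)
```

For a real log, pass an open file object instead of `sample`. Running it against each node's log and comparing timestamps and source UUIDs may show which node issued the failing action.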

Arelag1 · June 8, 2017, 8:44am

Thank you for your answer. I did see that line, but since it appears later in the messages, I wasn’t sure whether it was the cause.
I’ll try enabling wsrep_debug, but does it have a significant effect on performance? It is a production cluster with quite a lot of traffic, so I have to be careful.

Besides that, do you think upgrading the PXC and Galera packages could help?
