Three-node cluster shutting down randomly


We deployed a three-node PXC cluster last year for one of our clients, and it has been working nicely, but we occasionally get a random crash that takes down the entire cluster. It happened again yesterday: the complete cluster shut down without warning.
The MySQL logs from the first node are attached (galera_logs_1.txt).

It seems there is a communication problem between the nodes, as suggested by the first line: "turning message relay requesting on, nonlive peers".
I'm not sure what the root cause could be: could this be network related or load related (load average, SQL traffic, number of requests)? Are there parameters to adjust?

After that, I tried bootstrapping the cluster but got another shutdown I don't understand: I bootstrapped the first node, then restarted the second, which initiated an SST.
After the second node was up and running (WSREP state Synced), I restarted the third node, and the other two nodes stopped immediately.
I put the messages from error log in attachment (galera_logs_2.txt).
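For reference, here is a sketch of the bootstrap/rejoin sequence I follow, assuming the stock Debian init script shipped with PXC 5.6 (paths and script names may differ on other setups):

```shell
# On the node with the most advanced state (check grastate.dat first):
cat /var/lib/mysql/grastate.dat    # pick the node with the highest seqno
/etc/init.d/mysql bootstrap-pxc    # start it as a new primary component

# Then on each remaining node, one at a time, waiting for it to sync
# before touching the next one:
/etc/init.d/mysql start
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"  # wait for 'Synced'
```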

It's not the first time I have had to reset a PXC cluster like this, but I don't understand why the last node created this situation.
Am I missing something?

For information, we are running Debian with the following packages:

ii percona-xtradb-cluster-56 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster with Galera
ii percona-xtradb-cluster-client-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3 3.14-1.wheezy amd64 Metapackage for latest version of galera3.
ii percona-xtradb-cluster-galera-3.x 3.14-1.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.29-25.15-1.wheezy amd64 Percona XtraDB Cluster database server binaries

I guess an upgrade of those versions is a must-have here.

Configuration file:


# Cluster configuration
wsrep_provider = /usr/lib/
wsrep_forced_binlog_format = ROW
wsrep_cluster_address = gcomm://,,
wsrep_slave_threads = 64
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = XXXX:XXXX
wsrep_cluster_name = galera
wsrep_node_name = client
wsrep_node_address =
wsrep_causal_reads = OFF
wsrep_provider_options = "gcache.size = 50G; gcs.fc_limit = 64"

wsrep_retry_autocommit = 1
wsrep_debug = 0
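Given the gcs.fc_limit = 64 setting above combined with 64 slave threads, flow control may be worth watching. A hedged sketch of the status counters I would check (counter names are from Galera 3; verify on your version):

```shell
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%'"     # pauses caused by a slow node
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%'" # replication queue depth
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"      # expect 3
```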

Thanks for any information about this case.

galera_logs_1.txt (10.5 KB)

galera_logs_2.txt (4.67 KB)

2017-06-08 10:32:07 64450 [ERROR] WSREP: Certification failed for TO isolated action: source: 70435979-4b6e-11e7-86e1-ba8a94aea198 version: 3 local: 1 state: CERTIFYI35, d: -1, ts: 9335174392156268)

Something is making your nodes inconsistent, but it's not clear from your error logs what. You can set wsrep_debug=1 on both nodes, and maybe we can see more information.

Thank you for your answer. I did see that line, but since it appears later in the log, I wasn't sure whether it was the cause or not.
I'll try enabling wsrep_debug, but is the performance impact significant? It is a production cluster with quite a lot of traffic, so I have to be careful.
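If it helps, wsrep_debug is a dynamic variable in PXC 5.6, so it can be toggled at runtime without a restart. A sketch of how I would enable it temporarily (the main cost is error-log volume rather than query throughput, but watch disk space):

```shell
mysql -e "SET GLOBAL wsrep_debug = 1"   # verbose wsrep logging to the error log
# ... reproduce or wait for the problem, then:
mysql -e "SET GLOBAL wsrep_debug = 0"   # turn it back off to limit log growth
```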

Besides that, do you think upgrading the PXC and Galera packages could help?