Lost server synchronisation in a three-node cluster

Hi,

I have a three-node cluster, and one of the servers regularly loses synchronisation. I now have corruption in the InnoDB tablespace, and I am uncomfortable running with innodb_force_recovery=6 because I have 2 important production databases on these servers. I removed the failed server from the cluster and installed a new one, but all the symptoms reappeared: synchronisation lost, and corruption once again.
The servers run Ubuntu 10.04.4 LTS, and I use percona-xtradb-cluster-server-5.5 (version 5.5.31-23.7.5-438).

You can find the error log in the attached file, and the my.cnf file below:


[client]
password = 'xxxxx'
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
wsrep_urls=gcomm://192.168.183.40:4567,gcomm://192.168.183.41:4567,gcomm://192.168.183.42:4567

[mysqld]
datadir=/var/lib/mysql
user=mysql

binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so

wsrep_slave_threads=2
wsrep_cluster_name=prod_pa
wsrep_sst_method=rsync
wsrep_node_name=lxpadb03

default_storage_engine=InnoDB
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

#tuning
max_allowed_packet = 16M
max_connect_errors = 1000000
skip_name_resolve
query_cache_size=0
query_cache_type=0
tmp_table_size = 32M
max_heap_table_size = 32M
max_connections = 500
thread_cache_size = 50
open_files_limit = 65535
table_definition_cache = 4096
table_open_cache = 4096
# INNODB #
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 3072M

Thanks in advance.
Laeti

lxpadb03.err.zip (60 KB)

So this InnoDB corruption happened only on this single node? And by reinstalling a node, do you mean a new install on the same machine or on a completely different machine?
If only a single machine shows data corruption, I would check dmesg and /var/log/syslog for any signs of disk or memory errors. A memory check would be good to run too.
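A rough sketch of that log check, in case it helps. The list of phrases is only an assumed starting point (common kernel messages for disk, filesystem and memory trouble), not an exhaustive or official list, and the helper name `suspicious_lines` is made up for this example.

```python
import re

# Assumed starting list of kernel-log phrases that often accompany failing
# disks or memory; extend it for your own hardware and filesystem.
SUSPECT = re.compile(
    r"I/O error|EXT[234]-fs error|Medium Error|Machine check|"
    r"Out of memory|segfault|blocked for more than",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the log lines that match any of the suspect phrases."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]

# Small made-up sample; real input would be the output of dmesg or the
# contents of /var/log/syslog.
sample = [
    "[ 1234.5] Buffer I/O error on device sda1, logical block 42",
    "[ 1300.0] eth0: link up",
    "Out of memory: Kill process 4321 (mysqld)",
]
for hit in suspicious_lines(sample):
    print(hit)
```

To run it against a real log, pass an open file instead of the sample list, for example `suspicious_lines(open("/var/log/syslog", errors="replace"))`. An empty result only means none of these phrases appeared, not that the hardware is healthy.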
If all nodes are experiencing data corruption, I would try taking one of them off the cluster and dumping all data with mysqldump if possible, probably in one of the innodb_force_recovery modes.

Yes, the InnoDB corruption happens only on this node. By reinstalling, I mean that I created a new virtual machine and kept only the same hostname and IP address. All three servers are virtual machines.
There are no such errors in dmesg or /var/log/syslog.
What I did notice in /var/log/syslog is that the second database is not synchronised. I see lines such as "rsync to rsync_sst/./mysqlslap", "rsync to rsync_sst/./performance_schema" and "rsync to rsync_sst/./private_prod", but never one for the toplink_prod database.
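To make this check repeatable, here is a minimal sketch that lists which expected databases never appear in the rsync SST lines. It assumes the log lines contain the text "rsync to rsync_sst/./<name>" as quoted above; the function names and the sample list are made up for the example.

```python
import re

# Matches the database directory in lines such as
# "rsync to rsync_sst/./private_prod" (the format quoted above).
SST_LINE = re.compile(r"rsync to rsync_sst/\./([^/\s]+)")

def databases_seen(log_lines):
    """Return the set of directory names mentioned in rsync SST log lines."""
    seen = set()
    for line in log_lines:
        match = SST_LINE.search(line)
        if match:
            seen.add(match.group(1))
    return seen

def missing_databases(log_lines, expected):
    """Return, sorted, the expected databases that never show up in the log."""
    return sorted(set(expected) - databases_seen(log_lines))

# Sample lines modelled on the ones quoted above; real syslog lines also
# carry a timestamp and host prefix, which the regex simply skips over.
sample = [
    "rsync to rsync_sst/./mysqlslap",
    "rsync to rsync_sst/./performance_schema",
    "rsync to rsync_sst/./private_prod",
]
print(missing_databases(sample, ["private_prod", "toplink_prod"]))
```

With the sample lines, only toplink_prod is reported as missing, which matches what I see in the real log.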

Thanks.
Laeti

Are there any other differences between this failing node and the two other nodes? Do they all live on the same host server?
There is also a chance that the data is corrupted on the source node from which the SST was performed. I would suggest taking a backup of the lxpadb01 node with Percona XtraBackup and checking whether preparing and using this backup on another test host works.
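As a sketch of the two XtraBackup steps (take the backup, then prepare it), the snippet below only builds and prints the command lines and does not run anything. It assumes the innobackupex wrapper shipped with XtraBackup for 5.5-era servers; the user name "backup" and the directory /data/backups/lxpadb01 are placeholders, and you would add --password or other connection options as needed.

```python
import shlex

def backup_command(user, target_dir):
    """Build the innobackupex call that writes a full backup into target_dir."""
    # --no-timestamp stores the backup directly in target_dir instead of a
    # timestamped subdirectory, so the prepare step knows the exact path.
    return ["innobackupex", f"--user={user}", "--no-timestamp", target_dir]

def prepare_command(backup_dir):
    """Build the innobackupex call that prepares (applies the log to) a backup."""
    return ["innobackupex", "--apply-log", backup_dir]

def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)

# Placeholder values, not taken from the thread.
target = "/data/backups/lxpadb01"
print(as_shell(backup_command("backup", target)))
print(as_shell(prepare_command(target)))
```

If the prepare step finishes cleanly and a test mysqld starts on the copied data without InnoDB errors, the source node is probably not where the corruption comes from.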