2 nodes are going out of sync in a 3-node PXC setup at the time of resyncing


Requesting your help in resyncing all 3 nodes.
Thank you…

RPMs used:
Percona-XtraDB-Cluster-shared-5.5.27-23.6.356.rhel6.x86_64
Percona-XtraDB-Cluster-server-5.5.27-23.6.356.rhel6.x86_64
percona-release-0.0-1.x86_64
Percona-XtraDB-Cluster-client-5.5.27-23.6.356.rhel6.x86_64
Percona-XtraDB-Cluster-galera-2.0-1.114.rhel6.x86_64
percona-xtrabackup-2.0.3-470.rhel6.x86_64

OS: CentOS release 6.3 (Final)
Environment: Virtual Systems.

Here are the mysql error logs from all 3 nodes:
Node 2 (up):
WSREP: FK key len exceeded 0 4294967295 3500
131227 2:58:46 [ERROR] WSREP: FK key set failed: 11
WSREP: FK key append failed

Node 3 (down):
131227 5:00:11 [Note] WSREP: sst_donor_thread signaled with 0
131227 5:00:11 [Note] WSREP: Flushing tables for SST…
131227 5:00:11 [Note] WSREP: Provider paused at cf67b4da-6ea7-11e3-0800-7176739bc3d8:261
131227 5:00:11 [Note] WSREP: Tables flushed.
InnoDB: Warning: a long semaphore wait:
--Thread 139738020943616 has waited at trx0rseg.ic line 46 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f177f07a6b8 '&block->lock'
a writer (thread id 139738020943616) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.c line 1319
Last time write locked in file /home/jenkins/workspace/percona-xtradb-cluster-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.5.27/Percona-XtraDB-Cluster-5.5.27/storage/innobase/include/trx0rseg.ic line 46
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:


SEMAPHORES

OS WAIT ARRAY INFO: reservation count 46, signal count 44
--Thread 139738020943616 has waited at trx0rseg.ic line 46 for 271.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f177f07a6b8 '&block->lock'
a writer (thread id 139738020943616) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.c line 1319
Last time write locked in file /home/jenkins/workspace/percona-xtradb-cluster-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.5.27/Percona-XtraDB-Cluster-5.5.27/storage/innobase/include/trx0rseg.ic line 46
Mutex spin waits 38, rounds 925, OS waits 30
RW-shared spins 15, rounds 432, OS waits 14
RW-excl spins 1, rounds 60, OS waits 2
Spin rounds per wait: 24.34 mutex, 28.80 RW-shared, 60.00 RW-excl

TRANSACTIONS

Trx id counter A0E406071
Purge done for trx's n:o < A0E40606E undo n:o < 0
History list length 618
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION A0E40606E, not started
MySQL thread id 3, OS thread handle 0x7f174a757700, query id 2974 committed 260
---TRANSACTION A0E406070, not started
MySQL thread id 1, OS thread handle 0x7f1b16edb700, query id 2976 committed 261

END OF INNODB MONITOR OUTPUT

InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 139738020943616 has waited at trx0rseg.ic line 46 for 303.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f177f07a6b8 '&block->lock'
a writer (thread id 139738020943616) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.c line 1319
Last time write locked in file /home/jenkins/workspace/percona-xtradb-cluster-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.5.27/Percona-XtraDB-Cluster-5.5.27/storage/innobase/include/trx0rseg.ic line 46
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

Node 1 (down):
131227 4:49:46 [Note] WSREP: 1 (Node3): State transfer from 0 (Node1) complete.
131227 4:49:46 [Note] WSREP: Member 1 (Node3) synced with group.
05:00:03 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

131227 5:11:34 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
131227 5:11:34 [Note] WSREP: forgetting 49cd72df-6eb2-11e3-0800-3db8fd926ddb (tcp://XXX.XXX.XXX.53-Node3:4567)
131227 5:11:34 [Note] WSREP: (bf5de37d-6eb3-11e3-0800-1b8b698cefc9, 'tcp://0.0.0.0:4567') turning message relay requesting off
131227 5:11:34 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 5ac327af-6eb5-11e3-0800-8a7f196d2532
131227 5:11:34 [Note] WSREP: STATE EXCHANGE: sent state msg: 5ac327af-6eb5-11e3-0800-8a7f196d2532
131227 5:11:34 [Note] WSREP: STATE EXCHANGE: got state msg: 5ac327af-6eb5-11e3-0800-8a7f196d2532 from 0 (Node1)
131227 5:11:34 [Note] WSREP: STATE EXCHANGE: got state msg: 5ac327af-6eb5-11e3-0800-8a7f196d2532 from 1 (Node2)
131227 5:11:34 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 4,
members = 1/2 (joined/total),
act_id = 864,
last_appl. = 835,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = cf67b4da-6ea7-11e3-0800-7176739bc3d8
131227 5:11:34 [Warning] WSREP: Donor 49cd72df-6eb2-11e3-0800-3db8fd926ddb is no longer in the group. State transfer cannot be completed, need to abort. Aborting…
131227 5:11:34 [Note] WSREP: /usr/sbin/mysqld: Terminated.
131227 05:11:34 mysqld_safe mysqld from pid file /mnt/data//Node1.pid ended

131227 5:24:10 [Note] WSREP: Assign initial position for certification: 960, protocol version: 2
131227 5:24:10 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (cf67b4da-6ea7-11e3-

131227 5:25:53 [Note] WSREP: Quorum results:
version = 2,
component = NON-PRIMARY,
conf_id = -1,
members = 1/1 (joined/total),
act_id = -1,
last_appl. = -1,
protocols = -1/-1/-1 (gcs/repl/appl),
group UUID = 00000000-0000-0000-0000-000000000000
131227 5:25:53 [Note] WSREP: Flow-control interval: [8, 16]
131227 5:25:53 [Note] WSREP: Received NON-PRIMARY.
131227 5:25:53 [Note] WSREP: Shifting JOINER -> OPEN (TO: 961)
131227 5:25:59 [Note] WSREP: cleaning up f9d65922-6eb6-11e3-0800-4de8ca27dd9e (tcp://XXX.XXX.XXX.52-Node2:4567)


Node 2 my.cnf, [mysqld] section (this node is up):
[mysqld]

# GENERAL

user = mysql
default_storage_engine = InnoDB

server_id=1
wsrep_cluster_address=gcomm://
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_slave_threads=2
wsrep_cluster_name= ecomm
wsrep_sst_method=rsync
wsrep_node_name=Node2
wsrep_sst_receive_address=XXX.XXX.XXX.52-Node2

# MyISAM

key_buffer_size = 32M
myisam_recover = FORCE,BACKUP

# SAFETY

max_allowed_packet = 64M
max_connect_errors = 1000000
skip_name_resolve
sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
sysdate_is_now = 1
innodb = FORCE
innodb_strict_mode = 1

# DATA STORAGE

datadir = /mnt/data/

# BINARY LOGGING

log_bin = /mnt/data/mysql-bin
expire_logs_days = 14
sync_binlog = 1
binlog_format = ROW

# CACHES AND LIMITS

tmp_table_size = 128M
max_heap_table_size = 128M
query_cache_type = 0
query_cache_size = 8
max_connections = 2010
thread_cache_size = 50
open_files_limit = 65535
table_definition_cache = 4096
table_open_cache = 12000

# INNODB

innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 14G
innodb_locks_unsafe_for_binlog = 1
innodb_autoinc_lock_mode = 2
wait_timeout = 1500
interactive_timeout = 1500


Node 1 my.cnf, [mysqld] section (this node is down):
[mysqld]

# GENERAL

user = mysql
default_storage_engine = InnoDB

server_id=2
wsrep_cluster_address=gcomm://XXX.XXX.XXX.52-Node2
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_slave_threads=2
wsrep_cluster_name= ecomm
wsrep_sst_method=rsync
wsrep_node_name=Node1
wsrep_sst_receive_address=XXX.XXX.XXX.51-Node1

# MyISAM

key_buffer_size = 32M
myisam_recover = FORCE,BACKUP

# SAFETY

max_allowed_packet = 64M
max_connect_errors = 1000000
skip_name_resolve
sql_mode = STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
sysdate_is_now = 1
innodb = FORCE
innodb_strict_mode = 1

# DATA STORAGE

datadir = /mnt/data/

# BINARY LOGGING

log_bin = /mnt/data/mysql-bin
expire_logs_days = 14
sync_binlog = 1
binlog_format = ROW

# CACHES AND LIMITS

tmp_table_size = 128M
max_heap_table_size = 128M
query_cache_type = 0
query_cache_size = 8
max_connections = 2010
thread_cache_size = 50
open_files_limit = 65535
table_definition_cache = 4096
table_open_cache = 12000

# INNODB

innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 14G
innodb_locks_unsafe_for_binlog = 1
innodb_autoinc_lock_mode = 2
wait_timeout = 1500
interactive_timeout = 1500


Try changing two things: a newer PXC version than 5.5.27 (the latest if possible), and wsrep_sst_method=xtrabackup (much less locking than rsync).
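For reference, switching the SST method is a small my.cnf change, but the xtrabackup method also needs MySQL credentials via wsrep_sst_auth, since it connects to the donor's mysqld. A sketch (the sstuser name and password below are placeholders, not from this thread; create that user yourself on the donor):

```
[mysqld]
# use xtrabackup instead of rsync for state snapshot transfers
wsrep_sst_method = xtrabackup
# placeholder credentials for the SST user (not from this thread)
wsrep_sst_auth = sstuser:s3cret
```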

Hi Przemek, thanks for the suggestion.
Until now, all 3 nodes had been working quite fine. Now and then a node went out of sync; we would take downtime and resync it, after which the cluster came back to normal.
But in the latest scenario, 2 nodes went out of sync. The actions taken are below.

  1. Took downtime.
  2. Resynced the nodes from the surviving Node 2; the resync completed successfully.
  3. Created a blank schema on one node and verified it on the other nodes; the blank schema was synchronized to the other nodes as well, i.e. OK.
  4. After 15 minutes, the nodes went out of sync again. The errors are posted above.

The errors are different on each node that is out of sync (Node 1 and Node 3).
Could you please check these errors and suggest what can be done to fix them and bring the nodes back in sync, up and running? Our problem is that the nodes go out of sync very frequently.
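When resyncing, it can help to confirm which node has the most advanced committed state before choosing it as the bootstrap node; Galera records this as a seqno in grastate.dat in the datadir (e.g. /mnt/data/grastate.dat in this setup). A minimal sketch of reading it, using a mock file written to /tmp with an illustrative seqno:

```shell
# Mock grastate.dat in the Galera 2.x format; on a real node, read the
# one in the datadir instead. The seqno below is made up for illustration.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    cf67b4da-6ea7-11e3-0800-7176739bc3d8
seqno:   961
EOF

# The node with the highest seqno is the safest one to bootstrap from.
seqno=$(awk '/^seqno:/ {print $2}' /tmp/grastate.dat)
echo "seqno=$seqno"
```

A seqno of -1 means the node was shut down uncleanly and its position is unknown, so it will need a full SST anyway.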

As a long-term action, we can upgrade PXC and Galera to the latest version.
But as an immediate action, do you have any suggestions to get the 3-node cluster back to the working state it was in previously?

Also, for info on the cluster type: this is a multi-master 3-node cluster (all nodes are masters).
Please suggest…

Thanks Przemek.

Just to check: Galera 2.0 does not support wsrep_sst_method=xtrabackup (please correct me if I'm wrong).

Also, we are using 5 HAProxy clients on 5 app systems for application connections to the cluster database.

Also, writes always go to 1 node only from all 5 apps (via HAProxy), to avoid deadlocks.
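For what it's worth, that single-writer pattern is usually expressed in haproxy.cfg with one active server and the others marked as backup. A sketch under the assumption that Node2 is the writer (the port and the redacted IPs are as in this thread; the listen name is made up):

```
listen pxc-writes
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    # all writes go to Node2; the backups take over only if it fails
    server Node2 XXX.XXX.XXX.52:3306 check
    server Node1 XXX.XXX.XXX.51:3306 check backup
    server Node3 XXX.XXX.XXX.53:3306 check backup
```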

Poorna PC,

wsrep_sst_method=xtrabackup is OK for Galera 2.0.
You can read about the pros and cons on the Codership site:
http://www.codership.com/wiki/doku.php?id=sst_mysql

One of the errors is similar to the error described in this bug:
https://bugs.launchpad.net/codership-mysql/+bug/1057910

The bug is fixed in version 5.5.28, so I'd suggest upgrading.

Thanks Mixa…