PXC 5.6 crashes while blacklisting some IPs

Hello Folks,

We have a setup that is like this :
3 debian servers
Those are VMS, on 2 differents physical server (2 on one side, 1 on the other)
We are running PXC 5.6 from the percona repo
All writes goes to one server, the other are hot standby, keepalived manages the virtual IP.

The problem is, at some point in the week there seems to be a replication issue, leading to the nodes resnchronizing from the master, and those events goes for some time until the moment where the master crashes.
What I can see in the log are something like this :

The log from the primary got erased unfortunately.

On one of the ‘slave’ :

2014-04-12 00:36:45 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:45 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.10.10.14:4567
2014-04-12 00:36:45 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') reconnecting to 96e20aae-bd97-11e3-b30b-a65820d5fee1 (tcp://10.10.10.14:4567), attempt 0
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') cleaning up duplicate 0x2a77d60 after established 0x2a89000
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:46 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-04-12 00:36:48 3497 [Note] WSREP: evs::proto(9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, GATHER, view_id(REG,07afbea5-c191-11e3-a533-7fe32c71cd52,24)) suspecting node: 96e20aae-bd97-11e3-b30b-a65820d5fee1
2014-04-12 00:36:48 3497 [Note] WSREP: evs::proto(9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, GATHER, view_id(REG,07afbea5-c191-11e3-a533-7fe32c71cd52,24)) suspecting node: 96e20aae-bd97-11e3-b30b-a65820d5fee1
2014-04-12 00:36:49 3497 [Note] WSREP: declaring 07afbea5-c191-11e3-a533-7fe32c71cd52 stable
2014-04-12 00:36:49 3497 [Note] WSREP: Node 07afbea5-c191-11e3-a533-7fe32c71cd52 state prim
2014-04-12 00:36:49 3497 [Note] WSREP: view(view_id(PRIM,07afbea5-c191-11e3-a533-7fe32c71cd52,25) memb {
07afbea5-c191-11e3-a533-7fe32c71cd52,0
9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b,0
} joined {
} left {
} partitioned {
96e20aae-bd97-11e3-b30b-a65820d5fee1,0
})
2014-04-12 00:36:49 3497 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2014-04-12 00:36:49 3497 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2014-04-12 00:36:49 3497 [Note] WSREP: forgetting 96e20aae-bd97-11e3-b30b-a65820d5fee1 (tcp://10.10.10.14:4567)
2014-04-12 00:36:49 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:49 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:49 3497 [Note] WSREP: STATE EXCHANGE: sent state msg: c570a799-c1c9-11e3-b3c1-eff1874c545a
2014-04-12 00:36:49 3497 [Note] WSREP: STATE EXCHANGE: got state msg: c570a799-c1c9-11e3-b3c1-eff1874c545a from 0 (mysql3)
2014-04-12 00:36:49 3497 [Note] WSREP: STATE EXCHANGE: got state msg: c570a799-c1c9-11e3-b3c1-eff1874c545a from 1 (mysql1)
2014-04-12 00:36:49 3497 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = 19,
members = 2/2 (joined/total),
act_id = 1527991,
last_appl. = 1527915,
protocols = 0/5/2 (gcs/repl/appl),
group UUID = 36a9013d-7860-11e3-0800-7fc3d2bd6c58
2014-04-12 00:36:49 3497 [Note] WSREP: Flow-control interval: [23, 23]
2014-04-12 00:36:49 3497 [Note] WSREP: New cluster view: global state: 36a9013d-7860-11e3-0800-7fc3d2bd6c58:1527991, view# 20: Primary, number of nodes: 2, my index: 1, protocol version 2
2014-04-12 00:36:49 3497 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-04-12 00:36:49 3497 [Note] WSREP: REPL Protocols: 5 (3, 1)
2014-04-12 00:36:49 3497 [Note] WSREP: Assign initial position for certification: 1527991, protocol version: 3
2014-04-12 00:36:49 3497 [Note] WSREP: Service thread queue flushed.
2014-04-12 00:36:50 3497 [Warning] WSREP: discarding established (time wait) 96e20aae-bd97-11e3-b30b-a65820d5fee1 (tcp://10.10.10.14:4567)
2014-04-12 00:36:50 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:52 3497 [Warning] WSREP: discarding established (time wait) 96e20aae-bd97-11e3-b30b-a65820d5fee1 (tcp://10.10.10.14:4567)
2014-04-12 00:36:52 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:53 3497 [Warning] WSREP: discarding established (time wait) 96e20aae-bd97-11e3-b30b-a65820d5fee1 (tcp://10.10.10.14:4567)
[...]
2014-04-12 00:36:55 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping
2014-04-12 00:36:55 3497 [Warning] WSREP: evs::proto(9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, GATHER, view_id(REG,07afbea5-c191-11e3-a533-7fe32c71cd52,25)) source 96e20aae-bd97-11e3-b30b-a65820d5fee1 is not supposed to be representative
2014-04-12 00:36:57 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, 'tcp://0.0.0.0:4567') address 'tcp://10.10.10.13:4567' pointing to uuid 9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b is blacklisted, skipping

my config looks like this :

root@mysql1:~# cat /etc/mysql/my.cnf
# The following options will be passed to all MySQL clients
[client]
#password = your_password
port = 3306
socket = /var/run/mysqld/mysqld.sock

# Here follows entries for some specific programs


[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock

user = mysql
default_storage_engine = InnoDB
datadir = /storage/mysql/

port = 3306
socket = /tmp/mysql.sock
skip-external-locking
skip-name-resolve
key_buffer_size = 384M
max_allowed_packet = 16M
table_open_cache = 512
sort_buffer_size = 4M
read_buffer_size = 4M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
#query_cache_size = 128M
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 8

#skip-networking

# BINARY LOGGING #
#log-bin = /var/lib/mysql/mysql-bin
#expire-logs-days = 14
#sync-binlog = 1
# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW


# Uncomment the following if you are using InnoDB tables
innodb_data_home_dir = /storage/mysql/
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /storage/mysql/
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 2G
innodb_additional_mem_pool_size = 20M
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size = 500M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50


# LOGGING #
log_error = /storage/mysql/mysql-error.log
log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /storage/mysql/mysql-slow.log
slow_query_log_use_global_control = all

## xtradb
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address=gcomm://10.10.10.13:4567,10.10.10.14:4567,10.10.10.16:4567
wsrep_node_address=10.10.10.13
#wsrep_slave_threads=2
wsrep_cluster_name=sqmCluster
wsrep_sst_method=xtrabackup-v2
wsrep_node_name=mysql1
wsrep_sst_auth="USER:PASS"

### auto increment issue
#auto_increment_offset=1
#auto_increment_increment=1
wsrep_auto_increment_control=OFF
###
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2


[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates

[myisamchk]
key_buffer_size = 256M
sort_buffer_size = 256M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

I would like to add that there is no network issue as far as I can tell, those two physical servers are connected directly back to back.

Thanks for you help,

Thomas

No one from percona around ?

The warnings about blacklisting are normal and are related the the node seeing itself and excluding himself from communication negotiations. While the warnings appear mostly at a moment where cluster configuration changes, which in this case was the node 10.10.10.14 considered as not available:
2014-04-12 00:36:45 3497 [Note] WSREP: (9e6115f5-c1aa-11e3-8d85-7a0bafd1c35b, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://10.10.10.14:4567
This looks to be network problem where connection with 10.10.10.14 is not reliable hence two other cluster members marked it as partitioned node.