two-node cluster hangs when primary node is gone (ignore-sb=true)

Hi,
I have a two-node Percona XtraDB Cluster (v5.5.33) with Galera. Galera is configured to ignore split-brain (pc.ignore_sb=yes).

When I perform failover tests on these nodes, I see strange behaviour that I can't get a grip on.

The following is the scenario:
Node2 was the cluster creator: at the start of this scenario it has wsrep_cluster_address "gcomm://", while node1 has "gcomm://10.0.100.2".
Node1 (10.0.100.1) and node2 (10.0.100.2) are both running, and only node1 is receiving data.
When I reboot node1, node2 detects this and happily keeps receiving data. No problem here. Node1 comes back up and performs IST to rejoin the cluster.
All is still well.
Several minutes after node1's IST has finished and the node is ready, I stop node2. As soon as node2 starts shutting down Percona, node1 hangs all transactions.
Running "SHOW STATUS LIKE 'wsrep%';" shows that node1 still 'believes' it is part of the cluster and does not seem to detect that the second node is gone.
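
For reference, these are the status variables I check (a console sketch; they are standard Galera status counters):

```sql
-- Run on node1 after node2 has been stopped:
SHOW STATUS LIKE 'wsrep_cluster_size';        -- does node1 still count node2?
SHOW STATUS LIKE 'wsrep_cluster_status';      -- 'Primary' vs 'non-Primary'
SHOW STATUS LIKE 'wsrep_local_state_comment'; -- 'Synced' when the node is healthy
SHOW STATUS LIKE 'wsrep_ready';               -- 'ON' if the node accepts queries
```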

I'm using all InnoDB tables and have a high load on the server: several TB of data with a 60GB InnoDB buffer pool.

I also tried running "SET GLOBAL wsrep_cluster_address='gcomm://';" on node2 to force it to be the cluster creator. But alas, it does not solve the issue described above.
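
For completeness, this is the runtime change I mean (sketched in the mysql console; changing the address at runtime makes the node reconnect with the new setting):

```sql
-- On node2, while both nodes are still up:
SET GLOBAL wsrep_cluster_address='gcomm://';
SHOW STATUS LIKE 'wsrep_cluster_status';  -- check whether it reports 'Primary'
```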

Why is the node hanging? And, more importantly, how can I fix it?

Many Thanks!

My my.cnf (config of node1; node2 is the same apart from, of course, the IP addresses) looks like:

# GENERAL

user = mysql
default-storage-engine = innodb
socket = /data/mysql/mysql.sock
pid-file = /data/mysql/mysql.pid

slow-query-log = ON
log-queries-not-using-indexes = ON
innodb_print_all_deadlocks = ON

max_allowed_packet = 120M
max_connect_errors = 2000000000000
skip-name-resolve

sysdate-is-now = 1
innodb = FORCE
innodb-strict-mode = 1

datadir = /data/mysql
tmpdir = /data/mysql-tmp

log-bin = /data/mysql/mysql-bin
expire-logs-days = 5
sync-binlog = 1

log-slave-updates = 1
relay-log = /data/mysql/relay-bin
slave-net-timeout = 60
sync-master-info = 1
sync-relay-log = 1
sync-relay-log-info = 1

tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0

max-connections = 1000
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 1000

innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1

innodb-buffer-pool-size = 60G

server-id = 1
binlog_format=ROW

innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
bind-address=0.0.0.0

wsrep_provider="/usr/lib/libgalera_smm.so"
wsrep_provider_options="pc.ignore_sb = yes; evs.keepalive_period = PT1S; evs.inactive_check_period = PT1S; evs.suspect_timeout = PT5S; evs.inactive_timeout = PT10S; evs.install_timeout = PT10S; gcache.size=32G"

wsrep_cluster_name="percona_cluster"
wsrep_cluster_address=gcomm://10.0.100.2

wsrep_node_name=node1
wsrep_node_address=10.0.100.1

wsrep_slave_threads=16

wsrep_certify_nonPK=1
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
wsrep_debug=0
wsrep_convert_LOCK_to_trx=0
wsrep_retry_autocommit=1
wsrep_auto_increment_control=1
wsrep_drupal_282555_workaround=0

wsrep_causal_reads=0
wsrep_notify_cmd=

wsrep_sst_method=xtrabackup
wsrep_sst_auth=mysql_sst:*********

# Desired SST donor name.

#wsrep_sst_donor=

# Reject client queries when donating SST (false)

#wsrep_sst_donor_rejects_queries=0

# Protocol version to use

wsrep_protocol_version=

If you bootstrapped node2, then try starting node2 first again (let it start completely) and then node1.
How did you come to the conclusion that it was hanging? You say it's under high load, and with InnoDB you can expect some slowness.
What does the MySQL error log say?
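
Roughly this order (a sketch using the stock PXC init script; the bootstrap-pxc argument may differ per version and distro):

```shell
# On node2: bootstrap a fresh cluster (node2 was the original creator)
/etc/init.d/mysql bootstrap-pxc

# Wait until node2 reports wsrep_local_state_comment = 'Synced', then on node1:
/etc/init.d/mysql start    # joins via gcomm://10.0.100.2 and performs IST/SST
```

And do check the error log (by default <datadir>/<hostname>.err, so under /data/mysql/ with your config) for WSREP messages around the time node2 was stopped.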