Cluster crashes on node crash

Hi

I set up the latest version of XtraDB Cluster on 4 servers last night.

The first machine contained an existing DB of 2 GB and was started with gcomm://

I added two more machines to the cluster and both synced (wsrep_sst_method=xtrabackup) without any issues.

This setup worked fine under a decent read/write load for a couple of hours. All writes were still being sent to the original DB server (master node).

I then added another node, which also synced with the cluster seamlessly.

Since an even number of nodes is not recommended, and because I wanted to test how the cluster responds when a node crashes, I killed the mysqld process on the 4th node.
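For reference, this is how I understand the quorum rule, as a rough sketch. It assumes every node has equal weight and that a group of nodes stays primary only if it is a strict majority of the previous membership. The function name has_quorum is my own, not a Galera API.

```python
def has_quorum(surviving_nodes, previous_nodes):
    """True if the survivors are a strict majority of the previous membership."""
    return 2 * surviving_nodes > previous_nodes

# 4-node cluster, one node crashes: 3 of 4 remain.
print(has_quorum(3, 4))  # True
# 4-node cluster split evenly by a network partition: 2 of 4 remain on each side.
print(has_quorum(2, 4))  # False
```

By that reasoning, killing one node out of four should have left the other three with a majority.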

Unfortunately this caused all sorts of havoc.
The three other nodes went into a weird state and most SQL commands started failing.
Even the "use dbname" SQL query returned a command not found error.

Looking at the logs of all the nodes, I found that they had all lost connectivity to each other and were stuck trying to reconnect indefinitely.
Pasting the logs from node 3 below:

121019 4:18:29 [Note] WSREP: Flow-control interval: [8, 16]
121019 4:18:29 [Note] WSREP: Received NON-PRIMARY.
121019 4:18:29 [Note] WSREP: New cluster view: global state: eb1d8efb-1967-11e2-0800-d9cf063d7dbe:86293, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
121019 4:18:29 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
121019 4:18:31 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 0
121019 4:18:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2:4567), attempt 1380
121019 4:19:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 30
121019 4:19:33 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2:4567), attempt 1410
121019 4:19:57 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 3911cf04-1975-11e2-0800-0427b9d45b0b (tcp://xx.xx.xx.x1.212:4567), attempt 90
121019 4:20:01 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 60
121019 4:20:18 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1440
121019 4:20:46 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 90
121019 4:21:03 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1470
121019 4:21:31 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 120
121019 4:21:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1500
121019 4:21:57 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 3911cf04-1975-11e2-0800-0427b9d45b0b (tcp://xx.xx.xx.x1:4567), attempt 120
121019 4:22:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 150
121019 4:22:33 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x4:4567), attempt 1530
121019 4:23:01 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2:4567), attempt 180
121019 4:23:18 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.0.0:4567') reconnecting to d784b124-1970-
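To make the reconnect loop easier to read, a small sketch like the one below could tally the highest attempt number per peer. The regular expressions and the function names join_wrapped and max_attempts are my own assumptions based on the line format in the excerpt above, not anything provided by Galera. It also rejoins entries that were wrapped by the terminal.

```python
import re

# A new log entry starts with a timestamp such as "121019 4:18:48 ".
TIMESTAMP = re.compile(r"^\d{6}\s+\d{1,2}:\d{2}:\d{2}\s")
# Assumed shape: "reconnecting to <peer-uuid> (tcp://host:port), attempt <n>"
RECONNECT = re.compile(r"reconnecting to ([0-9a-f-]+) \(tcp://[^)]*\), attempt (\d+)")

def join_wrapped(lines):
    """Glue continuation lines (no leading timestamp) onto the previous entry."""
    entries = []
    for line in lines:
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line
    return entries

def max_attempts(log_text):
    """Return {peer_uuid: highest reconnect attempt seen}."""
    result = {}
    for entry in join_wrapped(log_text.splitlines()):
        match = RECONNECT.search(entry)
        if match:
            peer, attempt = match.group(1), int(match.group(2))
            result[peer] = max(result.get(peer, 0), attempt)
    return result

sample = """\
121019 4:18:48 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.
0.0:4567') reconnecting to d784b124-1970-11e2-0800-073823812a38 (tcp://xx.xx.xx.x2
:4567), attempt 1380
121019 4:19:16 [Note] WSREP: (27914b3f-1969-11e2-0800-946a9ab9a970, 'tcp://0.0.
0.0:4567') reconnecting to 4a29ea43-1968-11e2-0800-ef510190974f (tcp://xx.xx.xx.x2
:4567), attempt 30
"""
print(max_attempts(sample))
```

Entries that are cut off before the attempt number, like the last one in my paste, are simply skipped.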

I tried restarting all nodes but that did not help.

In the end I had to reinitialize the original master by starting it with gcomm://

This isn't something we would expect from a cluster. If the crash of one node brings down the whole cluster, that defeats the purpose of having a cluster.

What could be the cause of this?

I use Ubuntu 12 and installed all the software using apt-get directly from the Percona repository.
All 4 servers were set up within the course of 1-2 hours.

Copy of my.cnf:

[mysqld]
datadir=/var/lib/mysql
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address=gcomm://xx.xx.xx.x2
wsrep_slave_threads=8
#wsrep_sst_method=rsync
wsrep_sst_method=xtrabackup
wsrep_replicate_myisam=1
wsrep_cluster_name=my_db_cluster
wsrep_node_name=web1
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
wsrep_sst_auth=root:XXXXXddXX
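To double-check which peers each node is told to contact, a sketch like this pulls the host list out of wsrep_cluster_address. It assumes the my.cnf layout shown above and the comma-separated gcomm:// address form. The function name cluster_peers is my own.

```python
import configparser

def cluster_peers(cnf_text):
    """Return the list of peer hosts named in wsrep_cluster_address."""
    # Only "=" is treated as a delimiter so values containing ":" stay intact.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(cnf_text)
    address = parser["mysqld"]["wsrep_cluster_address"]
    if address.startswith("gcomm://"):
        address = address[len("gcomm://"):]
    # Drop any "?option=value" suffix, then split the comma-separated hosts.
    hosts = address.split("?", 1)[0]
    return [host for host in hosts.split(",") if host]

sample = """\
[mysqld]
wsrep_cluster_address=gcomm://xx.xx.xx.x2
#wsrep_sst_method=rsync
wsrep_sst_method=xtrabackup
wsrep_sst_auth=root:XXXXXddXX
"""
print(cluster_peers(sample))  # ['xx.xx.xx.x2']
```

With my config this returns a single peer, and an empty gcomm:// (as used to start the first node) returns an empty list.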