Cluster down with 1/3 node down

Hi,

We installed and configured a 3-node cluster. Synchronization works, but when I stop mysql on one node, all the remaining nodes become desynchronized and stop accepting new connections.

==================== Configuration of galera: ====================
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_name="db_cluster"
wsrep_slave_threads=12
wsrep_certify_nonPK=1
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
wsrep_debug=0
wsrep_convert_LOCK_to_trx=0
wsrep_retry_autocommit=1
wsrep_auto_increment_control=1
wsrep_replicate_myisam=1
wsrep_drupal_282555_workaround=0
wsrep_causal_reads=0
wsrep_sst_method=rsync

server-id=3
wsrep_node_address=192.168.10.3
wsrep_cluster_address="gcomm://"
wsrep_provider_options="pc.weight=0; gcache.size=8G; evs.keepalive_period=PT3S; evs.inactive_check_period=PT10S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.consensus_timeout=PT1M; evs.send_window=1024; evs.user_send_window=512;"

===========================================================

Can you help us, please?

EDIT:

To add some information, here is the log I get on one of the desynchronized nodes (mysql still running):

2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,2) memb {
e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
} joined {
} left {
} partitioned {
fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
})
2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,3) memb {
e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
} joined {
} left {
} partitioned {
fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
})
2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
2014-02-05 16:02:05 19183 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 192992574)
2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-02-05 16:02:06 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.10.1:4567
2014-02-05 16:02:07 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 0
2014-02-05 16:02:52 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 30
2014-02-05 16:03:37 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 60
2014-02-05 16:04:22 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 90

So this node keeps trying to connect to a node which is down instead of staying in the cluster alone.
To make it form a synchronized one-node cluster on its own, I have to force it by issuing:
mysql> set global wsrep_cluster_address="gcomm://";

Up :)

To add some more information: I found a way to work around the problem by adding pc.ignore_sb = yes to wsrep_provider_options.

Does anybody have an idea about this, please?

Do not ignore split-brain (pc.ignore_sb) unless it's an emergency.
How did you set up the cluster? Did you follow the standard procedure? Installing Percona XtraDB Cluster: http://www.percona.com/doc/percona-x...tallation.html
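
For background on why a lone node refuses to serve: Galera only keeps a component Primary while it holds quorum. Below is a minimal Python sketch of the weighted-quorum rule as I understand it from the Galera documentation. The function name has_quorum and the node labels are made up for illustration, and the real provider does more than this.

```python
def has_quorum(previous, left_gracefully, current, weights=None):
    """Weighted-quorum rule: the current component stays Primary when
    sum(current) > (sum(previous) - sum(left_gracefully)) / 2.

    previous        -- nodes in the last Primary component
    left_gracefully -- nodes that announced a clean shutdown
    current         -- nodes still reachable in this component
    weights         -- optional pc.weight per node (default 1)
    """
    weights = weights or {}

    def total(nodes):
        return sum(weights.get(node, 1) for node in nodes)

    return total(current) > (total(previous) - total(left_gracefully)) / 2


nodes = ["node1", "node2", "node3"]

# One node vanishes without saying goodbye; the other two still see each other.
print(has_quorum(nodes, [], ["node2", "node3"]))   # True  (2 > 1.5)

# A node that ends up alone after an unclean partition.
print(has_quorum(nodes, [], ["node3"]))            # False (1 > 1.5 fails)

# Hypothetical: every node runs with pc.weight=0, so no component can win.
zero = {n: 0 for n in nodes}
print(has_quorum(nodes, ["node1"], ["node2", "node3"], zero))  # False (0 > 0 fails)
```

The posted wsrep_provider_options contains pc.weight=0. If that setting is present on all three nodes (an assumption, since only one node's config is shown), the rule above could never be satisfied, so it may be worth removing it and retesting.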

Try this:
Disable pc.ignore_sb by commenting it out.
Double-check the my.cnf configuration on all nodes and set the gcomm values accordingly (replace node1, node2, node3 with their IPs):
node1 -> gcomm://
node2 -> gcomm://node1,node2,node3
node3 -> gcomm://node1,node2,node3

Then, after all nodes have synced, change the gcomm value of node1 to gcomm://node1,node2,node3 and restart mysql on node1.
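
The same layout as a small Python sketch, in case it helps to see the strings spelled out. The helper name cluster_addresses is hypothetical; 192.168.10.1 and 192.168.10.3 appear in the original post, while 192.168.10.2 is assumed for the third node.

```python
def cluster_addresses(ips, bootstrap_ip=None):
    """Return the wsrep_cluster_address value for each node.

    ips          -- IPs of all cluster members
    bootstrap_ip -- node started with an empty gcomm:// to create the
                    cluster; None means the cluster already exists
    """
    full = "gcomm://" + ",".join(ips)
    return {ip: "gcomm://" if ip == bootstrap_ip else full for ip in ips}


ips = ["192.168.10.1", "192.168.10.2", "192.168.10.3"]  # .2 is assumed

# First start: node1 bootstraps, the others join via the full list.
for ip, addr in cluster_addresses(ips, bootstrap_ip=ips[0]).items():
    print(f'{ip}: wsrep_cluster_address="{addr}"')

# Once all nodes are synced: every node, including node1, gets the full list.
for ip, addr in cluster_addresses(ips).items():
    print(f'{ip}: wsrep_cluster_address="{addr}"')
```

Leaving node1 on an empty gcomm:// permanently is the risky part: a restart of that node would start a brand-new cluster instead of rejoining the existing one.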

To check whether the nodes are synced, log in to the mysql prompt on any node and run:
show status like 'wsrep%';
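
Of the many wsrep% rows, a handful usually tell the story. Here is a rough Python sketch of how one might read them, assuming the rows have already been fetched into a dict. The function name node_problems and the sample values are illustrative, not taken from this thread.

```python
def node_problems(status, expected_size):
    """Return a list of human-readable problems found in wsrep status rows.

    status        -- dict of SHOW STATUS LIKE 'wsrep%' name/value pairs
    expected_size -- number of nodes the cluster should have
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the Primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not in Synced state")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if int(status.get("wsrep_cluster_size", 0)) != expected_size:
        problems.append("cluster size differs from expected")
    return problems


# Illustrative values only.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
    "wsrep_cluster_size": "3",
}
print(node_problems(healthy, expected_size=3))   # []
```

A node reporting non-Primary for wsrep_cluster_status matches the "Received NON-PRIMARY" lines in the log above.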

Yeah, I didn't use pc.ignore_sb. I only mentioned it to be more explicit.

The thing is, I used to leave each node's own IP out of its gcomm:// list, like this:

node1 -> gcomm://
node2 -> gcomm://node1,node3
node3 -> gcomm://node1,node2

And yes, the nodes were synced with this configuration, as checked via show status like 'wsrep%';
I will give your config a try to see if anything changes.
I also upgraded to the latest stable release and the problem is the same.

dpkg -l | grep percona

ii percona-toolkit 2.2.6 all Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.7-721-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3.x 189.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database server binaries

The reason I suggested using IPs is that no DNS lookup is needed; if DNS fails, the nodes cannot see each other. The only thing you have to make sure of is that the IPs are static.
Also check for any firewall or other network issue that prevents these nodes from connecting to each other.
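
A quick way to rule out the network, as a Python sketch. Port 4567 is the group-communication port seen in the logs above; 3306 is the usual MySQL default and is assumed here, as are the helper names port_open and check_peers.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False


def check_peers(hosts, ports=(3306, 4567)):
    """Map each (host, port) pair to whether it is reachable from this node."""
    return {(h, p): port_open(h, p) for h in hosts for p in ports}
```

Run it from each node against the other two, for example check_peers(["192.168.10.1"]). As far as I know, the state-transfer ports (4444 for SST and 4568 for IST by default) also need to be allowed through the firewall, but they only listen while a transfer is in progress, so probing them on an idle node says little.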