Cluster down with 1/3 node down

Delard · February 5, 2014, 8:19am

Hi,

We installed and configured a cluster of 3 nodes. Synchronization works, but when I stop mysql on one node, the remaining nodes become desynchronized and stop accepting new connections.

==================== Configuration of galera: ====================
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_name="db_cluster"
wsrep_slave_threads=12
wsrep_certify_nonPK=1
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
wsrep_debug=0
wsrep_convert_LOCK_to_trx=0
wsrep_retry_autocommit=1
wsrep_auto_increment_control=1
wsrep_replicate_myisam=1
wsrep_drupal_282555_workaround=0
wsrep_causal_reads=0
wsrep_sst_method=rsync

server-id=3
wsrep_node_address=192.168.10.3
wsrep_cluster_address="gcomm://"
wsrep_provider_options="pc.weight=0; gcache.size=8G; evs.keepalive_period=PT3S; evs.inactive_check_period=PT10S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.consensus_timeout=PT1M; evs.send_window=1024; evs.user_send_window=512;"

===========================================================

Can you help us, please?

EDIT:

To add some information, here is the log from one of the desynchronised nodes (mysql still running):

2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,2) memb {
e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
} joined {
} left {
} partitioned {
fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
})
2014-02-05 16:02:05 19183 [Note] WSREP: view(view_id(NON_PRIM,e7516d17-8e6a-11e3-b85c-6a6eb0de5350,3) memb {
e7516d17-8e6a-11e3-b85c-6a6eb0de5350,0
} joined {
} left {
} partitioned {
fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9,0
})
2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
2014-02-05 16:02:05 19183 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 192992574)
2014-02-05 16:02:05 19183 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-02-05 16:02:05 19183 [Note] WSREP: Flow-control interval: [16, 16]
2014-02-05 16:02:05 19183 [Note] WSREP: Received NON-PRIMARY.
2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-02-05 16:02:05 19183 [Note] WSREP: New cluster view: global state: 03b25294-7b07-11e3-ac2e-362fc6d31d98:192992574, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
2014-02-05 16:02:05 19183 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-02-05 16:02:06 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.10.1:4567
2014-02-05 16:02:07 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 0
2014-02-05 16:02:52 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 30
2014-02-05 16:03:37 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 60
2014-02-05 16:04:22 19183 [Note] WSREP: (e7516d17-8e6a-11e3-b85c-6a6eb0de5350, 'tcp://0.0.0.0:4567') reconnecting to fc04cf52-8e6a-11e3-b0f9-93a4b1f2a1d9 (tcp://192.168.10.1:4567), attempt 90

So this node keeps trying to connect to a node which is down instead of staying in the cluster on its own.
To make it form a synchronised one-node cluster by itself, I have to issue:
mysql> set global wsrep_cluster_address="gcomm://";
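
For readers trying to reason about why a surviving node reports NON_PRIM, here is a minimal sketch of the weighted quorum rule as I understand it from the Galera documentation: a component stays Primary only if its total weight is strictly more than half of the previous Primary component's weight, not counting members that left cleanly. The function name and the node names are hypothetical. The thread shows pc.weight=0 for one node only, so the all-zero case below is an assumption about the other nodes, not something the thread confirms.

```python
def has_quorum(weights, left_gracefully, current_members):
    """Return True if the current component may stay Primary.

    weights: node name -> pc.weight for every member of the last Primary
             component (Galera's default weight is 1).
    left_gracefully: members of that component that shut down cleanly.
    current_members: members that can still see each other.
    """
    previous_total = sum(weights.values())
    left_total = sum(weights[n] for n in left_gracefully)
    current_total = sum(weights[n] for n in current_members)
    # Strict inequality: exactly half of the remaining weight is not enough.
    return (previous_total - left_total) / 2 < current_total


three_nodes = {"node1": 1, "node2": 1, "node3": 1}

# One node stopped cleanly, the other two still see each other.
clean_stop = has_quorum(three_nodes, ["node3"], ["node1", "node2"])

# Two-member component whose peer vanished without a clean leave
# (one member plus one partitioned peer, as in the view logged above).
lost_peer = has_quorum({"node1": 1, "node2": 1}, [], ["node1"])

# Hypothetical: every node configured with pc.weight=0.
zero_weights = has_quorum(
    {"node1": 0, "node2": 0, "node3": 0}, ["node3"], ["node1", "node2"]
)

print(clean_stop, lost_peer, zero_weights)
```

Under these assumptions only the first case keeps quorum. If the rule is as sketched, a cluster where every weight is zero could never satisfy the strict inequality, which may be worth checking against the pc.weight=0 setting in the configuration above.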

Delard · February 10, 2014, 6:06am

Up

To add some more information: I found a way to work around the problem by adding pc.ignore_sb = yes to wsrep_provider_options.

Does anybody have an idea about this, please?

madhusudan · February 10, 2014, 6:43am

Do not ignore split brain (pc.ignore_sb) unless it's an emergency.
How did you set up the cluster? Did you follow the standard procedure? Installing Percona XtraDB Cluster: http://www.percona.com/doc/percona-x...tallation.html

Try this:
Disable pc.ignore_sb by commenting it out.
Double-check the my.cnf configuration on all nodes, and set the gcomm values accordingly (replace node1, node2, node3 with their IPs).
node1 -> gcomm://
node2 -> gcomm://node1,node2,node3
node3 -> gcomm://node1,node2,node3

Then, after all nodes have synced, change the gcomm value of node1 to gcomm://node1,node2,node3 and restart mysql on node1.
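
The scheme above can be written down as a small sketch that builds the wsrep_cluster_address value for each node's first start. The helper name is hypothetical. The IPs 192.168.10.1 and 192.168.10.3 appear in the thread, while 192.168.10.2 and the choice of which IP counts as node1 are assumptions made for illustration.

```python
def cluster_addresses(node_ips):
    """Map each node IP to the wsrep_cluster_address used for the first start."""
    full = "gcomm://" + ",".join(node_ips)
    addresses = {ip: full for ip in node_ips}
    # The first node bootstraps the cluster with an empty address.
    addresses[node_ips[0]] = "gcomm://"
    return addresses


# 192.168.10.2 is an assumed address; the thread only shows .1 and .3.
ips = ["192.168.10.1", "192.168.10.2", "192.168.10.3"]
for ip, address in cluster_addresses(ips).items():
    print(f'{ip}: wsrep_cluster_address="{address}"')
```

Once all nodes are synced, the first node's value would be switched to the full list as well, as described above.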

To check whether the nodes are synced, log in to the mysql prompt of any node and enter this command:
show status like 'wsrep%';
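
As a rough illustration of which values to look at in that output, here is a minimal sketch that parses tab-separated status lines, such as those produced by running the query through `mysql --batch --skip-column-names`, and checks four status variables. The function names are hypothetical, and the sample text is made up for the example rather than taken from a real node.

```python
def parse_status(text):
    """Parse tab-separated name/value lines into a dict."""
    status = {}
    for line in text.splitlines():
        if "\t" in line:
            name, value = line.split("\t", 1)
            status[name.strip()] = value.strip()
    return status


def is_synced(status, expected_size):
    """True if the node is synced inside a Primary component of the expected size."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_size") == str(expected_size)
    )


# Illustrative sample only, not output captured from the cluster in this thread.
sample = (
    "wsrep_cluster_size\t3\n"
    "wsrep_cluster_status\tPrimary\n"
    "wsrep_local_state_comment\tSynced\n"
    "wsrep_ready\tON\n"
)
print(is_synced(parse_status(sample), expected_size=3))
```

A node in the state shown in the log earlier in the thread would report a wsrep_cluster_status other than Primary, so this check would return False for it.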

Delard · February 10, 2014, 9:13am

Yes, I am not using pc.ignore_sb. I only mentioned it to give more detail.

The thing is, I did not include each node's own IP in its gcomm:// address, like this:

node1 -> gcomm://
node2 -> gcomm://node1,node3
node3 -> gcomm://node1,node2

And yes, the nodes were synced with this configuration; I checked via show status like 'wsrep%';
I will try your config to see whether anything changes.
I also upgraded to the latest stable release and the problem is the same.

dpkg -l | grep percona

ii percona-toolkit 2.2.6 all Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.7-721-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3.x 189.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.15-25.3-711.wheezy amd64 Percona Server database server binaries

madhusudan · February 11, 2014, 1:19am

The reason I suggested using IPs is that no DNS lookup is needed; if DNS fails, the nodes cannot see each other. The only thing you have to make sure of is that the IPs are static.
Also check for any firewall or other network issue that is preventing these nodes from connecting to each other.
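
A quick way to test reachability from one node to another is a plain TCP connect to the relevant ports. The sketch below is one way to do that. Port 4567 is the group communication port seen in the log earlier in the thread; 3306, 4444 and 4568 are the usual MySQL, SST and IST defaults and are assumptions here, since the thread does not show them. The function names are hypothetical.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_peer(host, ports=(3306, 4444, 4567, 4568)):
    """Return a dict mapping each port to whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}


# Example call, to be run from another node of the cluster:
# print(check_peer("192.168.10.1"))
```

A False result only shows that nothing accepted the connection. It does not distinguish a firewall from a stopped service, and the SST and IST ports are normally open only while a transfer is in progress.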
