2 Node cluster locks db on 1 end, fails to reconnect automatically

Hello,

I have a 2 node Percona cluster (percona-xtradb-cluster-56, 5.6.26-25.12-1.wheezy).

I had an issue where db2 seems to have become unavailable due to some network issue.

While this was happening, db1 did not crash, but it completely locked down the database that freeradius was using. I guess this is normal behaviour.

During this time the database on db2 was accessible.

I have a feeling that there was no long network outage between the nodes, but rather that the auto-reconnect mechanism was failing: after I restarted db1 (just 4 minutes after the last auto-reconnect attempt), the cluster resynced and the databases on db1 became accessible again.
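
To tell which state each node believes it is in while this happens, the wsrep status variables could be compared on both ends. Below is a minimal sketch, assuming the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%' have already been fetched into a dict of strings by whatever client is at hand. The function name classify_node is made up for illustration; wsrep_cluster_status, wsrep_ready and wsrep_cluster_size are the standard Galera status variables.

```python
def classify_node(status):
    """Classify one node from its wsrep status variables.

    `status` maps variable names to string values, as returned by
    SHOW GLOBAL STATUS LIKE 'wsrep_%' (fetching them is left out here).
    classify_node is a hypothetical helper, not part of any library.
    """
    cluster_status = status.get("wsrep_cluster_status", "").lower()
    ready = status.get("wsrep_ready", "").upper() == "ON"
    size = int(status.get("wsrep_cluster_size", "0"))

    if cluster_status == "primary" and ready:
        # Node belongs to the primary component and accepts queries.
        return "primary, {} node(s) in component".format(size)
    if cluster_status == "non-primary":
        # Node is running but has no quorum, so it rejects queries.
        return "non-primary (no quorum)"
    # Anything else, for example "Disconnected" or a node still joining.
    return "not ready: cluster_status={!r}, ready={}".format(cluster_status, ready)


if __name__ == "__main__":
    db1 = {"wsrep_cluster_status": "non-Primary", "wsrep_ready": "OFF",
           "wsrep_cluster_size": "1"}
    print(classify_node(db1))
```

Running this against both nodes at the moment db1 locks up would show whether db1 really dropped to non-primary or was in some other state.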

My questions are:

1, How can I keep the db at least in read-only mode on both ends when the cluster splits? In my case it would be useful for radius to still be able to do authentication without updating information in the database.

2, Could this be caused by my setup using wsrep_sst_method=rsync instead of wsrep_sst_method=xtrabackup-v2?
I had no problem with this before.

3, How can I increase the reconnect retry value to something very high?

Logs:

http://pastebin.ca/3677566

The problem came up once again. It seems that the nodes lost connectivity and the memory usage and CPU load went up high on the second node, then it just suddenly came back; no restart was required this time. I had ping running from node1 -> node2 and had no packet loss at all, so it might not be a network issue.

2016-08-19 13:57:27 24658 [Note] WSREP: (5fab4bf1, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:

=====================================
2016-08-19 13:57:31 7f729615a700 INNODB MONITOR OUTPUT

Per second averages calculated from the last 4 seconds

BACKGROUND THREAD

srv_master_thread loops: 1977149 srv_active, 0 srv_shutdown, 6148 srv_idle
srv_master_thread log flush and writes: 1983165

SEMAPHORES

OS WAIT ARRAY INFO: reservation count 6052078
OS WAIT ARRAY INFO: signal count 6049543
Mutex spin waits 6425218, rounds 170770335, OS waits 5623549
RW-shared spins 329083, rounds 9871237, OS waits 328969
RW-excl spins 98958, rounds 2979655, OS waits 99116
Spin rounds per wait: 26.58 mutex, 30.00 RW-shared, 30.11 RW-excl

TRANSACTIONS

Trx id counter 66703779
Purge done for trx's n:o < 65349398 undo n:o < 0 state: running but idle
History list length 632497
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 66703520, not started

Nothing useful there; after reconnecting, it cleans up the transactions. Would an upgrade help with this? Since this cluster is in production, upgrading now is not that easy.

ii percona-xtrabackup 2.2.12-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-56 5.6.26-25.12-1.wheezy amd64 Percona XtraDB Cluster with Galera
ii percona-xtradb-cluster-client-5.6 5.6.26-25.12-1.wheezy amd64 Percona XtraDB Cluster database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.26-25.12-1.wheezy amd64 Percona XtraDB Cluster database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3 3.12.2-1.wheezy amd64 Metapackage for latest version of galera3.
ii percona-xtradb-cluster-galera-3.x 3.12.2-1.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.26-25.12-1.wheezy amd64 Percona XtraDB Cluster database server binaries

I have discovered that in my situation the second node becomes unavailable due to high CPU load/memory usage by mysql. It is not a network issue between the nodes. I had ping running for days between the 2 hosts and there was no packet loss at all.
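
Since load on the node looks like the trigger, one figure from the monitor output above that may be worth tracking over time is the History list length (632497 in that dump); a large value generally means purge is lagging behind. This is only a rough sketch of how it could be pulled out of saved SHOW ENGINE INNODB STATUS text. The helper name history_list_length is made up for illustration, and the sample lines are copied from the output above.

```python
import re


def history_list_length(monitor_text):
    """Return the 'History list length' value from InnoDB monitor text.

    Returns None if the line is not present. Hypothetical helper name.
    """
    match = re.search(r"^History list length (\d+)", monitor_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Lines copied from the monitor output posted above.
sample = """Trx id counter 66703779
History list length 632497
LIST OF TRANSACTIONS FOR EACH SESSION:"""

if __name__ == "__main__":
    print(history_list_length(sample))
```

Logging this value every minute or so, together with CPU and memory usage, might show whether it climbs before the node becomes unavailable.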