Network split caused a quorum number of members to be non-prim

bradh352 · October 27, 2018, 8:27pm

Using Percona-XtraDB-Cluster-server-57-5.7.23-31.31.2.el7.x86_64 in a WAN-based setup with 3 datacenters (jax, clt, atl). 2 datacenters (jax, clt) have 2 nodes each, the third datacenter (atl) just runs garbd. These dcs were chosen because they have low latency interconnects between them (<12ms).

Our jax DC had a significant failure, and since clt and atl were still online we expected them to keep quorum and continue. In the past this has been true, but somehow with this particular failure it was not. I can’t really tell why from the logs. But once jax came back online, everything came back up.

Would it help somehow if atl was a full database instance instead of garbd?

I’ve attached the logs for one clt node and the atl garbd.

Any insight into how to prevent this would be greatly appreciated!

log_clt1.txt (14.3 KB)

log_atl.txt (22.6 KB)

matthewb · October 30, 2018, 5:58pm

Hi Brad,
In reading the clt1.txt file, I can see that this node had issues communicating with other nodes. Look at lines 6-8 and 17-23: lots of timeouts. So this node partitioned itself off and declared itself non-primary. On line 131 you can see it reestablished connection to the other nodes and joined the cluster. I see similar timeout messages in the atl log as well, which confirms connectivity issues between atl and clt.
In the future, if you experience this “weird splitbrain”, you can manually force one side to be primary by running SET GLOBAL wsrep_provider_options='pc.bootstrap=true'; on one of the surviving nodes. That will bring you back online. When the network heals, the joining nodes will IST any changes from the survivors.

bradh352 · October 31, 2018, 12:29pm

Sorry, I guess I should have provided a subnet map:
10.30.30.0/24 = Jax
10.30.40.0/24 = Clt
10.30.50.0/24 = Atl

So the loss of connectivity to 10.30.30.11 (db1.p10jax) and 10.30.30.12 (db2.p10jax) was the Jax failure.

10.30.40.11 (db1.p10clt), 10.30.40.12 (db2.p10clt), and 10.30.50.11 (db1.p10atl - garbd) were all up and could talk to each other, so 3 out of 5 should have kept quorum.
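
As a quick sanity check of that arithmetic, here is a minimal Python sketch that maps each member address to its datacenter using the subnet map above and tests for a simple majority. It assumes every member carries equal weight (no pc.weight tuning), and the names SUBNETS, MEMBERS, datacenter_of, and has_majority are made up for illustration.

```python
import ipaddress

# Subnet map from this thread.
SUBNETS = {
    "jax": ipaddress.ip_network("10.30.30.0/24"),
    "clt": ipaddress.ip_network("10.30.40.0/24"),
    "atl": ipaddress.ip_network("10.30.50.0/24"),
}

# All five cluster members (four database nodes plus garbd).
MEMBERS = ["10.30.30.11", "10.30.30.12", "10.30.40.11", "10.30.40.12", "10.30.50.11"]


def datacenter_of(addr):
    """Return the datacenter name whose subnet contains addr, or None."""
    ip = ipaddress.ip_address(addr)
    for name, net in SUBNETS.items():
        if ip in net:
            return name
    return None


def has_majority(reachable, total):
    """Simple majority: strictly more than half of the members (equal weights assumed)."""
    return 2 * reachable > total


# Members that stay reachable when the whole jax datacenter drops out.
survivors = [m for m in MEMBERS if datacenter_of(m) != "jax"]
print(len(survivors), "of", len(MEMBERS), "->", has_majority(len(survivors), len(MEMBERS)))
```

By plain head count the surviving side has a majority, which is why the non-primary result was unexpected.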

matthewb · October 31, 2018, 2:25pm

Here’s what I see in the log and this is all I can tell you.

clt1

Oct 27 15:23:37 connection to peer 6569ade9 with addr tcp://10.30.30.12:4567 timed out
Oct 27 15:23:39 connection to peer 8cc5e0ca with addr tcp://10.30.30.11:4567 timed out
Oct 27 15:24:01 clt1 goes non-primary. It can only see itself.
Oct 27 15:24:01 declaring 10.30.40.12:4567 stable
clt1 now sees itself and 1 other node. Still non-primary.
Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
Connections reestablished to .12 and .11. It should have gone primary here with 3/5. But the log says "WSREP: Received NON-PRIMARY." So for some reason that I cannot determine from this log, the cluster decided to remain non-primary. Maybe the other nodes’ logs have more around this timestamp.
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 6569ade9 at tcp://10.30.30.12:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 8cc5e0ca at tcp://10.30.30.11:4567 stable
clt1 sees all other nodes, cluster goes PRIMARY.

The atl log looks almost the same. Loss of 10.30.30.* created a non-primary state from this node’s perspective. I’m wondering if message relaying was enabled and the loss of the nodes that were relaying created this non-primary status. Relaying turns on and off automatically: when node A cannot talk directly to node B, node A asks node C to relay messages for it. That’s my only theory at this time based on what I can read from these logs.

bradh352 · November 11, 2018, 10:29am

We do have gmcast.segment set differently in each datacenter (jax=1, clt=2, atl=3). It seemed like that was “recommended”, but given we’re not bandwidth constrained between datacenters, the additional complexity and possibility for bugs with message relaying makes it not worthwhile.

Would you suggest disabling this?

matthewb · November 11, 2018, 8:28pm

Can you send me all of your configs? (Just zip them up.) I’d like to recreate your setup and test that out. Segments are usually recommended with multi-datacenter setups to reduce WAN traffic, but if that isn’t an issue for you, sure, disable it. I’m curious how segments and message relay work together when a network outage occurs; this is what I’d like to test.

bradh352 · November 13, 2018, 9:10am

So weird, still not getting emails when I get replies, sorry for the long delay. I just changed my email address to my personal email in case my company server is blocking it for some reason.

I’ve attached the configs from each server.

pxc_conf.zip (13.2 KB)

lorraine.pocklington · November 19, 2018, 7:38am

Hi bradh352 just an aside - you might need to explicitly enable notification emails on your profile, due to GDPR we have them turned off by default I think.

bradh352 · November 19, 2018, 2:25pm

thanks, found the setting. hopefully I’ll get replies from now on.

matthewb · December 7, 2018, 8:08pm

Hi Brad,
I have not had a chance to simulate your issue yet. I’ve been traveling and training for Percona for the past month and I’m going on paternity leave soon. Interestingly, another PXC case that sounds similar to yours popped up in our internal chat and our lead PXC developer, Krunal, gave the following explanation for that other issue which I think also might answer yours.

Say I have a 3-node cluster, all working well.

One of the nodes (n2) isolates itself from the cluster. This updates the pc state to n1 and n3. The cluster is still operational.

Now n1 also isolates itself, thereby moving the complete cluster to non-primary (all nodes are now NON_PRIMARY).

Now if n2 rejoins (while n1 is still down), we would assume that since n2 + n3 can talk to each other they should be able to form a cluster.

This is where the twist is. The last known primary view was made of n1 and n3, and only n3 has recovered, so that is still only 50% quorum.

Once n1 rejoins, n1 + n2 + n3 can form primary.

Variation:

Say I have a 3-node cluster, all working well.

One of the nodes (n2) isolates itself from the cluster. This updates the pc state to n1 and n3.

Now n1 also isolates itself, thereby moving the complete cluster to non-primary.

Now if n1 rejoins (while n2 is still down), PRIMARY is formed because all components from the last saved PC state have recovered.
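
To make the two scenarios concrete, here is a rough Python sketch of the rule as I read it from the explanation above: quorum is counted against the members of the last primary view, not against the full cluster. This is a simplified model with equal weights, not Galera's actual implementation, and the function name can_form_primary is made up.

```python
def can_form_primary(last_primary_view, reachable):
    """Return True if the reachable nodes hold a strict majority of the
    last known primary view (simplified model, equal weights assumed)."""
    present = set(last_primary_view) & set(reachable)
    return 2 * len(present) > len(last_primary_view)


# n2 isolated itself first, so the last primary view shrank to {n1, n3}.
last_view = {"n1", "n3"}

# Scenario 1: n1 drops too, then n2 rejoins. Only n3 of the last view is present.
print(can_form_primary(last_view, {"n2", "n3"}))  # False: 1 of 2 is not a majority

# Variation: n1 rejoins while n2 is still down. Both last-view members are present.
print(can_form_primary(last_view, {"n1", "n3"}))  # True

# Once n1 is back alongside n2 and n3, primary can form as well.
print(can_form_primary(last_view, {"n1", "n2", "n3"}))  # True
```

The point of the model is that n2 rejoining does not help in scenario 1, because n2 was not part of the last primary view.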

bradh352 · December 11, 2018, 6:48pm

We’ve hired Percona for consulting and have linked in this thread. Yves Trudeau should hopefully be able to determine whether this is what occurred, as he is scheduled all day Friday. That said, only the 2 nodes in Jacksonville were having issues, so in theory there shouldn’t have been any glitches with the other 3 members of the cluster … But Jacksonville was flapping, so I wonder whether one node was elected as a donor during a flap and that somehow broke quorum when Jacksonville went back down before the sync completed.

bradh352 · December 30, 2018, 8:47am

I just wanted to report back on the conclusion of our consulting from Percona.

Yves suspects the issue might be that we are not specifying evs.install_timeout, which apparently needs to be greater than evs.inactive_timeout. The default value for the install timeout is 15s, but we increased inactive_timeout to 30s (from 15s) because we are in a WAN cluster, which inverted the two. That requirement comes from http://galeracluster.com/documentation-webpages/configurationtips.html

It is interesting, however, that MariaDB (not that we use MariaDB, but it too uses Galera) documents these settings, at default values, as inverted from the requirement:
wsrep_provider_options - MariaDB Knowledge Base
evs.inactive_timeout=PT30S
evs.install_timeout=PT15S

So if it is in fact the cause, it seems not to be a well understood configuration value.

That said, Yves was not able to reproduce in his test lab nor see something explicit in all the logs that would lead to an absolute conclusion about the cause.

We only hired a single day’s worth of consulting at Percona to research this issue, so I’m guessing our way forward is to address both possible causes. First, we are removing the gmcast.segment settings; second, we are setting evs.install_timeout above evs.inactive_timeout, which would follow the best practices from Galera.

I guess if we experience a similar issue in the future, we’ll need to reopen the case and do some more research …

bradh352 · January 22, 2019, 6:58am

And, that won’t work …

Jan 22 07:18:48 mysqld[27747]: WSREP: failed to open gcomm backend connection: 34: parameter ‘evs.install_timeout’ value PT35S is out of range [PT1S,PT30S): 34 (Numerical result out of range)#012#011 at gcomm/src/gcomm/conf.hpp:check_range():564

hmm, guess I got some bad advice
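
For anyone checking values before a restart, here is a small Python sketch that parses the PT<n>S style durations and checks them against the range reported in the error above, [PT1S,PT30S). It only handles the plain seconds form seen in this thread, and the helper names parse_seconds and install_timeout_in_range are made up; none of this is part of Galera.

```python
import re


def parse_seconds(value):
    """Parse a duration like 'PT35S' into seconds (only the PT<n>S form is handled)."""
    m = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", value)
    if m is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(m.group(1))


def install_timeout_in_range(value, low="PT1S", high="PT30S"):
    """Check value against the half-open range [low, high) from the error message."""
    return parse_seconds(low) <= parse_seconds(value) < parse_seconds(high)


print(install_timeout_in_range("PT35S"))  # False, matches the startup failure
print(install_timeout_in_range("PT15S"))  # True

# With evs.inactive_timeout=PT30S, look for any whole-second value that is
# both inside the allowed range and greater than inactive_timeout.
inactive = parse_seconds("PT30S")
print(any(install_timeout_in_range(f"PT{s}S") and s > inactive for s in range(1, 60)))  # False
```

So, at least on this version, with inactive_timeout at 30s there is no install_timeout value that is both accepted and larger than it.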
