Network split caused a quorum number of members to go non-primary

Using Percona-XtraDB-Cluster-server-57-5.7.23-31.31.2.el7.x86_64 in a WAN-based setup with 3 datacenters (jax, clt, atl). Two datacenters (jax, clt) have 2 nodes each; the third (atl) just runs garbd. These DCs were chosen because they have low-latency interconnects between them (<12ms).
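Roughly, the membership looks like this (an illustrative sketch only; the host names and cluster name below are placeholders, and other provider options are omitted):

# each PXC node lists all five members
wsrep_cluster_address = gcomm://jax-db1,jax-db2,clt-db1,clt-db2,atl-garbd

# atl runs only the Galera arbitrator (garbd) pointed at the same group
garbd --group pxc_cluster --address "gcomm://jax-db1,jax-db2,clt-db1,clt-db2" --daemon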

Our jax DC had a significant failure, and we expected that since clt and atl were still online they would retain quorum and continue. In the past this has been true, but somehow with this particular failure it was not, and I can't really tell why from the logs. Once jax came back online, everything came back up.

Would it help somehow if atl was a full database instance instead of garbd?

I’ve attached the logs for one clt node and the atl garbd.

Any insight into how to prevent this would be greatly appreciated!

log_clt1.txt (14.3 KB)

log_atl.txt (22.6 KB)

Hi Brad,
In reading the log_clt1.txt file, I can see that this node had issues communicating with the other nodes. Look at lines 6-8 and 17-23: lots of timeouts. So this node partitioned itself off and declared itself non-primary. On line 131 you can see it re-established connections to the other nodes and rejoined the cluster. I see similar timeout messages in the atl log as well, which confirms connectivity issues between atl and clt.
In the future, if you experience this kind of "weird split-brain", you can manually force one side to be primary by running SET GLOBAL wsrep_provider_options='pc.bootstrap=true'; on one of the surviving nodes. That will bring you back online. When the network heals, the joining nodes will IST any changes from the survivors.
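A minimal sketch of that sequence, run on a single surviving node (verify first that this is the side you want to keep, since the other side will have to rejoin and sync from it):

SHOW STATUS LIKE 'wsrep_cluster_status';   -- should currently report non-Primary
SET GLOBAL wsrep_provider_options='pc.bootstrap=true';
SHOW STATUS LIKE 'wsrep_cluster_status';   -- should now report Primary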

Sorry, I guess I should have provided a subnet map:
10.30.30.0/24 = Jax
10.30.40.0/24 = Clt
10.30.50.0/24 = Atl

So the loss of connectivity to 10.30.30.11 (db1.p10jax) and 10.30.30.12 (db2.p10jax) was the Jax failure.

10.30.40.11 (db1.p10clt), 10.30.40.12 (db2.p10clt), and 10.30.50.11 (db1.p10atl - garbd) were all up and could talk to each other, so 3 out of 5 should have kept quorum.

Here's what I see in the log, and this is all I can tell you.

clt1

Oct 27 15:23:37 connection to peer 6569ade9 with addr tcp://10.30.30.12:4567 timed out
Oct 27 15:23:39 connection to peer 8cc5e0ca with addr tcp://10.30.30.11:4567 timed out
Oct 27 15:24:01 clt1 goes non-primary. It can only see itself.
Oct 27 15:24:01 declaring 10.30.40.12:4567 stable
clt1 now sees itself and 1 other node; still non-primary.
Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
Connections re-established to .12 and .11. It should have gone primary here with 3/5 members, but the log says "WSREP: Received NON-PRIMARY." So for some reason that I cannot determine by looking at this log, the cluster decided to remain non-primary. Maybe the other nodes' logs have more detail around this timestamp.
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 6569ade9 at tcp://10.30.30.12:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 8cc5e0ca at tcp://10.30.30.11:4567 stable
clt1 sees the other nodes, and the cluster goes PRIMARY.

The atl log looks almost the same. Loss of 10.30.30.* created a non-primary state from this node's perspective. I'm wondering if message relaying was enabled and the loss of the nodes doing the relaying created this non-primary status. Relaying enables/disables automatically: when node A cannot talk directly to node B, node A asks node C to relay messages for it. That's my only theory at this time based on what I can read from these logs.

We do have gmcast.segment set differently in each datacenter (jax=1, clt=2, atl=3). It seemed like that was "recommended", but given we're not bandwidth-constrained between datacenters, the added complexity and the possibility of bugs with message relaying make it seem not worthwhile.
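For reference, the per-node setting looks roughly like this (an illustrative sketch of a clt node, not our exact config; the other provider options are omitted):

# in my.cnf on each node (and via -o for garbd), with the segment number per datacenter:
#   jax nodes = 1, clt nodes = 2, atl garbd = 3
wsrep_provider_options = "gmcast.segment=2"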

Would you suggest disabling this?

Can you send me all of your configs? (Just zip them up.) I'd like to recreate your setup and test this out. Segments are usually recommended with multiple datacenters to reduce WAN traffic, but if that isn't an issue for you, sure, disable it. I'm curious how segments and message relay interact when a network outage occurs; that is what I'd like to test.

So weird, I'm still not getting emails when I get replies; sorry for the long delay. I just changed my email address to my personal email in case my company server is blocking it for some reason.

I’ve attached the configs from each server.

pxc_conf.zip (13.2 KB)

Hi bradh352, just an aside: you might need to explicitly enable notification emails in your profile; due to GDPR we have them turned off by default, I think.

Thanks, found the setting. Hopefully I'll get replies from now on.

Hi Brad,
I have not had a chance to simulate your issue yet. I've been traveling and training for Percona for the past month, and I'm going on paternity leave soon. Interestingly, another PXC case that sounds similar to yours popped up in our internal chat, and our lead PXC developer, Krunal, gave the following explanation for that other issue, which I think might also answer yours.

We've hired Percona for consulting and have linked this thread into the case. Yves Trudeau should hopefully be able to determine whether this is what occurred, as he is scheduled for the full day Friday. That said, only the 2 nodes in Jacksonville were having issues, so in theory there shouldn't have been any glitches with the other 3 members of the cluster… But Jacksonville was flapping, so I'm not sure whether one of its nodes was elected as a donor during a flap and that somehow broke quorum when Jacksonville went back down before the sync completed.

I just wanted to report back on the conclusion of our consulting from Percona.

Yves suspects the issue might be the fact that we are not specifying evs.install_timeout, as apparently it needs to be greater than evs.inactive_timeout. The default value for install_timeout is 15s, but we increased inactive_timeout to 30s (from 15s) because we are in a WAN cluster, which inverted these. That requirement comes from http://galeracluster.com/documentation-webpages/configurationtips.html
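Concretely, the relevant part of our provider options today looks roughly like this (a sketch; the other options, including gmcast.segment, are omitted):

# evs.install_timeout is left at its default of PT15S, so after our WAN tuning
# it now sits below evs.inactive_timeout instead of above it
wsrep_provider_options = "evs.inactive_timeout=PT30S"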

It is interesting, however, that MariaDB (not that we use MariaDB, but it too uses Galera) documents these settings, at default values, as inverted from the requirement:
wsrep_provider_options - MariaDB Knowledge Base:
evs.inactive_timeout=PT30S
evs.install_timeout=PT15S

So if it is in fact the cause, it seems not to be a well-understood configuration requirement.

That said, Yves was not able to reproduce the issue in his test lab, nor see anything explicit in all the logs that would lead to an absolute conclusion about the cause.

We only hired a single day's worth of consulting at Percona to research this issue, so I'm guessing our way forward is to address both possible causes: first, removing the gmcast.segment settings, and second, setting evs.install_timeout above evs.inactive_timeout, which would follow the best practices from Galera.
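If I understand the recommendations correctly, the combined change would look something like this (a sketch of what we planned to push out; the rest of the provider options are omitted):

# segments dropped, install_timeout raised above our inactive_timeout of 30s
wsrep_provider_options = "evs.inactive_timeout=PT30S; evs.install_timeout=PT35S"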

I guess if we experience a similar issue in the future, we’ll need to reopen the case and do some more research …

And, that won’t work …

Jan 22 07:18:48 mysqld[27747]: WSREP: failed to open gcomm backend connection: 34: parameter 'evs.install_timeout' value PT35S is out of range [PT1S,PT30S): 34 (Numerical result out of range)#012#011 at gcomm/src/gcomm/conf.hpp:check_range():564

hmm, guess I got some bad advice