
Network split caused a quorum number of members to be non-prim

bradh352 Contributor
Using Percona-XtraDB-Cluster-server-57-5.7.23-31.31.2.el7.x86_64 in a WAN-based setup with three datacenters (jax, clt, atl). Two datacenters (jax, clt) have two nodes each; the third (atl) just runs garbd. These DCs were chosen because they have low-latency interconnects between them (<12ms).
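For reference, a minimal sketch of what the replication side of such a topology typically looks like. The cluster name, library path, and exact option layout here are assumptions for illustration; the IPs follow the subnet map given later in the thread:

```ini
# my.cnf fragment on each of the four mysqld nodes (illustrative values)
[mysqld]
wsrep_provider        = /usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_name    = p10   # assumed name
wsrep_cluster_address = gcomm://10.30.30.11,10.30.30.12,10.30.40.11,10.30.40.12,10.30.50.11
```

The arbitrator-only atl member would run something like the following; garbd joins as a full voting member for quorum purposes but stores no data:

```
garbd --group p10 --address "gcomm://10.30.30.11,10.30.40.11,10.30.40.12"
```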

Our jax DC had a significant failure, and we expected that since clt and atl were still online, the cluster would keep quorum and continue. In the past this has been true, but somehow with this particular failure it was not. I can't really tell why from the logs. But once jax came back online, everything came back up.

Would it help somehow if atl was a full database instance instead of garbd?

I've attached the logs for one clt node and the atl garbd.

Any insight into how to prevent this would be greatly appreciated!

Comments

  • matthewb Percona Staff
    Hi Brad,
    In reading the clt1.txt file, I can see that this node had issues communicating with other nodes. Look at lines 6-8 and 17-23: lots of timeouts. So this node partitioned itself off and declared itself non-primary. On line 131 you can see it reestablished connections to the other nodes and rejoined the cluster. I see similar timeout messages in the atl log as well, which confirms connectivity issues between atl and clt.
    In the future, if you experience this "weird split-brain", you can manually force one side to be primary by running SET GLOBAL wsrep_provider_options="pc.bootstrap=true;" on one of the surviving nodes. That will bring you back online. When the network heals, the joining nodes will IST any changes from the survivors.
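    A minimal sketch of that recovery sequence (pc.bootstrap is the documented wsrep provider option; the status checks are just for verification):

    ```sql
    -- On one surviving node that reports non-Primary:
    SHOW STATUS LIKE 'wsrep_cluster_status';   -- 'non-Primary' while partitioned

    -- Force this node's component to become the new primary component:
    SET GLOBAL wsrep_provider_options = 'pc.bootstrap=true';

    -- Verify the component recovered:
    SHOW STATUS LIKE 'wsrep_cluster_status';   -- should now show 'Primary'
    ```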
  • bradh352 Contributor
    Sorry, I guess I should have provided a subnet map:
    10.30.30.0/24 = Jax
    10.30.40.0/24 = Clt
    10.30.50.0/24 = Atl

    So the loss of connectivity to 10.30.30.11 (db1.p10jax) and 10.30.30.12 (db2.p10jax) was the Jax failure.

    10.30.40.11 (db1.p10clt), 10.30.40.12 (db2.p10clt), and 10.30.50.11 (db1.p10atl - garbd) were all up and could talk to each other, so 3 out of 5 should have kept quorum.
  • matthewb Percona Staff
    Here's what I see in the log and this is all I can tell you.

    clt1
    ---
    Oct 27 15:23:37 connection to peer 6569ade9 with addr tcp://10.30.30.12:4567 timed out
    Oct 27 15:23:39 connection to peer 8cc5e0ca with addr tcp://10.30.30.11:4567 timed out
    Oct 27 15:24:01 clt1 goes non-primary. It can only see itself.
    Oct 27 15:24:01 declaring 10.30.40.12:4567 stable
    clt1 now sees itself and one other node; still non-primary.
    Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
    Oct 27 15:24:07 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
    Connections reestablished to .12 and .11. It should have gone primary here (3/5), but the log says "WSREP: Received NON-PRIMARY." So for some reason that I cannot determine from this log, the cluster decided to remain non-primary. Maybe the other nodes' logs have more around this timestamp.
    Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 6569ade9 at tcp://10.30.30.12:4567 stable
    Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 677ffd76 at tcp://10.30.40.12:4567 stable
    Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 70217988 at tcp://10.30.50.11:4567 stable
    Oct 27 15:32:31 db1.p10clt mysqld[2339]: WSREP: declaring 8cc5e0ca at tcp://10.30.30.11:4567 stable
    clt1 sees the other nodes; the cluster goes PRIMARY.

    The atl log looks almost the same. Loss of 10.30.30.* created a non-primary state from this node's perspective. I'm wondering if relay messaging was enabled and the loss of the nodes that were relaying created this non-primary status. Relaying enables/disables automatically: when node A cannot talk directly to node B, node A requests that C relay messages for it. That's my only theory at this time based on what I can read from these logs.
  • bradh352 Contributor
    We do have gmcast.segment set differently in each datacenter (jax=1, clt=2, atl=3). It seemed like that was "recommended", but given we're not bandwidth-constrained between datacenters, the additional complexity and possibility of bugs with message relaying make it not worthwhile.

    Would you suggest disabling this?
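    For anyone following along, the segment assignment being discussed is set per node through the provider options; a sketch, assuming the values stated above (jax=1, clt=2, atl=3):

    ```ini
    # my.cnf fragment on a jax node; clt nodes would use gmcast.segment=2
    wsrep_provider_options = "gmcast.segment=1"
    ```

    garbd can take the same provider option through its -o/--options flag (e.g. -o "gmcast.segment=3"). Leaving every node at the default segment 0 should avoid segment-based relaying, though the transient relaying described above can still kick in when direct connectivity is lost.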
  • matthewb Percona Staff
    Can you send me all of your configs? (Just zip them up.) I'd like to recreate your setup and test that out. Segments are usually recommended in multi-datacenter setups to reduce WAN traffic, but if that isn't an issue for you, sure, disable it. I'm curious how segments and message relaying behave together when a network outage occurs; this is what I'd like to test.
  • bradh352 Contributor
    So weird, still not getting emails when I get replies; sorry for the long delay. I just changed my email address to my personal email in case my company server is blocking it for some reason.

    I've attached the configs from each server.
  • lorraine.pocklington Percona Community Manager
    Hi bradh352, just an aside: you might need to explicitly enable notification emails on your profile; due to GDPR we have them turned off by default, I think.
  • bradh352 Contributor
    Thanks, found the setting. Hopefully I'll get replies from now on.
  • matthewb Percona Staff
    Hi Brad,
    I have not had a chance to simulate your issue yet; I've been traveling and training for Percona for the past month, and I'm going on paternity leave soon. Interestingly, another PXC case that sounds similar to yours popped up in our internal chat, and our lead PXC developer, Krunal, gave the following explanation for that other issue, which I think might also answer yours.
    1. Say I have a 3-node cluster, all working well.
    2. One of the nodes (n2) isolates itself from the cluster. This updates the PC (primary component) state to {n1, n3}. The cluster is still operational.
    3. Now n1 also isolates itself, thereby moving the complete cluster to non-primary (all nodes are now NON_PRIMARY).
    4. Now if n2 rejoins (while n1 is still down), we would assume that since n2 and n3 can talk to each other, they should be able to form a cluster.
    5. This is where the twist is: the last known primary view was made up of n1 and n3, and only n3 has recovered, so we still have only 50% quorum.
    6. Once n1 joins back, n1 + n2 + n3 can form a primary component.

    Variation:
    1. Say I have a 3-node cluster, all working well.
    2. One of the nodes (n2) isolates itself from the cluster. This updates the PC state to {n1, n3}.
    3. Now n1 also isolates itself, thereby moving the complete cluster to non-primary.
    4. Now if n1 rejoins (while n2 is still down), PRIMARY is formed, because all members of the last saved PC state have recovered.
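    The rule above can be illustrated with a toy model (an illustration of the described quorum rule, not Galera's actual implementation): a partition may form a new primary component only if it holds a strict majority of the *last known primary view*, not of the configured cluster size.

    ```python
    # Toy model of the primary-component quorum rule described above.
    # Illustration only -- not Galera's actual code.

    def can_form_primary(last_primary_view, reachable_nodes):
        """True if the reachable set holds a strict majority of the
        members from the last known primary view."""
        survivors = last_primary_view & reachable_nodes
        return len(survivors) * 2 > len(last_primary_view)

    # Step 2: n2 drops out; the primary view shrinks to {n1, n3}.
    last_view = {"n1", "n3"}

    # Step 4: n2 rejoins while n1 is still down. n2 + n3 can talk, but
    # only n3 was in the last primary view -> 1 of 2 = 50%, no quorum.
    print(can_form_primary(last_view, {"n2", "n3"}))   # False

    # Variation step 4: n1 rejoins instead. Both members of the last
    # primary view are back -> quorum restored.
    print(can_form_primary(last_view, {"n1", "n3"}))   # True
    ```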
  • bradh352 Contributor
    We've hired Percona for consulting and have linked this thread in the case. Yves Trudeau should hopefully be able to determine whether this is what occurred; he is scheduled for all day Friday. That said, only the two nodes in Jacksonville were having issues, so in theory there shouldn't have been any glitches with the other three members of the cluster. But Jacksonville was flapping, so I'm not sure whether one of the survivors was elected as a donor during a flap, and whether that somehow broke quorum when Jacksonville went back down before the sync completed.
  • bradh352 Contributor
    I just wanted to report back on the conclusion of our consulting from Percona.

    Yves suspects the issue might be the fact that we are not specifying evs.install_timeout, as apparently it needs to be greater than evs.inactive_timeout. The default value for the install timeout is 15s, but we increased inactive_timeout to 30s (from 15s) because we are in a WAN cluster, which inverted the two. That requirement comes from http://galeracluster.com/documentation-webpages/configurationtips.html

    It is interesting, however, that MariaDB (not that we use MariaDB, but it too uses Galera) documents these settings, at default values, as inverted from the requirement:
    https://mariadb.com/kb/en/library/wsrep_provider_options/#evsinstall_timeout
    evs.inactive_timeout=PT30S
    evs.install_timeout=PT15S

    So if it is in fact the cause, it seems not to be a well-understood configuration value.

    That said, Yves was not able to reproduce in his test lab nor see something explicit in all the logs that would lead to an absolute conclusion about the cause.

    We only hired a single day's worth of consulting at Percona to research this issue, so I'm guessing our way forward is to address both possible causes: first, removing the gmcast.segment settings, and second, setting evs.install_timeout greater than evs.inactive_timeout, which would follow the best practices from Galera.

    I guess if we experience a similar issue in the future, we'll need to reopen the case and do some more research ...
  • bradh352 Contributor
    And, that won't work ...

    Jan 22 07:18:48 mysqld[27747]: WSREP: failed to open gcomm backend connection: 34: parameter 'evs.install_timeout' value PT35S is out of range [PT1S,PT30S): 34 (Numerical result out of range)#012#011 at gcomm/src/gcomm/conf.hpp:check_range():564

    hmm, guess I got some bad advice
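    Reading the range in that error, the upper bound (PT30S) matches our evs.inactive_timeout, which suggests the provider actually enforces evs.install_timeout strictly below evs.inactive_timeout, the opposite of the advice. Assuming that is the real constraint, a longer install window would have to come with a longer inactive timeout, e.g.:

    ```ini
    # Illustrative only: assumes install_timeout must stay strictly below
    # inactive_timeout, per the [PT1S,PT30S) range in the error above
    wsrep_provider_options = "evs.inactive_timeout=PT45S;evs.install_timeout=PT30S"
    ```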