PXC inconsistent view after network crash

Hi,
This is similar to: Network Parition results in two non-primary components. - Percona XtraDB Cluster 5.x - Percona Community Forum
We had a cluster outage too: some machines restarted, the nodes were able to start again and see each other, but they could not reach a primary state. We are running two PXC 5.7.26 nodes + garbd.

The logs look like the usual logs on every node except one, which shows this line (yes, the closing parenthesis is missing in the log itself):

2020-04-06T07:35:41.992687+01:00 0 [Warning] WSREP: node uuid: 366255e7 last_prim(type: 3, uuid: 06eac166) is inconsistent to restored view(type: V_NON_PRIM, uuid: 06eac166


See the attached error.log; we have these exact messages looping all the time.
Where does this warning come from?

  • I still have not found it in the source code
  • I can’t find any info on it, except the post I linked above, which has no resolution
  • I manually checked: every node has the exact same view (same view number, number of nodes, node IDs, who joined, who left and who is “partitioned”); see the status queries below
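(For reference, this is the kind of check I mean. The names below are the standard wsrep status variables, so they can be compared across the PXC nodes; garbd has no SQL interface, so its view has to be read from its own log.)

-- Run on each PXC node and diff the output:
-- wsrep_cluster_conf_id = view number, wsrep_cluster_size = number of nodes,
-- wsrep_cluster_status = Primary / non-Primary, wsrep_incoming_addresses = member list.
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_conf_id', 'wsrep_cluster_size', 'wsrep_cluster_status',
   'wsrep_cluster_state_uuid', 'wsrep_incoming_addresses', 'wsrep_local_state_comment');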

Also, this same node has a different value for “protocol” in this kind of log line:

2020-04-06T07:11:30.421833+01:00 1 [Note] WSREP: New cluster view: global state: 8c88b4da-c72f-11e4-b875-b742ccccfc53:76007269, view# -1: non-Primary, number of nodes: 3, my index: 0, protocol version -1

The other node has “protocol version 3”. What does this difference mean?


Thank you



error.log (1.32 KB)

Hi,
The warning comes from percona-xtradb-cluster-galera/gcomm/src/pc_proto.cpp, gcomm::pc::Proto::deliver_view() line 252.
Protocol version -1 is a consequence of the node not being joined to the cluster.
You could try to collect more debug info by setting the following in my.cnf on all nodes: 
wsrep_debug=1
wsrep_provider_options="debug=yes;other_options_here"
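For example, something like this should also work at runtime if you would rather not restart (assuming your version allows changing these dynamically; otherwise use the my.cnf settings above and restart):

-- Rough runtime equivalent of the my.cnf lines above;
-- "other_options_here" stands for whatever provider options you already use.
SET GLOBAL wsrep_debug = ON;
SET GLOBAL wsrep_provider_options = 'debug=yes';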

Hi,
Thanks a lot, I will continue from there and update my post if I find anything interesting.

For the sake of anyone searching for a similar issue, here is my understanding of what happened:

We had a VM outage (each node runs on its own VM). It turns out one PXC node (1) did not acknowledge the fact that garbd came back with another UUID. It kept the old UUID in its member list, and kept the state it had before the incident, hence PRIMARY (type 3).
The other PXC node (2) has a log line saying “WSREP: remote endpoint tcp://x.x.x.x:4567 changed identity 366255e7 -> 0b05842a”, while node (1) kept both UUIDs.

Node (1) would then say its view “was inconsistent”, because it had node 366255e7 stored with last_prim = PRIMARY, while every other node was saying NOT PRIMARY.
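(Side note, not strictly needed for the fix: each PXC node's own UUID can be checked live with the status variable below; the short IDs in the logs, like 366255e7, are just the first segment of that UUID. Garbd's UUID only appears in its own log.)

-- Each node's own group communication UUID:
SHOW GLOBAL STATUS LIKE 'wsrep_gcomm_uuid';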

The fun thing is that I could restart node (1) or garbd as much as I wanted: it still kept the old UUID, and even added one more each time garbd was restarted, while node (2) always acknowledged the identity changes.
I had to shut everything down and bootstrap the cluster again.
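(By “bootstrap again” I mean roughly the standard full-cluster restart procedure, nothing special: stop mysqld on both PXC nodes and stop garbd, pick the node with the highest committed seqno, and bootstrap from that one. The seqno can be read from grastate.dat once the nodes are stopped, or, while they are still running, from the status variable below.)

-- Compare across nodes before shutting down; bootstrap from the highest value.
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';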