PXC inconsistent view after network crash

Hi,
This is similar to: Network Parition results in two non-primary components. - Percona XtraDB Cluster 5.x - Percona Community Forum
We had a cluster outage too: some machines restarted, the nodes were able to start again and see each other, but they could not reach a primary state. We are running two PXC 5.7.26 nodes + garbd.

The logs look like the usual logs on every node except one, which shows this line (yes, the closing parenthesis is missing in the log itself):

2020-04-06T07:35:41.992687+01:00 0 [Warning] WSREP: node uuid: 366255e7 last_prim(type: 3, uuid: 06eac166) is inconsistent to restored view(type: V_NON_PRIM, uuid: 06eac166


See the attached error.log; we have these exact messages looping all the time.
Where does this warning come from?

  • I still have not found it in the source code
  • I can’t find any info on it, except the post I linked above, which has no resolution
  • I manually checked: every node has the exact same view (same view number, number of nodes, node IDs, who joined, who left and who is “partitioned”); see the status queries below
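(For reference, this is the kind of check I mean. The names below are the standard wsrep status variables, so they can be compared across the PXC nodes; garbd has no SQL interface, so its view has to be read from its own log.)

-- Run on each PXC node and diff the output:
-- wsrep_cluster_conf_id = view number, wsrep_cluster_size = number of nodes,
-- wsrep_cluster_status = Primary / non-Primary, wsrep_incoming_addresses = member list.
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_conf_id', 'wsrep_cluster_size', 'wsrep_cluster_status',
   'wsrep_cluster_state_uuid', 'wsrep_incoming_addresses', 'wsrep_local_state_comment');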

Also, this same node has a different value for “protocol” in this kind of log line:

2020-04-06T07:11:30.421833+01:00 1 [Note] WSREP: New cluster view: global state: 8c88b4da-c72f-11e4-b875-b742ccccfc53:76007269, view# -1: non-Primary, number of nodes: 3, my index: 0, protocol version -1

The other node has “protocol version 3”. What does this difference mean?


Thank you



error.log (1.32 KB)

Hi,
The warning comes from percona-xtradb-cluster-galera/gcomm/src/pc_proto.cpp, gcomm::pc::Proto::deliver_view() line 252.
Protocol version -1 is a consequence of the node not being joined to the cluster.
You could try to collect more debug info by setting the following in my.cnf on all nodes: 
wsrep_debug=1
wsrep_provider_options="debug=yes;other_options_here"
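For example, something like this should also work at runtime if you would rather not restart (assuming your version allows changing these dynamically; otherwise use the my.cnf settings above and restart):

-- Rough runtime equivalent of the my.cnf lines above;
-- "other_options_here" stands for whatever provider options you already use.
SET GLOBAL wsrep_debug = ON;
SET GLOBAL wsrep_provider_options = 'debug=yes';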

Hi,
Thanks a lot, I will continue from there and update my post if I find anything interesting.

For the sake of anyone searching for a similar issue, here is my understanding of what happened:

We had a VM outage (each node runs on its own VM). It turns out one PXC node (1) did not acknowledge the fact that garbd came back with another UUID. It kept the old UUID in its member list, and kept the state it had before the incident, hence PRIMARY (type 3).
The other PXC node (2) has a log line saying “WSREP: remote endpoint tcp://x.x.x.x:4567 changed identity 366255e7 -> 0b05842a”, while node (1) kept both UUIDs.

Node (1) would then say its view “was inconsistent”, because it had node 366255e7 stored with last_prim = PRIMARY, while every other node was saying NOT PRIMARY.
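(Side note, not strictly needed for the fix: each PXC node's own UUID can be checked live with the status variable below; the short IDs in the logs, like 366255e7, are just the first segment of that UUID. Garbd's UUID only appears in its own log.)

-- Each node's own group communication UUID:
SHOW GLOBAL STATUS LIKE 'wsrep_gcomm_uuid';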

The fun thing is that I could restart node (1) or garbd as much as I wanted: it still kept the old UUID, and even added one more each time garbd was restarted, while node (2) always acknowledged the identity changes.
I had to shut everything down and bootstrap the cluster again.
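(By “bootstrap again” I mean roughly the standard full-cluster restart procedure, nothing special: stop mysqld on both PXC nodes and stop garbd, pick the node with the highest committed seqno, and bootstrap from that one. The seqno can be read from grastate.dat once the nodes are stopped, or, while they are still running, from the status variable below.)

-- Compare across nodes before shutting down; bootstrap from the highest value.
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';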