WSREP unexpected view after some node disconnected

Hello,

recently I have noticed a strange wsrep behaviour. 2 out of 5 nodes was disconnected due network failure, but whole cluster switched to NON-PRIMARY.
My logs:
WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) reconnecting to 468be6f0 (tcp://192.168.10.74:4567), attempt 0
2019-03-22T09:41:24.881669+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) connection established to 01a25465 tcp://192.168.10.72:4567
2019-03-22T09:41:24.881833+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) connection established to 468be6f0 tcp://192.168.10.74:4567
2019-03-22T09:41:27.898065+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) turning message relay requesting off
2019-03-22T09:41:28.022054+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) connection to peer 468be6f0 with addr tcp://192.168.10.74:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2019-03-22T09:41:28.022191+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.10.74:4567
2019-03-22T09:41:29.377468+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) reconnecting to 01a25465 (tcp://192.168.10.72:4567), attempt 0
2019-03-22T09:41:29.377849+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) reconnecting to 468be6f0 (tcp://192.168.10.74:4567), attempt 0
2019-03-22T09:41:29.397998+01:00 0 [Note] WSREP: (9208fdf3, ‘tcp://0.0.0.0:4567’) connection established to 01a25465 tcp://192.168.10.72:4567
2019-03-22T09:41:29.886418+01:00 0 [Note] WSREP: declaring node with index 0 suspected, timeout PT5S (evs.suspect_timeout)
2019-03-22T09:41:29.886554+01:00 0 [Note] WSREP: declaring node with index 0 inactive (evs.inactive_timeout)
2019-03-22T09:41:29.886591+01:00 0 [Note] WSREP: declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2019-03-22T09:41:29.886611+01:00 0 [Note] WSREP: declaring node with index 2 inactive (evs.inactive_timeout)
2019-03-22T09:41:29.886630+01:00 0 [Note] WSREP: declaring node with index 3 suspected, timeout PT5S (evs.suspect_timeout)
2019-03-22T09:41:29.886678+01:00 0 [Note] WSREP: declaring node with index 3 inactive (evs.inactive_timeout)
2019-03-22T09:41:29.887246+01:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,01a25465,262)
memb {
26d76972,0
9208fdf3,0
}
joined {
}
left {
}
partitioned {
01a25465,0
3eda445d,1
468be6f0,0
}
)
2019-03-22T09:41:29.887371+01:00 0 [Note] WSREP: declaring 26d76972 at tcp://192.168.3.73:4567 stable
2019-03-22T09:41:29.887388+01:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2019-03-22T09:41:29.887483+01:00 0 [Note] WSREP: Flow-control interval: [141, 141]
2019-03-22T09:41:29.887507+01:00 0 [Note] WSREP: Trying to continue unpaused monitor
2019-03-22T09:41:29.887661+01:00 0 [Note] WSREP: Received NON-PRIMARY.
2019-03-22T09:41:29.887683+01:00 0 [Note] WSREP: Shifting SYNCED → OPEN (TO: 26437404)
2019-03-22T09:41:29.887930+01:00 4 [Note] WSREP: New cluster view: global state: 95ff7dd1-f262-11e8-b1df-563e1d9a7a37:26437404, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3
2019-03-22T09:41:29.887973+01:00 4 [Note] WSREP: Setting wsrep_ready to false
2019-03-22T09:41:29.887994+01:00 4 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-03-22T09:41:29.888044+01:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,26d76972,263)
memb {
26d76972,0
9208fdf3,0
}
joined {
}
left {
}
partitioned {
01a25465,0
3eda445d,1
468be6f0,0
}
)
2019-03-22T09:41:29.888141+01:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2019-03-22T09:41:29.888191+01:00 0 [Note] WSREP: Flow-control interval: [141, 141]
2019-03-22T09:41:29.888209+01:00 0 [Note] WSREP: Trying to continue unpaused monitor
2019-03-22T09:41:29.888222+01:00 0 [Note] WSREP: Received NON-PRIMARY.
2019-03-22T09:41:29.888290+01:00 4 [Note] WSREP: New cluster view: global state: 95ff7dd1-f262-11e8-b1df-563e1d9a7a37:26437404, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3
2019-03-22T09:41:29.888312+01:00 4 [Note] WSREP: Setting wsrep_ready to false
2019-03-22T09:41:29.888326+01:00 4 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-03-22T09:41:29.929301+01:00 0 [Note] WSREP: declaring 26d76972 at tcp://192.168.3.73:4567 stable
2019-03-22T09:41:29.929407+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:41:29.932434+01:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,26d76972,264)
memb {
26d76972,0
3eda445d,1
9208fdf3,0
}
joined {
}
left {
}
partitioned {
01a25465,0
468be6f0,0
}
)
2019-03-22T09:41:29.932607+01:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 2, memb_num = 3

I wonder why 3 nodes was declared inactive (index 0 2 and 3) but only two (192.168.10.72 and 192.168.10.74) has been lost?
How to determine which index (0 to 4) is related to which node id?
Why cluster switched to NON-PRIMARY state?

BR,
MS

It switched to non-primary because at first this node declared three others as unreachable:

2019-03-22T09:41:29.887246+01:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,01a25465,262)
memb {
26d76972,0
9208fdf3,0
}
joined {
}
left {
}
partitioned {
01a25465,0
3eda445d,1
468be6f0,0
}

The node 3eda445d was added later, and it will need additional consensus check to restore the primary component. Did this happen later?

Btw, you can see id - IP info in messages like these:

2019-03-22T09:41:29.377468+01:00 0 [Note] WSREP: (9208fdf3, 'tcp://0.0.0.0:4567') reconnecting to [B]01a25465[/B] (tcp://[B]192.168.10.72[/B]:4567), attempt 0

My biggest concern is why 3eda445d was declared unreachable but wasn’t “timeouted” earlier.

# zgrep 3eda445d mysqld.log.5 | grep WSREP:
2019-03-22T09:41:29.929407+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:41:35.728077+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:41:36.014915+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:43:31.545161+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:44:39.581167+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:44:50.523938+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:45:09.052422+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:45:36.014648+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T09:45:36.021599+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T11:47:48.840525+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable
2019-03-22T11:47:56.362276+01:00 0 [Note] WSREP: declaring 3eda445d at tcp://192.168.140.50:4567 stable

In fact any message about its reachability was a few days before:

2019-03-19T00:10:22.362488+01:00 0 [Note] WSREP: (9208fdf3, 'tcp://0.0.0.0:4567') reconnecting to 3eda445d (tcp://192.168.140.50:4567), attempt 90
2019-03-19T00:11:42.525537+01:00 0 [Note] WSREP: (9208fdf3, 'tcp://0.0.0.0:4567') connection established to 3eda445d tcp://192.168.140.50:4567

I had to restore primary component


SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'; 

I asked about index - id junction, not id - ip :slight_smile:

Anyway thank you very much for your answer.