Cluster of 4 fails when one node disconnects

Hello.
We have a cluster of 4 nodes:
(176.xx.xx.xx) UK_pri weight = 2
(176.yy.yy.yy) UK weight = 1
(209.xx.xx.xx) TOR_pri weight = 1
(209.yy.yy.yy) TOR weight = 1

So if the two sites get separated, we expect the UK site to keep quorum.
But the actual behavior is different, and we don’t understand it.
If TOR disconnects from UK, UK becomes non-primary until it resynchronizes with TOR.
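
To make the expectation concrete, here is a minimal sketch of the weighted-quorum check as described in the Galera documentation: a component stays primary if `(sum(p_i * w_i) - sum(l_i * w_i)) / 2 < sum(m_i * w_i)`, where `p` are the members of the last primary component, `l` the members that left gracefully, and `m` the members of the current component. The function name `has_quorum` is hypothetical, and the weights are taken from the list above on the assumption that they are each node's effective `pc.weight`.

```python
def has_quorum(weights, last_primary, current, left_gracefully=()):
    """Weighted-quorum check: (sum(p*w) - sum(l*w)) / 2 < sum(m*w)."""
    p = sum(weights[n] for n in last_primary)      # weight of last primary component
    l = sum(weights[n] for n in left_gracefully)   # weight of nodes that left cleanly
    m = sum(weights[n] for n in current)           # weight of the current component
    return (p - l) / 2 < m

# Weights as listed in the post (assumed to be each node's pc.weight).
WEIGHTS = {"UK_pri": 2, "UK": 1, "TOR_pri": 1, "TOR": 1}
ALL_NODES = list(WEIGHTS)

uk_site = has_quorum(WEIGHTS, ALL_NODES, ["UK_pri", "UK"])       # 2.5 < 3
tor_site = has_quorum(WEIGHTS, ALL_NODES, ["TOR_pri", "TOR"])    # 2.5 < 2
uk_pri_alone = has_quorum(WEIGHTS, ALL_NODES, ["UK_pri"])        # 2.5 < 2
print(uk_site, tor_site, uk_pri_alone)  # True False False
```

Under these assumptions, a clean UK/TOR split leaves the UK site with 3 of 5 weight points, so it should stay primary. A component that contains only UK_pri has 2 of 5 and would not pass the check, which is worth comparing with the member list in the NON_PRIM views in the log.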

Part of logs:

2021-03-10 11:42:24 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://209.xx.xx.xx:4567
2021-03-10 11:42:25 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') reconnecting to 7d7c739a (ssl://209.xx.xx.xx:4567), attempt 0
2021-03-10 11:42:28 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') connection to peer 00000000 with addr ssl://209.xx.xx.xx:4567 timed out, no messages seen in PT3S
2021-03-10 11:42:30 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://209.xx.xx.xx:42664 local endpoint ssl://176.xx.xx.xx:4567 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2021-03-10 11:42:30 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://209.xx.xx.xx:4567 local endpoint ssl://176.xx.xx.xx:42368 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2021-03-10 11:42:37 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') connection to peer 00000000 with addr ssl://209.xx.xx.xx:4567 timed out, no messages seen in PT3S
2021-03-10 11:42:40 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://209.xx.xx.xx:4567 local endpoint ssl://176.xx.xx.xx:42380 cipher: ECDHE-RSA-AES256-GCM-SHA384 compression: none
2021-03-10 11:42:40 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') connection established to 7d7c739a ssl://209.xx.xx.xx:4567
2021-03-10 11:42:41 0 [Note] WSREP: view(view_id(NON_PRIM,373b9849,299) memb {
373b9849,0
} joined {
} left {
} partitioned {
7d7c739a,0
94c5c8f1,0
f44e0fcd,0
})
2021-03-10 11:42:41 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2021-03-10 11:42:41 0 [Note] WSREP: view(view_id(NON_PRIM,373b9849,300) memb {
373b9849,0
} joined {
} left {
} partitioned {
7d7c739a,0
94c5c8f1,0
f44e0fcd,0
})
2021-03-10 11:42:41 0 [Note] WSREP: Writing down CC checksum: 9272b06d 0aba31d5 fbbd3478 59806d60 at offset 120
2021-03-10 11:42:41 0 [Note] WSREP: Flow-control interval: [16, 16]
2021-03-10 11:42:41 0 [Note] WSREP: Trying to continue unpaused monitor
2021-03-10 11:42:41 0 [Note] WSREP: Received NON-PRIMARY.
2021-03-10 11:42:41 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6753693)
2021-03-10 11:42:41 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2021-03-10 11:42:41 0 [Note] WSREP: Writing down CC checksum: 9272b06d 0aba31d5 fbbd3478 59806d60 at offset 120
2021-03-10 11:42:41 0 [Note] WSREP: Flow-control interval: [16, 16]
2021-03-10 11:42:41 0 [Note] WSREP: Trying to continue unpaused monitor
2021-03-10 11:42:41 0 [Note] WSREP: Received NON-PRIMARY.
2021-03-10 11:42:41 3 [Note] WSREP: ####### processing CC -1, local, ordered
2021-03-10 11:42:41 3 [Note] WSREP: ####### drain monitors upto 6753693
2021-03-10 11:42:41 3 [Note] WSREP: ####### My UUID: 373b9849-7a05-11eb-823a-fbe0b42518a3
2021-03-10 11:42:41 3 [Note] WSREP: ####### ST not required
2021-03-10 11:42:41 3 [Note] WSREP: ================================================
View:
id: 6749f221-0237-11ea-bed2-c3eba6a44a39:-1
status: non-primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 0
members(1):
0: 373b9849-7a05-11eb-823a-fbe0b42518a3, UK_pri
=================================================
2021-03-10 11:42:41 3 [Note] WSREP: Non-primary view
2021-03-10 11:42:41 3 [Note] WSREP: Server status change synced -> connected
2021-03-10 11:42:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-10 11:42:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-10 11:42:41 3 [Note] WSREP: ####### processing CC -1, local, ordered
2021-03-10 11:42:41 3 [Note] WSREP: ####### drain monitors upto 6753693
2021-03-10 11:42:41 3 [Note] WSREP: ####### My UUID: 373b9849-7a05-11eb-823a-fbe0b42518a3
2021-03-10 11:42:41 3 [Note] WSREP: ####### ST not required
2021-03-10 11:42:41 3 [Note] WSREP: ================================================
View:
id: 6749f221-0237-11ea-bed2-c3eba6a44a39:-1
status: non-primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 0
members(1):
0: 373b9849-7a05-11eb-823a-fbe0b42518a3, UK_pri
=================================================
2021-03-10 11:42:41 3 [Note] WSREP: Non-primary view
2021-03-10 11:42:41 3 [Note] WSREP: Server status change connected -> connected
2021-03-10 11:42:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-10 11:42:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-10 11:42:43 0 [Note] WSREP: (373b9849, 'ssl://0.0.0.0:4567') turning message relay requesting off
2021-03-10 11:42:43 0 [Note] WSREP: declaring 7d7c739a at ssl://209.xx.xx.xx:4567 stable
2021-03-10 11:42:43 0 [Note] WSREP: declaring 94c5c8f1 at ssl://209.yy.yy.yy:4567 stable
2021-03-10 11:42:43 0 [Note] WSREP: declaring f44e0fcd at ssl://176.yy.yy.yy:4567 stable
2021-03-10 11:42:45 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2021-03-10 11:42:45 0 [Note] WSREP: view(view_id(PRIM,373b9849,301) memb {
373b9849,0
7d7c739a,0
94c5c8f1,0
f44e0fcd,0
} joined {
} left {
} partitioned {
})
2021-03-10 11:42:45 0 [Note] WSREP: save pc into disk
2021-03-10 11:42:45 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 4
2021-03-10 11:42:45 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: baf7eebb-8195-11eb-94fd-ce21d9e1c737
2021-03-10 11:42:45 0 [Note] WSREP: STATE EXCHANGE: sent state msg: baf7eebb-8195-11eb-94fd-ce21d9e1c737
2021-03-10 11:42:45 0 [Note] WSREP: STATE EXCHANGE: got state msg: baf7eebb-8195-11eb-94fd-ce21d9e1c737 from 0 (UK_pri)
2021-03-10 11:42:45 0 [Note] WSREP: STATE EXCHANGE: got state msg: baf7eebb-8195-11eb-94fd-ce21d9e1c737 from 2 (TOR)
2021-03-10 11:42:45 0 [Note] WSREP: STATE EXCHANGE: got state msg: baf7eebb-8195-11eb-94fd-ce21d9e1c737 from 3 (UK)
2021-03-10 11:42:45 0 [Note] WSREP: STATE EXCHANGE: got state msg: baf7eebb-8195-11eb-94fd-ce21d9e1c737 from 1 (TOR_pri)
2021-03-10 11:42:45 0 [Warning] WSREP: Quorum: No node with complete state:

Version      : 5
Flags        : 0x3
Protocols    : 1 / 10 / 4
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : f2455373-814a-11eb-b0fd-b2828d3bd9a6
Prim  seqno  : 49
First seqno  : 6651656
Last  seqno  : 6753693
Commit cut   : 6753606
Last vote    : -1.0
Vote policy  : 0
Prim JOINED  : 4
State UUID   : baf7eebb-8195-11eb-94fd-ce21d9e1c737
Group UUID   : 6749f221-0237-11ea-bed2-c3eba6a44a39
Name         : 'UK_pri'
Incoming addr: 'AUTO'

Version      : 6
Flags        : 0x2
Protocols    : 2 / 10 / 4
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : f2455373-814a-11eb-b0fd-b2828d3bd9a6
Prim  seqno  : 49
First seqno  : 6651658
Last  seqno  : 6753693
Commit cut   : 6753621
Last vote    : -1.0
Vote policy  : 0
Prim JOINED  : 4
State UUID   : baf7eebb-8195-11eb-94fd-ce21d9e1c737
Group UUID   : 6749f221-0237-11ea-bed2-c3eba6a44a39
Name         : 'TOR_pri'
Incoming addr: 'AUTO'

Version      : 5
Flags        : 0x2
Protocols    : 1 / 10 / 4
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : f2455373-814a-11eb-b0fd-b2828d3bd9a6
Prim  seqno  : 49
First seqno  : 6651657
Last  seqno  : 6753693
Commit cut   : 6753678
Last vote    : -1.0
Vote policy  : 0
Prim JOINED  : 4
State UUID   : baf7eebb-8195-11eb-94fd-ce21d9e1c737
Group UUID   : 6749f221-0237-11ea-bed2-c3eba6a44a39
Name         : 'TOR'
Incoming addr: 'AUTO'

Version      : 5
Flags        : 0x2
Protocols    : 1 / 10 / 4
State        : NON-PRIMARY
Desync count : 0
Prim state   : SYNCED
Prim UUID    : f2455373-814a-11eb-b0fd-b2828d3bd9a6
Prim  seqno  : 49
First seqno  : 6651657
Last  seqno  : 6753693
Commit cut   : 6753665
Last vote    : -1.0
Vote policy  : 0
Prim JOINED  : 4
State UUID   : baf7eebb-8195-11eb-94fd-ce21d9e1c737
Group UUID   : 6749f221-0237-11ea-bed2-c3eba6a44a39
Name         : 'UK'
Incoming addr: 'AUTO'

2021-03-10 11:42:45 0 [Note] WSREP: Full re-merge of primary f2455373-814a-11eb-b0fd-b2828d3bd9a6 found: 4 of 4.
2021-03-10 11:42:45 0 [Note] WSREP: Quorum results:
version = 5,
component = PRIMARY,
conf_id = 49,
members = 4/4 (joined/total),
act_id = 6753693,
last_appl. = 6753606,
protocols = 1/10/4 (gcs/repl/appl),
vote policy= 0,
group UUID = 6749f221-0237-11ea-bed2-c3eba6a44a39
2021-03-10 11:42:45 0 [Note] WSREP: Writing down CC checksum: 96f85630 3414fafd eb4697dc a9cb2460 at offset 296
2021-03-10 11:42:45 0 [Note] WSREP: Flow-control interval: [32, 32]
2021-03-10 11:42:45 0 [Note] WSREP: Trying to continue unpaused monitor
2021-03-10 11:42:45 0 [Note] WSREP: Restored state OPEN -> SYNCED (6753694)
2021-03-10 11:42:45 3 [Note] WSREP: ####### processing CC 6753694, local, ordered
2021-03-10 11:42:45 3 [Note] WSREP: ####### drain monitors upto 6753693
2021-03-10 11:42:45 3 [Note] WSREP: REPL Protocols: 10 (5, 3)
2021-03-10 11:42:45 3 [Note] WSREP: ####### My UUID: 373b9849-7a05-11eb-823a-fbe0b42518a3
2021-03-10 11:42:45 3 [Note] WSREP: ####### ST not required
2021-03-10 11:42:45 3 [Note] WSREP: Skipping cert index reset
2021-03-10 11:42:45 3 [Note] WSREP: ####### Adjusting cert position: 6753693 -> 6753694
2021-03-10 11:42:45 0 [Note] WSREP: Service thread queue flushed.
2021-03-10 11:42:45 3 [Note] WSREP: ####### Setting monitor position to 6753694
2021-03-10 11:42:45 3 [Note] WSREP: Server UK_pri synced with group
2021-03-10 11:42:45 3 [Note] WSREP: Server status change connected -> synced
2021-03-10 11:42:45 3 [Note] WSREP: Synchronized with group, ready for connections
2021-03-10 11:42:45 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-10 11:42:45 3 [Note] WSREP: Lowest cert indnex boundary for CC from group: 6753607
2021-03-10 11:42:45 3 [Note] WSREP: Min available from gcache for CC from group: 6651656
2021-03-10 11:42:45 3 [Note] WSREP: ================================================
View:
id: 6749f221-0237-11ea-bed2-c3eba6a44a39:6753694
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 0
members(4):
0: 373b9849-7a05-11eb-823a-fbe0b42518a3, UK_pri
1: 7d7c739a-7a04-11eb-b72c-170dc2a798d0, TOR_pri
2: 94c5c8f1-7a05-11eb-be6b-b243bf48f624, TOR
3: f44e0fcd-7a04-11eb-b371-e3c58ee7bdb0, UK

We still have timers to tune for the WAN environment.
But I still don’t understand why the whole cluster had to break apart and resynchronize. During this time it is unavailable. Is there something we are missing?

Thanks for any advice.

I’d like to see what happens during a controlled shutdown. If you gracefully stop TOR_pri, what happens to the cluster? Then, if you shut down TOR as well, what happens?