Percona XtraDB Cluster: master node down

Hi Percona Team,

I am running a 4-node Percona XtraDB Cluster (A = o6node87, B = o6node85, C = o6node84, D = o6node86). Node A recently went down with the error messages below. Could you please help me determine whether the failure was caused by a network error or by a database failure during the IST transfer?

Please note the MySQL versions on all 4 nodes:
root@o6node87 ~ # date; mysql --version
Tue 21 Jan 2025 10:04:11 AM CET
mysql Ver 14.14 Distrib 5.7.43-47, for debian-linux-gnu (x86_64) using 8.0

root@o6node84 ~ # date; mysql --version
Tue 21 Jan 2025 10:04:40 AM CET
mysql Ver 14.14 Distrib 5.7.35-38, for debian-linux-gnu (x86_64) using 8.0

root@o6node85 ~ # date; mysql --version
Tue 21 Jan 2025 10:04:45 AM CET
mysql Ver 14.14 Distrib 5.7.35-38, for debian-linux-gnu (x86_64) using 8.0

root@o6node86 ~ # date; mysql --version
Tue 21 Jan 2025 10:04:52 AM CET
mysql Ver 14.14 Distrib 5.7.44-48, for debian-linux-gnu (x86_64) using 8.0
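
In case it helps when comparing the nodes, here is a minimal Python sketch that pulls the Percona release out of each `mysql --version` banner above and counts how many distinct releases the cluster is running. The helper name `distrib_version` and the `banners` dictionary are only for illustration; the banner strings are copied from the output above.

```python
import re

# Matches the "Distrib 5.7.43-47" part of a mysql --version banner.
DISTRIB = re.compile(r"Distrib (\d+)\.(\d+)\.(\d+)-(\d+)")


def distrib_version(banner):
    """Return the release as a tuple of ints, or None if it is not found."""
    match = DISTRIB.search(banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# Banners copied from the four hosts listed above.
banners = {
    "o6node87": "mysql Ver 14.14 Distrib 5.7.43-47, for debian-linux-gnu (x86_64) using 8.0",
    "o6node84": "mysql Ver 14.14 Distrib 5.7.35-38, for debian-linux-gnu (x86_64) using 8.0",
    "o6node85": "mysql Ver 14.14 Distrib 5.7.35-38, for debian-linux-gnu (x86_64) using 8.0",
    "o6node86": "mysql Ver 14.14 Distrib 5.7.44-48, for debian-linux-gnu (x86_64) using 8.0",
}

versions = {host: distrib_version(text) for host, text in banners.items()}
distinct = sorted(set(versions.values()))

for host, version in sorted(versions.items()):
    print(host, version)
print("distinct releases:", len(distinct))
```

With the banners above, this shows that the four nodes are on three different 5.7 releases.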

I have attached the log files for all 4 nodes for your reference. Kindly review them and assist me in determining the root cause of Node A’s failure.

Node A:
All cluster nodes were stable till 7:46:07 AM

2025-01-20T07:46:07.954759Z 0 [Note] WSREP: declaring 393253ce at tcp://10.0.1.84:4567 stable
2025-01-20T07:46:07.954775Z 0 [Note] WSREP: declaring a88b2d35 at tcp://10.0.1.85:4567 stable
2025-01-20T07:46:07.954782Z 0 [Note] WSREP: declaring f36e38d5 at tcp://10.0.1.86:4567 stable
2025-01-20T07:46:07.955258Z 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2025-01-20T07:46:07.955917Z 0 [Note] WSREP: Current view of cluster as seen by this node

Node A Failure time: 07:46:08 AM

2025-01-20T07:46:08.965930Z 0 [Note] WSREP: Member 0.0 (o6node84) requested state transfer from 'any'. Selected 1.0 (o6node85)(SYNCED) as donor.
2025-01-20T07:46:08.965939Z 0 [Note] WSREP: Member 2.0 (o6node87) requested state transfer from 'any'. Selected 3.0 (o6node86)(SYNCED) as donor.
2025-01-20T07:46:08.965942Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3098538301)
2025-01-20T07:46:08.965986Z 9 [Note] WSREP: Requesting state transfer: success after 2 tries, donor: 3
2025-01-20T07:46:08.965998Z 9 [Note] WSREP: GCache history reset: d668a121-632f-11ec-a0b8-0b5f2060536a:3098538174 -> d668a121-632f-11ec-a0b8-0b5f2060536a:3098538301
2025-01-20T07:46:08.967106Z 0 [Warning] WSREP: 3.0 (o6node86): State transfer to 2.0 (o6node87) failed: -61 (No data available)
2025-01-20T07:46:08.967113Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():800: State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation). Restart required.
2025-01-20T07:46:08.967118Z 0 [Note] WSREP: gcomm: terminating thread
2025-01-20T07:46:08.967121Z 0 [Note] WSREP: gcomm: joining thread
2025-01-20T07:46:08.967428Z 0 [Note] WSREP: gcomm: closing backend
2025-01-20T07:46:08.973333Z 9 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2025-01-20T07:46:08.974104Z 9 [Note] WSREP: Receiving IST: 127 writesets, seqnos 3098538174-3098538301
2025-01-20T07:46:09.968583Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,393253ce,152)
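
As a quick sanity check on the IST range that Node A reported, the Python sketch below parses the "Receiving IST" line and compares the writeset count with the seqno gap. The helper name `ist_range` is hypothetical, and the sketch only checks arithmetic on the logged numbers; how Galera treats the range endpoints internally is not something I have verified.

```python
import re

# Matches e.g. "Receiving IST: 127 writesets, seqnos 3098538174-3098538301"
IST_LINE = re.compile(r"Receiving IST: (\d+) writesets, seqnos (\d+)-(\d+)")


def ist_range(line):
    """Return (count, first_seqno, last_seqno) or None if the line does not match."""
    match = IST_LINE.search(line)
    if match is None:
        return None
    count, first, last = (int(group) for group in match.groups())
    return count, first, last


# Line copied from the Node A excerpt above.
line = ("2025-01-20T07:46:08.974104Z 9 [Note] WSREP: Receiving IST: "
        "127 writesets, seqnos 3098538174-3098538301")

count, first, last = ist_range(line)
print("writesets:", count)
print("seqno gap:", last - first)
print("consistent:", last - first == count)
```

The gap matches the reported count, so the requested IST covered only a small range of writesets.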

Node B:
All cluster nodes were stable till 7:46:07 AM

2025-01-20T07:46:07.338853Z 22172278 [Note] Got timeout reading communication packets
2025-01-20T07:46:07.954682Z 0 [Note] WSREP: declaring 393253ce at tcp://10.0.1.84:4567 stable
2025-01-20T07:46:07.954694Z 0 [Note] WSREP: declaring ab6d8263 at tcp://10.0.1.87:4567 stable
2025-01-20T07:46:07.954697Z 0 [Note] WSREP: declaring f36e38d5 at tcp://10.0.1.86:4567 stable
2025-01-20T07:46:07.955199Z 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2025-01-20T07:46:07.955903Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,393253ce,152)

Node A failure detected at 7:46:09 AM

2025-01-20T07:46:08.967100Z 0 [Warning] WSREP: 3.0 (o6node86): State transfer to 2.0 (o6node87) failed: -61 (No data available)
2025-01-20T07:46:08.967656Z 0 [Note] WSREP: 1.0 (o6node85): State transfer to 0.0 (o6node84) complete.
2025-01-20T07:46:08.967665Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 3098538301)
2025-01-20T07:46:08.984633Z 0 [Note] WSREP: async IST sender served
2025-01-20T07:46:09.452248Z 0 [Note] WSREP: (a88b2d35, 'tcp://0.0.0.0:4567') turning message relay requesting off
2025-01-20T07:46:09.968972Z 0 [Note] WSREP: (a88b2d35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.1.87:4567
2025-01-20T07:46:09.970000Z 0 [Note] WSREP: declaring 393253ce at tcp://10.0.1.84:4567 stable
2025-01-20T07:46:09.970009Z 0 [Note] WSREP: declaring f36e38d5 at tcp://10.0.1.86:4567 stable
2025-01-20T07:46:09.970013Z 0 [Note] WSREP: forgetting ab6d8263 (tcp://10.0.1.87:4567)
2025-01-20T07:46:09.970028Z 0 [Note] WSREP: (a88b2d35, 'tcp://0.0.0.0:4567') turning message relay requesting off

Node C:
All cluster nodes were stable till 7:46:07 AM

2025-01-20T07:46:07.954773Z 0 [Note] WSREP: declaring a88b2d35 at tcp://10.0.1.85:4567 stable
2025-01-20T07:46:07.954784Z 0 [Note] WSREP: declaring ab6d8263 at tcp://10.0.1.87:4567 stable
2025-01-20T07:46:07.954787Z 0 [Note] WSREP: declaring f36e38d5 at tcp://10.0.1.86:4567 stable
2025-01-20T07:46:07.955320Z 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2025-01-20T07:46:07.955914Z 0 [Note] WSREP: Current view of cluster as seen by this node

Node A failure detected at 7:46:09 AM

2025-01-20T07:46:08.967083Z 0 [Warning] WSREP: 3.0 (o6node86): State transfer to 2.0 (o6node87) failed: -61 (No data available)
2025-01-20T07:46:08.967665Z 0 [Note] WSREP: 1.0 (o6node85): State transfer to 0.0 (o6node84) complete.
2025-01-20T07:46:08.980816Z 6 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2025-01-20T07:46:08.981040Z 6 [Note] WSREP: Receiving IST: 127 writesets, seqnos 3098538174-3098538301
2025-01-20T07:46:08.981109Z 0 [Note] WSREP: Receiving IST... 0.0% ( 0/127 events) complete.
2025-01-20T07:46:08.984395Z 0 [Note] WSREP: Receiving IST...100.0% (127/127 events) complete.
2025-01-20T07:46:08.984510Z 6 [Note] WSREP: IST received: d668a121-632f-11ec-a0b8-0b5f2060536a:3098538301
2025-01-20T07:46:09.083411Z 0 [Note] WSREP: (393253ce, 'tcp://0.0.0.0:4567') turning message relay requesting off
2025-01-20T07:46:09.969284Z 0 [Note] WSREP: (393253ce, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.1.87:4567
2025-01-20T07:46:09.969571Z 0 [Note] WSREP: declaring a88b2d35 at tcp://10.0.1.85:4567 stable
2025-01-20T07:46:09.969601Z 0 [Note] WSREP: declaring f36e38d5 at tcp://10.0.1.86:4567 stable
2025-01-20T07:46:09.969616Z 0 [Note] WSREP: forgetting ab6d8263 (tcp://10.0.1.87:4567)
2025-01-20T07:46:09.969614Z 0 [Note] WSREP: Member 3.0 (o6node86) synced with group.
2025-01-20T07:46:09.969664Z 0 [Note] WSREP: (393253ce, 'tcp://0.0.0.0:4567') turning message relay requesting off

Node D:
All cluster nodes were stable till 7:46:07 AM

2025-01-20T07:46:07.954732Z 0 [Note] WSREP: declaring 393253ce at tcp://10.0.1.84:4567 stable
2025-01-20T07:46:07.954744Z 0 [Note] WSREP: declaring a88b2d35 at tcp://10.0.1.85:4567 stable
2025-01-20T07:46:07.954747Z 0 [Note] WSREP: declaring ab6d8263 at tcp://10.0.1.87:4567 stable
2025-01-20T07:46:07.955195Z 0 [Note] WSREP: re-bootstrapping prim from partitioned components
2025-01-20T07:46:07.955858Z 0 [Note] WSREP: Current view of cluster as seen by this node

Node A failure detected at 7:46:09 AM

2025-01-20T07:46:08.967176Z 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 3098538301)
2025-01-20T07:46:08.967723Z 0 [Note] WSREP: 1.0 (o6node85): State transfer to 0.0 (o6node84) complete.
2025-01-20T07:46:09.605775Z 0 [Note] WSREP: (f36e38d5, 'tcp://0.0.0.0:4567') turning message relay requesting off
2025-01-20T07:46:09.968786Z 0 [Note] WSREP: (f36e38d5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.1.87:4567
2025-01-20T07:46:09.970004Z 0 [Note] WSREP: declaring 393253ce at tcp://10.0.1.84:4567 stable
2025-01-20T07:46:09.970013Z 0 [Note] WSREP: declaring a88b2d35 at tcp://10.0.1.85:4567 stable
2025-01-20T07:46:09.970017Z 0 [Note] WSREP: forgetting ab6d8263 (tcp://10.0.1.87:4567)
2025-01-20T07:46:09.970030Z 0 [Note] WSREP: Member 3.0 (o6node86) synced with group.
2025-01-20T07:46:09.970041Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3098538301)
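
To line the four excerpts up against each other, a merged timeline may be easier to read. Below is a minimal Python sketch, assuming every log entry starts with the UTC timestamp format shown above. The function name `merge_timelines` and the node labels are illustrative, and the three sample lines are copied from the Node A, Node B and Node D excerpts (with the arrow written as `->`).

```python
import re
from datetime import datetime

# Timestamp prefix used by the error log, e.g. 2025-01-20T07:46:08.967118Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6})Z\s+(.*)$")


def merge_timelines(logs):
    """Merge {label: lines} into one list of (time, label, message), oldest first."""
    events = []
    for label, lines in logs.items():
        for line in lines:
            match = TIMESTAMP.match(line.strip())
            if match is None:
                continue  # skip continuation lines that carry no timestamp
            when = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((when, label, match.group(2)))
    events.sort(key=lambda event: event[0])
    return events


logs = {
    "A": ["2025-01-20T07:46:08.967118Z 0 [Note] WSREP: gcomm: terminating thread"],
    "B": ["2025-01-20T07:46:08.967656Z 0 [Note] WSREP: 1.0 (o6node85): "
          "State transfer to 0.0 (o6node84) complete."],
    "D": ["2025-01-20T07:46:08.967176Z 0 [Note] WSREP: Shifting DONOR/DESYNCED "
          "-> JOINED (TO: 3098538301)"],
}

timeline = merge_timelines(logs)
for when, label, message in timeline:
    print(when.time(), "Node", label, message)
```

Timestamps come from each node's own clock, so ordering at the microsecond level is only as reliable as the clock synchronisation between the hosts.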

Thank you in advance for your assistance!

Thanks and Regards,
Rahul Ambekar

Hi Percona Team,

Adding more details from the error log on Node A:

2025-01-20T07:45:23.448205Z 61931434 [Note] WSREP: Victim thread:
THD: 61931434, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: replace oxseohistory ( oxobjectid, oxident, oxshopid, oxlang, oxinsert ) select oxobjectid, MD5( LOWER( oxseourl ) ), oxshopid, oxlang, now() from oxseo where oxtype ='oxarticle' and oxobjectid = 'd4772b81f2d480e898d9c9f5417b95a3' and oxshopid = '7' and oxlang = 7 and oxexpired = '1'

2025-01-20T07:45:23.448258Z 61931430 [Warning] WSREP: Send action {(nil), 432, TORDERED} returned -107 (Transport endpoint is not connected)
2025-01-20T07:45:23.448301Z 61931430 [Note] WSREP: --------- CONFLICT DETECTED --------
2025-01-20T07:45:23.448303Z 61931440 [Warning] WSREP: Send action {(nil), 432, TORDERED} returned -107 (Transport endpoint is not connected)
2025-01-20T07:45:23.448308Z 61931430 [Note] WSREP: cluster conflict due to certification failure for threads:

2025-01-20T07:45:23.448319Z 61931440 [Note] WSREP: --------- CONFLICT DETECTED --------
2025-01-20T07:45:23.448326Z 61931440 [Note] WSREP: cluster conflict due to certification failure for threads:

2025-01-20T07:45:23.448331Z 61931430 [Note] WSREP: Victim thread:
THD: 61931430, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: INSERT INTO oxcache ( oxid, oxexpire, oxreseton, oxsize, oxhits, oxshopid ) VALUES( '46dc1a8e1337bce67eb0e6b1c7287d88', '1737395122', 'ox|cid=95fed06b9a5ecccdc4ed4531f808adfc|cl=alist', '432998', '0', '24' ) ON DUPLICATE KEY UPDATE oxexpire = '1737395122', oxreseton = 'ox|cid=95fed06b9a5ecccdc4ed4531f808adfc|cl=alist', oxsize = '432998', oxhits = '0', oxshopid = '24'

wsrep_provider_options on Node A (base_host = 10.0.1.87):

mysql> show global variables like 'wsrep_provider_options' \G
*************************** 1. row ***************************
Variable_name: wsrep_provider_options
Value: base_dir = /var/lib/mysql/; base_host = 10.0.1.87; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 4; evs.version = 0; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.0.1.87; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 9; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto;
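
Since the value above is one long line, here is a minimal Python sketch that splits such a string into individual options so the settings most relevant to this question (GCache size and the EVS/gmcast timeouts) can be read at a glance. The function name `parse_provider_options` is hypothetical, and the `sample` string is a shortened copy of a few options from the output above, not the full value.

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a {name: value} dict."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # trailing ';' leaves an empty piece
        name, _, setting = item.partition("=")
        options[name.strip()] = setting.strip()
    return options


# Shortened copy of the Node A value shown above.
sample = (
    "evs.inactive_timeout = PT15S; evs.suspect_timeout = PT5S; "
    "gcache.size = 128M; gcomm.thread_prio = ; "
    "gmcast.peer_timeout = PT3S; pc.recovery = true;"
)

options = parse_provider_options(sample)
for name in ("gcache.size", "evs.suspect_timeout",
             "evs.inactive_timeout", "gmcast.peer_timeout"):
    print(name, "=", options[name])
```

The same function can be fed the full `Value:` string from each node to compare their settings side by side.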