Mysqld got signal 11 right after “Initiating SST cancellation” — race in wsrep 26.1.4.3?

We’re hitting intermittent crashes on one node when an SST cancellation is triggered:

[WSREP] Initiating SST cancellation
mysqld got signal 11 (Thread pointer: 0x0)

  • Crashes occur intermittently during or shortly after SST: sometimes while the SST is still
    running (JOINER state), sometimes a few minutes after the node becomes SYNCED.
    In all cases the crash follows the same pattern: “Initiating SST cancellation” immediately
    followed by signal 11.

  • The affected node is always a joiner, never a donor.

  • Other nodes (donors) remain stable; wsrep_cluster_conf_id is identical across the cluster.

  • Seen on PXC 8.0.43-34.1, 8.4.5, and 8.4.6-6, all running
    wsrep_provider_version = 26.1.4.3 (Galera 4.23 cb05b32).

Looks like a race between Galera’s SST cancel signal and mysqld’s SST cleanup.
Is this a known issue in wsrep 26.1.4.x / Galera 4.23? Was it fixed in 26.1.5 or later?

Hey there,
Have you checked https://jira.percona.com for any similar reports? If you don’t find anything related, please open a new bug report and provide a repeatable test case, full configuration files, and a step-by-step description of how you created the environment.
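For the configuration part, the wsrep-related section of my.cnf from the crashing node is the most useful piece; something along these lines (all values below are placeholders, not a recommendation):

[mysqld]
# Placeholder values only -- replace with the node's actual settings when attaching them to the report
wsrep_provider              = /usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_name          = example_cluster
wsrep_cluster_address       = gcomm://10.0.0.1,10.0.0.2,10.0.0.3
wsrep_node_name             = node1
wsrep_node_address          = 10.0.0.1
wsrep_sst_method            = xtrabackup-v2
wsrep_provider_options      = "gcache.size=2G; evs.suspect_timeout=PT5S"
pxc_strict_mode             = ENFORCING
pxc-encrypt-cluster-traffic = ON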

Hi Aleksander_Lutsenko!

Have you checked system logs?

Do you have a stack trace?

Having a reproducible test case will help us troubleshoot and fix the bug (if there is one).
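If the crash left a core dump, something like the following will get us a fully resolved backtrace (paths and the core file name are examples; the matching PXC debuginfo/dbgsym package for your exact version needs to be installed first):

# Example only: pull a full backtrace from a mysqld core dump
# (requires the matching percona-xtradb-cluster debuginfo/dbgsym package)

# If systemd-coredump is collecting cores:
coredumpctl list mysqld          # locate the crash entry
coredumpctl gdb mysqld           # open the latest core in gdb, then: thread apply all bt full

# If the core is a plain file on disk (example path):
gdb /usr/sbin/mysqld /var/lib/mysql/core.12345 -batch -ex "thread apply all bt full"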

Regards

2025-10-31T13:20:59.612343Z 12 [Note] [MY-000000] [Galera] ================================================
View:
  id: e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723:9743914
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 1
  members(3):
        0: 2e3e8117-b62c-11f0-9799-6fa776bf02e9, node3
        1: 6263c463-b62b-11f0-b94c-4276b90ac0ba, node1
        2: ff877df9-b62b-11f0-9f7b-ff64e72ca667, node2
=================================================
2025-10-31T13:20:59.612406Z 0 [Note] [MY-000000] [Galera] Receiving IST... 100.0% (61/61 events) complete.
2025-10-31T13:20:59.612417Z 12 [Note] [MY-000000] [WSREP] Server status change initialized -> joined
2025-10-31T13:20:59.612473Z 12 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-10-31T13:20:59.612525Z 12 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-10-31T13:20:59.613362Z 1 [Note] [MY-000000] [Galera] Draining apply monitors after IST up to 9743914
2025-10-31T13:20:59.615022Z 1 [Note] [MY-000000] [Galera] IST received: e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723:9743914
2025-10-31T13:20:59.615216Z 1 [Note] [MY-000000] [Galera] Recording CC from sst: 9743914
2025-10-31T13:20:59.615240Z 1 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from sst: 9743854
2025-10-31T13:20:59.615252Z 1 [Note] [MY-000000] [Galera] Min available from gcache for CC from sst: 9409337
2025-10-31T13:20:59.615997Z 0 [Note] [MY-000000] [Galera] 1.0 (node1): State transfer from 2.0 (node2) complete.
2025-10-31T13:20:59.616045Z 0 [Note] [MY-000000] [Galera] SST leaving flow control
2025-10-31T13:20:59.616075Z 0 [Note] [MY-000000] [Galera] Shifting JOINER -> JOINED (TO: 9743914)
2025-10-31T13:20:59.616202Z 0 [Note] [MY-000000] [Galera] Processing event queue:... -nan% (0/0 events) complete.
2025-10-31T13:20:59.616660Z 0 [Note] [MY-000000] [Galera] Member 1.0 (node1) synced with group.
2025-10-31T13:20:59.616704Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 100.0% (1/1 events) complete.
2025-10-31T13:20:59.616735Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 9743914)
2025-10-31T13:20:59.616787Z 1 [Note] [MY-000000] [Galera] Server node1 synced with group
2025-10-31T13:20:59.616807Z 1 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-10-31T13:20:59.616819Z 1 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-10-31T13:20:59.616827Z 1 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-10-31T13:20:59.794547Z 52 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'
2025-11-02T11:54:21.529104Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2025-11-02T11:54:21Z UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=b8278ace60e0a9944a074017a851f520a2e4b698
Server Version: 8.0.43-34.1 Percona XtraDB Cluster (GPL), Release rel34, Revision 0682ba7, WSREP version 26.1.4.3, wsrep_26.1.4.3

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Log of wsrep recovery (--wsrep-recover):
 INFO: WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery_verbose.WsZnJR' --pid-file='/var/lib/mysql/node-01-recover.pid'
 INFO: WSREP: Recovered position e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723:10051630

This is the most interesting part: the SST cancellation happens two days (check the timestamps) after the node has synced.
We’re in the process of gathering all the information. The problem is that this issue is sporadic and we don’t have a test case to reproduce it.
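In the meantime, one way to make sure the next occurrence leaves something to analyze is to enable core dumps on the node; a rough sketch (the service name and core location are examples, adjust for the actual setup):

# Sketch: let mysqld write a core dump on the next crash (example values)

# 1. In my.cnf, under [mysqld], enable core dumps:
#       core-file

# 2. Raise the core size limit for the service (unit name may be mysql or mysqld):
systemctl edit mysql             # add:  [Service]
                                 #       LimitCORE=infinity
systemctl restart mysql

# 3. Point the kernel at a directory with enough free space (example path):
sysctl -w kernel.core_pattern=/var/crash/core.%e.%p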

Everything was fine for almost two days, and then an SST scenario crashed this node. What is in the logs on the other two nodes for the same time frame?

Logs from other nodes:

Node2:

2025-10-31T13:20:55.446335Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 9743914)
2025-10-31T13:20:55.446395Z 25 [Note] [MY-000000] [Galera] Server node2 synced with group
2025-10-31T13:20:55.446459Z 25 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-10-31T13:20:55.446485Z 25 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-10-31T13:20:55.446572Z 25 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-10-31T13:20:59.613226Z 0 [Note] [MY-000000] [Galera] IST sender finished waiting for connection close
2025-10-31T13:20:59.613707Z 0 [Note] [MY-000000] [Galera] async IST sender served
2025-10-31T13:20:59.616468Z 0 [Note] [MY-000000] [Galera] 1.0 (node1): State transfer from 2.0 (node2) complete.
2025-10-31T13:20:59.617139Z 0 [Note] [MY-000000] [Galera] Member 1.0 (node1) synced with group.
2025-11-02T11:54:22.131770Z 0 [Note] [MY-000000] [Galera] (ff877df9-9f7b, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.215.95.111:4567
2025-11-02T11:54:23.623741Z 0 [Note] [MY-000000] [Galera] (ff877df9-9f7b, 'ssl://0.0.0.0:4567') reconnecting to 6263c463-b94c (ssl://10.215.95.111:4567), attempt 0
2025-11-02T11:54:23.624324Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:25.124592Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:26.625077Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-11-02T11:54:26.625176Z 0 [Note] [MY-000000] [Galera] evs::proto(ff877df9-9f7b, OPERATIONAL, view_id(REG,2e3e8117-9799,11)) suspecting node: 6263c463-b94c
2025-11-02T11:54:26.625208Z 0 [Note] [MY-000000] [Galera] evs::proto(ff877df9-9f7b, OPERATIONAL, view_id(REG,2e3e8117-9799,11)) suspected node without join message, declaring inactive
2025-11-02T11:54:26.625286Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:27.125413Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
2025-11-02T11:54:27.626022Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:27.626689Z 0 [Note] [MY-000000] [Galera] declaring 2e3e8117-9799 at ssl://10.215.95.113:4567 stable
2025-11-02T11:54:27.627081Z 0 [Note] [MY-000000] [Galera] Node 2e3e8117-9799 state primary
2025-11-02T11:54:27.628698Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,2e3e8117-9799,12)
memb {
        2e3e8117-9799,0
        ff877df9-9f7b,0
        }
joined {
        }
left {
        }
partitioned {
        6263c463-b94c,0
        }
)
2025-11-02T11:54:27.628775Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2025-11-02T11:54:27.629398Z 0 [Note] [MY-000000] [Galera] forgetting 6263c463-b94c (ssl://10.215.95.111:4567)
2025-11-02T11:54:27.629448Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2025-11-02T11:54:27.629473Z 0 [Note] [MY-000000] [Galera] (ff877df9-9f7b, 'ssl://0.0.0.0:4567') turning message relay requesting off
2025-11-02T11:54:27.629527Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2025-11-02T11:54:27.629944Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: af139ef4-b7e2-11f0-beab-fa509beab973
2025-11-02T11:54:27.630172Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: af139ef4-b7e2-11f0-beab-fa509beab973 from 0 (node3)
2025-11-02T11:54:27.630224Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: af139ef4-b7e2-11f0-beab-fa509beab973 from 1 (node2)
2025-11-02T11:54:27.630253Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 2/2 (primary/total),
        act_id     = 10051641,
        last_appl. = 10051525,
        protocols  = 5/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723

And node3:

2025-10-31T13:20:51.611580Z 25 [Note] [MY-000000] [Galera] ================================================
View:
  id: e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723:9743914
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
        0: 2e3e8117-b62c-11f0-9799-6fa776bf02e9, node3
        1: 6263c463-b62b-11f0-b94c-4276b90ac0ba, node1
        2: ff877df9-b62b-11f0-9f7b-ff64e72ca667, node2
=================================================
2025-10-31T13:20:51.611629Z 25 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-10-31T13:20:51.613742Z 25 [Note] [MY-000000] [Galera] Recording CC from group: 9743914
2025-10-31T13:20:51.613776Z 25 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9743854
2025-10-31T13:20:51.613790Z 25 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9409337
2025-10-31T13:20:53.215837Z 0 [Note] [MY-000000] [Galera] Member 1.0 (node1) requested state transfer from '*any*'. Selected 2.0 (node2)(SYNCED) as donor.
2025-10-31T13:20:53.663358Z 0 [Note] [MY-000000] [Galera] (2e3e8117-9799, 'ssl://0.0.0.0:4567') turning message relay requesting off
2025-10-31T13:20:55.444899Z 0 [Note] [MY-000000] [Galera] 2.0 (node2): State transfer to 1.0 (node1) complete.
2025-10-31T13:20:55.445411Z 0 [Note] [MY-000000] [Galera] Member 2.0 (node2) synced with group.
2025-10-31T13:20:59.615596Z 0 [Note] [MY-000000] [Galera] 1.0 (node1): State transfer from 2.0 (node2) complete.
2025-10-31T13:20:59.616237Z 0 [Note] [MY-000000] [Galera] Member 1.0 (node1) synced with group.
TRANSACTION 3003742672, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
MySQL thread id 32, OS thread handle 140495565592256, query id 1298224 wsrep: writing rows
TRANSACTION 3003742670, ACTIVE (PREPARED) 0 sec committing
, undo log entries 2
MySQL thread id 24, OS thread handle 140496102463168, query id 1298222 innobase_commit_low (9816332)
2025-11-02T11:54:22.130989Z 0 [Note] [MY-000000] [Galera] (2e3e8117-9799, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.215.95.111:4567
2025-11-02T11:54:23.271521Z 0 [Note] [MY-000000] [Galera] (2e3e8117-9799, 'ssl://0.0.0.0:4567') reconnecting to 6263c463-b94c (ssl://10.215.95.111:4567), attempt 0
2025-11-02T11:54:23.272141Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:24.772340Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:26.272999Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-11-02T11:54:27.125099Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-11-02T11:54:27.125193Z 0 [Note] [MY-000000] [Galera] evs::proto(2e3e8117-9799, GATHER, view_id(REG,2e3e8117-9799,11)) suspecting node: 6263c463-b94c
2025-11-02T11:54:27.125224Z 0 [Note] [MY-000000] [Galera] evs::proto(2e3e8117-9799, GATHER, view_id(REG,2e3e8117-9799,11)) suspected node without join message, declaring inactive
2025-11-02T11:54:27.625511Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
2025-11-02T11:54:27.625861Z 0 [Note] [MY-000000] [Galera] declaring ff877df9-9f7b at ssl://10.215.95.112:4567 stable
2025-11-02T11:54:27.626313Z 0 [Note] [MY-000000] [Galera] Node 2e3e8117-9799 state primary
2025-11-02T11:54:27.626647Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,2e3e8117-9799,12)
memb {
        2e3e8117-9799,0
        ff877df9-9f7b,0
        }
joined {
        }
left {
        }
partitioned {
        6263c463-b94c,0
        }
)
2025-11-02T11:54:27.626689Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2025-11-02T11:54:27.627461Z 0 [Note] [MY-000000] [Galera] forgetting 6263c463-b94c (ssl://10.215.95.111:4567)
2025-11-02T11:54:27.627513Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2025-11-02T11:54:27.627553Z 0 [Note] [MY-000000] [Galera] (2e3e8117-9799, 'ssl://0.0.0.0:4567') turning message relay requesting off
2025-11-02T11:54:27.627712Z 0 [Note] [MY-000000] [Galera] STATE_EXCHANGE: sent state UUID: af139ef4-b7e2-11f0-beab-fa509beab973
2025-11-02T11:54:27.629059Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: af139ef4-b7e2-11f0-beab-fa509beab973
2025-11-02T11:54:27.629391Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: af139ef4-b7e2-11f0-beab-fa509beab973 from 0 (node3)
2025-11-02T11:54:27.629436Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: af139ef4-b7e2-11f0-beab-fa509beab973 from 1 (node2)
2025-11-02T11:54:27.629467Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 2/2 (primary/total),
        act_id     = 10051641,
        last_appl. = 10051525,
        protocols  = 5/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = e2daaf86-a9d1-11f0-adf4-9f5cd9c4b723

Something happened to the node at 10.215.95.111 around 2025-11-02T11:54:22: both node2 and node3 turned message relay requesting on because they lost contact with .111 (node1?).
If that is node1, then for some reason there was an SST issue, but there shouldn’t be any SST happening during normal operation.

Do you have an SST-based backup job using garbd, or other scripts that could be interfering with node1’s operation?
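A few generic checks that would rule that out (commands are examples, adjust paths to your distro):

# Generic checks for anything that could be triggering SST-like activity on node1
pgrep -a garbd                          # is a garbd arbitrator/backup process running?
systemctl list-timers --all             # scheduled systemd timers (backup jobs, etc.)
crontab -l; ls /etc/cron.d/             # cron entries that might touch mysqld or its datadir
grep -ri "wsrep_sst\|garbd" /var/log/ 2>/dev/null | tail    # traces of SST scripts in system logs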

No, I don’t think we have any interfering jobs. And by the way, node1 is the PRIMARY (writer) node; node2 and node3 are read_only.

UPD: for backups we have a dedicated async replica.
UPD2: my mistake, node1 was marked as suspected after it crashed, so that part is expected.