We’re running PXC Operator v1.19.0 with PXC 8.0.42-33.1 on Kubernetes (3-node cluster, ~580 GB dataset, zstd SST compression). When a node needs SST, it fails consistently after ~8 minutes of streaming. This happened 6+ times in a row.
What happens:
SST streams successfully at 15-20 MiB/s for ~8 minutes (~8 GB transferred), then the socat SSL connection between donor and joiner breaks. The donor detects the broken pipe, and the joiner gets partitioned from the cluster and crashes with signal 11.
Donor logs (pxc-1) showing the failure chain:
14:24:08 socat[195727] E I/O error
14:24:08 donor: => Rate:[4.51MiB/s] Avg:[19.4MiB/s] Elapsed:0:07:45
14:24:09 FATAL ERROR: xtrabackup was not able to send data to the Joiner node.
14:24:09 Within the last 120 seconds (defined by the sst-idle-timeout variable),
the SST process on the donor (this node) has not sent any data to the joiner.
This error could be caused by broken network connectivity between
the donor (this node) and the joiner.
14:24:09 xtrabackup finished with error: 1
In a subsequent attempt, the donor also reported:
14:32:39 xtrabackup_copy_datafile() failed
14:32:39 failed to copy datafile ./stress_test_db/snapshot_data.ibd
14:32:39 failed to copy datafile ./stress_test_db/taskchain_data.ibd
14:32:39 Process completed with error: wsrep_sst_xtrabackup-v2 … : 22 (Invalid argument)
14:32:39 SST sending failed: -22
14:32:39 State transfer to mysqlcluster-pxc-2 failed: Invalid argument
Joiner logs (pxc-2) showing the crash:
14:22:32 joiner: => Rate:[18.4MiB/s] Avg:[21.9MiB/s] Elapsed:0:04:10 Bytes: 5.39GiB ← streaming fine
14:24:09 State transfer to mysqlcluster-pxc-2 failed: Invalid argument
14:24:09 gcs_group.cpp:gcs_group_handle_join_msg():1334: Will never receive state. Need to abort.
14:24:09 mysqld: Terminated.
14:24:09 mysqld got signal 11
/lib64/libc.so.6(abort+0x178)
/usr/lib64/galera4/libgalera_smm.so(+0x45b09)
14:24:09 Cluster view: NON_PRIM, both other nodes in “partitioned”
Pattern across all attempts:
Attempt 1: Streamed ~8.6 GB in ~9 min, heavy write load (51KB rows), socat I/O error → crash
Attempt 2: Streamed ~8.6 GB in ~9 min, heavy write load, same failure
Attempt 3: Streamed ~8.6 GB in ~9 min, heavy write load, same failure
Attempt 4: Streamed ~8 GB in ~10 min, no write load, same failure
Attempt 5: Streamed ~7.8 GB in ~8 min, no write load, same failure
Attempt 6: Streamed ~8.6 GB in ~8 min, no write load, same failure
The failure happens with and without active write load on the cluster. An earlier SST with a smaller dataset (7.24 GB after compression) succeeded in 15 minutes, so the difference may be timing-related or related to the data volume.
SST configuration:
[xtrabackup]
compress=zstd
[sst]
xbstream-opts=--decompress
wsrep_provider_options="gcache.size=2G; gcache.recover=yes"
We have not set sst-idle-timeout (using the default 120s). The "sst-idle-timeout" error message from the donor appears to be a consequence of the socat break, not the root cause: the socat connection dies first, then the idle timeout detects no data flow.
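For completeness, both knobs live in the node's my.cnf: sst-idle-timeout is read from the [sst] section, and the SST script also accepts a sockopt value that is appended to socat's address string, which allows enabling TCP keepalives. A sketch of what we could try (the keepalive idea is our assumption to test, not a confirmed fix; the exact sockopt syntax should be checked against the wsrep_sst_xtrabackup-v2 version in use):

```ini
[sst]
# Default is 120 s; raising it only papers over the symptom if the
# socket is actually being reset, but it rules out a pure idle timeout.
sst-idle-timeout=300
# Appended to socat's address string (note the leading comma).
# Enables TCP keepalives so a quiet-but-healthy stream is not dropped
# by an intermediary; the option names are socat's Linux socket options.
sockopt=",keepalive,keepidle=60,keepintvl=10,keepcnt=5"
```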
Environment: PXC 8.0.42-33.1, Operator v1.19.0, 3-node PXC cluster on a 4-node Kubernetes cluster, Linkerd service mesh (port 4444 is in opaque-ports; ports 4567/4568 are in skip-inbound/outbound-ports), zstd SST compression.
Questions:
- What causes the socat SSL connection to break after ~8 minutes? Is there a known issue with socat timeouts or buffer limits during large SST transfers?
- Could Linkerd’s proxy (even with port 4444 marked as opaque) be interfering with the long-lived socat SSL connection?
- Is the signal 11 on the joiner expected behavior when Galera calls abort() after receiving a failed state transfer message, or is this a bug in the abort handler?
- Are there any socat or SST tuning parameters that could help (e.g., socat buffer sizes, keepalive settings)?
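If the mesh turns out to be implicated, one experiment we can run is to bypass the Linkerd proxy for the SST and Galera ports entirely rather than marking 4444 as opaque. A hedged sketch using Linkerd's standard skip-port annotations on the pod template (port numbers assume the stock PXC layout: 4444 = SST, 4567/4568 = Galera replication/IST):

```yaml
# Pod template annotations (e.g. set via the PXC custom resource) that make
# the Linkerd proxy ignore these ports completely instead of proxying them.
metadata:
  annotations:
    config.linkerd.io/skip-inbound-ports: "4444,4567,4568"
    config.linkerd.io/skip-outbound-ports: "4444,4567,4568"
```

With the proxy skipped, the SST stream would no longer pass through any proxy-side connection handling, which should isolate whether Linkerd is a factor in the ~8-minute disconnect.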