We’re running PXC Operator v1.19.0 with PXC 8.0.42-33.1 on Kubernetes (3-node cluster, ~580 GB dataset, zstd SST compression). When a node needs SST, it fails consistently after ~8 minutes of streaming. This happened 6+ times in a row.
What happens:
SST streams successfully at 15-20 MiB/s for ~8 minutes (~8 GB transferred), then the socat SSL connection between donor and joiner breaks. The donor detects the broken pipe, and the joiner gets partitioned from the cluster and crashes with signal 11.
Donor logs (pxc-1) showing the failure chain:
14:24:08 socat[195727] E I/O error
14:24:08 donor: => Rate:[4.51MiB/s] Avg:[19.4MiB/s] Elapsed:0:07:45
14:24:09 FATAL ERROR: xtrabackup was not able to send data to the Joiner node.
14:24:09 Within the last 120 seconds (defined by the sst-idle-timeout variable),
the SST process on the donor (this node) has not sent any data to the joiner.
This error could be caused by broken network connectivity between
the donor (this node) and the joiner.
14:24:09 xtrabackup finished with error: 1
In a subsequent attempt, the donor also reported:
14:32:39 xtrabackup_copy_datafile() failed
14:32:39 failed to copy datafile ./stress_test_db/snapshot_data.ibd
14:32:39 failed to copy datafile ./stress_test_db/taskchain_data.ibd
14:32:39 Process completed with error: wsrep_sst_xtrabackup-v2 … : 22 (Invalid argument)
14:32:39 SST sending failed: -22
14:32:39 State transfer to mysqlcluster-pxc-2 failed: Invalid argument
Joiner logs (pxc-2) showing the crash:
14:22:32 joiner: => Rate:[18.4MiB/s] Avg:[21.9MiB/s] Elapsed:0:04:10 Bytes: 5.39GiB ← streaming fine
14:24:09 State transfer to mysqlcluster-pxc-2 failed: Invalid argument
14:24:09 gcs_group.cpp:gcs_group_handle_join_msg():1334: Will never receive state. Need to abort.
14:24:09 mysqld: Terminated.
14:24:09 mysqld got signal 11
/lib64/libc.so.6(abort+0x178)
/usr/lib64/galera4/libgalera_smm.so(+0x45b09)
14:24:09 Cluster view: NON_PRIM, both other nodes in “partitioned”
Pattern across all attempts:
Attempt 1: Streamed ~8.6 GB in ~9 min, heavy write load (51KB rows), socat I/O error → crash
Attempt 2: Streamed ~8.6 GB in ~9 min, heavy write load, same failure
Attempt 3: Streamed ~8.6 GB in ~9 min, heavy write load, same failure
Attempt 4: Streamed ~8 GB in ~10 min, no write load, same failure
Attempt 5: Streamed ~7.8 GB in ~8 min, no write load, same failure
Attempt 6: Streamed ~8.6 GB in ~8 min, no write load, same failure
The failure happens with and without active write load on the cluster. An earlier SST with a smaller dataset (7.24 GB after compression) succeeded in 15 minutes, so the difference may be timing-related or related to the data volume.
SST configuration:
[xtrabackup]
compress=zstd
[sst]
xbstream-opts=--decompress
wsrep_provider_options="gcache.size=2G; gcache.recover=yes"
We have not set sst-idle-timeout (using the default 120s). The "sst-idle-timeout" error message from the donor appears to be a consequence of the socat break, not the root cause: the socat connection dies first, then the idle timeout detects no data flow.
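For completeness, both knobs live in the node's my.cnf: sst-idle-timeout is read from the [sst] section, and the SST script also accepts a sockopt value that is appended to socat's address string, which allows enabling TCP keepalives. A sketch of what we could try (the keepalive idea is our assumption to test, not a confirmed fix; the exact sockopt syntax should be checked against the wsrep_sst_xtrabackup-v2 version in use):

```ini
[sst]
# Default is 120 s; raising it only papers over the symptom if the
# socket is actually being reset, but it rules out a pure idle timeout.
sst-idle-timeout=300
# Appended to socat's address string (note the leading comma).
# Enables TCP keepalives so a quiet-but-healthy stream is not dropped
# by an intermediary; the option names are socat's Linux socket options.
sockopt=",keepalive,keepidle=60,keepintvl=10,keepcnt=5"
```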
Environment: PXC 8.0.42-33.1, Operator v1.19.0, 3-node PXC cluster on a 4-node Kubernetes cluster, Linkerd service mesh (port 4444 is in opaque-ports; ports 4567/4568 are in skip-inbound/outbound-ports), zstd SST compression.
Questions:
- What causes the socat SSL connection to break after ~8 minutes? Is there a known issue with socat timeouts or buffer limits during large SST transfers?
- Could Linkerd’s proxy (even with port 4444 marked as opaque) be interfering with the long-lived socat SSL connection?
- Is the signal 11 on the joiner expected behavior when Galera calls abort() after receiving a failed state transfer message, or is this a bug in the abort handler?
- Are there any socat or SST tuning parameters that could help (e.g., socat buffer sizes, keepalive settings)?
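If the mesh turns out to be implicated, one experiment we can run is to bypass the Linkerd proxy for the SST and Galera ports entirely rather than marking 4444 as opaque. A hedged sketch using Linkerd's standard skip-port annotations on the pod template (port numbers assume the stock PXC layout: 4444 = SST, 4567/4568 = Galera replication/IST):

```yaml
# Pod template annotations (e.g. set via the PXC custom resource) that make
# the Linkerd proxy ignore these ports completely instead of proxying them.
metadata:
  annotations:
    config.linkerd.io/skip-inbound-ports: "4444,4567,4568"
    config.linkerd.io/skip-outbound-ports: "4444,4567,4568"
```

With the proxy skipped, the SST stream would no longer pass through any proxy-side connection handling, which should isolate whether Linkerd is a factor in the ~8-minute disconnect.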