Hi @reddy_nishanth,
I ran additional tests on a 3-node PXC 8.0 cluster to confirm the failure mechanism from your original incident, and I’m also addressing your new March 4 incident below.
## Experimental reproduction (original OOM cascade)
Forced one node to need SST (removed its `galera.cache` while stopped), started the SST, then SIGKILL'd all three nodes mid-transfer. After the crash, all nodes showed `seqno: -1`. Bootstrapping from the donor (the correct node) recovered all 8,300 rows. Bootstrapping from the SST victim (the wrong node) recovered only 5,300 rows, a loss of 3,000 rows: the victim's datadir had been partially overwritten by xtrabackup before the kill, so the old data was already deleted and the new data was incomplete.
A separate multi-cycle crash loop test (3 cycles, correct bootstrap each time) showed zero degradation. The data loss only occurs when the operator picks a node whose datadir was damaged by interrupted SST.
The operator’s auto-recovery logic picks the node with the highest seqno from pod logs, but has no memory of previous recovery cycles and no regression detection. If a mid-SST victim reports a low seqno and still gets selected (because other nodes also report -1), the stale data becomes the new cluster state. In 108 cycles, even one wrong pick propagates the loss permanently.
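To make the selection hazard concrete, here is an illustrative sketch (not the operator's actual code, and the pod names/seqnos are fabricated) of a "pick the highest seqno" rule:

```shell
# Illustrative sketch of "bootstrap from highest seqno": each line is "pod seqno";
# a numeric reverse sort puts the best candidate first.
candidates='pxc-0 8300
pxc-1 5300
pxc-2 -1'
winner=$(printf '%s\n' "$candidates" | sort -k2,2 -nr | head -n1)
echo "$winner"    # pxc-0 8300
```

The rule works when seqnos differ, but after a full-cluster crash every node reports -1, the sort is a three-way tie, and an SST victim can win on ordering alone.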
## Your new incident (March 4): liveness probe killing SST donor
This time the trigger was different. The SmartUpdate required a full SST, and PXC-1 became the donor. During heavy xtrabackup I/O, the liveness probe failed in two ways:
- The script timed out (>15s) under xtrabackup I/O contention
- It checks `wsrep_cluster_status` and treats non-Primary (the expected Donor/Desynced state) as unhealthy
After 5 failures, kubelet killed PXC-1 mid-SST, which cascaded into the SEGFAULT on PXC-0 and PXC-2 and the current deadlock.
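A simulated sketch of the failing check (this is not the shipped probe script, and the status value is fabricated for illustration):

```shell
# Simulated probe logic: treating any non-Primary wsrep_cluster_status as
# failure wrongly marks a legitimately desynced donor unhealthy.
cluster_status="non-Primary"   # fabricated value a donor may report mid-SST
if [ "$cluster_status" = "Primary" ]; then
  result=healthy
else
  result=unhealthy    # kubelet counts this toward failureThreshold
fi
echo "$result"    # unhealthy
```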
This is a known issue tracked in K8SPXC-1724 (targeted for operator 1.20.0). Tune your CR to prevent it:
```yaml
pxc:
  livenessProbes:
    initialDelaySeconds: 300
    timeoutSeconds: 30    # was 15
    periodSeconds: 30     # was 10
    successThreshold: 1
    failureThreshold: 10  # was 5
```
This gives ~5 minutes (30s x 10) before kubelet kills a donor.
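A quick back-of-envelope check of that window, using the values from the snippet above:

```shell
# Worst-case time between the first probe failure and kubelet killing the pod
# is roughly periodSeconds * failureThreshold, plus the in-flight probe timeout.
period=30; failures=10; timeout=30
window=$(( period * failures ))   # 300 s
echo "${window}s (~5 min), plus up to ${timeout}s for an in-flight probe"
```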
The SEGFAULT in `libgalera_smm.so` when the donor is killed mid-SST is related to PXC-4285. You already have the earlier Galera fixes on 8.0.42, but this variant may still exist. A full backtrace from the error log would help engineering reproduce it.
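If it helps, here is a hypothetical sketch of isolating the signal-11 section of the error log to attach to a report. The sample log lines below are fabricated; real PXC logs bracket the trace similarly:

```shell
# Extract everything from the "got signal 11" line to the end of the log.
cat > /tmp/mysqld.err <<'EOF'
2025-03-04T10:00:01Z 0 [Note] WSREP: SST in progress
2025-03-04T10:05:42Z 0 [ERROR] mysqld got signal 11 ;
libgalera_smm.so(+0xabc123)
EOF
trace=$(sed -n '/got signal 11/,$p' /tmp/mysqld.err)
echo "$trace"
```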
## Recovery for the current deadlock
All nodes have seqno -1 and PXC-1 is stuck sending 0 bytes. To recover:
- Scale down: `kubectl scale --replicas=0 statefulset/mysqlcluster-pxc -n percona-operator`
- Run `mysqld --wsrep-recover` on each PV to find the actual seqno (the -1 in `grastate.dat` just means unclean shutdown)
- Bootstrap from the node with the highest recovered seqno
- Apply the liveness tuning above before letting others rejoin via SST
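For the `--wsrep-recover` step, a hypothetical sketch of pulling the seqno out of the "Recovered position" line that mysqld prints (the UUID and seqno here are fabricated; the log line format is real):

```shell
# The recovered position has the form UUID:seqno; strip everything up to the
# last colon to get the seqno.
line='WSREP: Recovered position: 6b4e1a2f-18ec-11ee-9c7a-000000000000:8300'
seqno="${line##*:}"
echo "$seqno"    # 8300
```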
## Key mitigations
- Right-size memory limits. Increasing to 24Gi is the right call. Set `innodb_buffer_pool_size` to 50-60% of the container limit to leave headroom for SST operations; the operator auto-tunes it to 75%, which leaves too little for xtrabackup receive + decompress.
- Consider `autoRecovery: false` in environments prone to cascading failures. See the Operator crash recovery docs.
- Upgrade PXC to 8.0.43+. This fixes the SST idle timeout bug (PXC-4392), where `sst-idle-timeout` could prematurely abort large SSTs.
- Always verify the seqno before bootstrap. Run `mysqld --wsrep-recover` on every node; never trust `grastate.dat` alone after a crash. See the PXC bootstrap procedure.
- Verify data integrity after recovery. Run `pt-table-checksum` across all nodes. See SST internals.
- Plan a migration to PXC 8.4 LTS. MySQL 8.0 Extended Support ends April 30, 2026.
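On the "never trust `grastate.dat`" point, a hypothetical sketch of reading the fields it does carry (the file format is real; the sample contents are fabricated):

```shell
# Write a sample grastate.dat and read its seqno; -1 only tells you the
# shutdown was unclean, not which node has the freshest data.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6b4e1a2f-18ec-11ee-9c7a-000000000000
seqno:   -1
safe_to_bootstrap: 0
EOF
seqno=$(awk '/^seqno:/ {print $2}' /tmp/grastate.dat)
echo "$seqno"    # -1
```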
References:
- K8SPXC-1724 (operator killing SST): Jira
- K8SPXC-824: Jira
- PXC-4285 (signal 11 on SST error): Jira
- PXC-4392: Jira
- Operator probe config: Custom Resource options - Percona Operator for MySQL
- Crash recovery: Crash recovery - Percona XtraDB Cluster
- Operator recovery: Crash recovery - Percona Operator for MySQL