Full Percona cluster crash after OOM

Hi @reddy_nishanth,

I ran additional tests on a 3-node PXC 8.0 cluster to confirm the failure mechanism from your original incident, and I’m also addressing your new March 4 incident below.

Experimental reproduction (original OOM cascade)

I forced one node to need SST (removed its galera.cache while stopped), started the SST, then SIGKILL’d all 3 nodes mid-transfer. After the crash, all nodes showed seqno: -1. Bootstrapping from the donor (the correct node) recovered all 8,300 rows. Bootstrapping from the SST victim (the wrong node) recovered only 5,300 rows, losing 3,000. The victim’s datadir had been partially overwritten by xtrabackup before the kill: the old data was already deleted and the new data was incomplete.

A separate multi-cycle crash loop test (3 cycles, correct bootstrap each time) showed zero degradation. The data loss only occurs when the operator picks a node whose datadir was damaged by interrupted SST.

The operator’s auto-recovery logic picks the node with the highest seqno from pod logs, but has no memory of previous recovery cycles and no regression detection. If a mid-SST victim reports a low seqno and still gets selected (because other nodes also report -1), the stale data becomes the new cluster state. In 108 cycles, even one wrong pick propagates the loss permanently.
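The missing regression check could be as simple as comparing a candidate's recovered seqno against the last seqno a previous recovery cycle committed. A hypothetical sketch, NOT something the operator does today; the function and variable names are mine:

```shell
# Hypothetical regression guard (not in the operator): refuse to bootstrap
# from a candidate whose recovered seqno is lower than the last seqno a
# previous recovery cycle committed.
safe_to_pick() {
  candidate_seqno=$1   # from mysqld --wsrep-recover on the candidate
  last_known_seqno=$2  # persisted from the previous recovery cycle
  if [ "$candidate_seqno" -ge "$last_known_seqno" ]; then
    echo ok
  else
    echo regression
  fi
}
```

For example, `safe_to_pick -1 8300` prints `regression`, flagging a mid-SST victim before its stale data can become the new cluster state.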

Your new incident (March 4): liveness probe killing SST donor

This time the trigger was different. The SmartUpdate required a full SST, and PXC-1 became the donor. During heavy xtrabackup I/O, the liveness probe failed in two ways:

  1. The script timed out (>15s) under xtrabackup I/O contention
  2. It checks wsrep_cluster_status and treats non-Primary (expected Donor/Desynced state) as unhealthy

After 5 failures, kubelet killed PXC-1 mid-SST, which cascaded into the SEGFAULT on PXC-0 and PXC-2 and the current deadlock.

This is a known issue tracked in K8SPXC-1724 (targeted for operator 1.20.0). Tune your CR to prevent it:

pxc:
  livenessProbes:
    initialDelaySeconds: 300
    timeoutSeconds: 30       # was 15
    periodSeconds: 30        # was 10
    successThreshold: 1
    failureThreshold: 10     # was 5

This gives ~5 minutes (30s x 10) before kubelet kills a donor.

The SEGFAULT in libgalera_smm.so when the donor is killed mid-SST is related to PXC-4285. You already have earlier Galera fixes on 8.0.42, but this variant may still exist. Full backtrace from the error log would help engineering reproduce it.

Recovery for the current deadlock

All nodes have seqno -1 and PXC-1 is stuck sending 0 bytes. To recover:

  1. Scale down: kubectl scale --replicas=0 statefulset/mysqlcluster-pxc -n percona-operator
  2. Run mysqld --wsrep-recover on each PV to find actual seqno (the -1 in grastate.dat just means unclean shutdown)
  3. Bootstrap from the node with the highest seqno
  4. Apply the liveness tuning above before letting others rejoin via SST
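Steps 2 and 3 can be made mechanical by parsing the "Recovered position" line from each node's wsrep-recover output and comparing seqnos. A sketch, assuming you captured each node's output; the helper names and the "node seqno" pair format are mine:

```shell
# Pull the seqno out of a wsrep-recover log line of the form:
#   WSREP: Recovered position: <uuid>:<seqno>
recovered_seqno() {
  sed -n 's/.*Recovered position: .*:\(-\{0,1\}[0-9][0-9]*\)$/\1/p' | tail -n1
}

# Given "node seqno" pairs on stdin (one per line), print the bootstrap
# candidate: the row with the highest recovered seqno.
pick_bootstrap_node() {
  sort -k2,2nr | head -n1
}
```

For example, `printf 'pxc-0 1042\npxc-1 -1\npxc-2 998\n' | pick_bootstrap_node` prints `pxc-0 1042`.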

Key mitigations

  • Right-size memory limits. Increasing to 24Gi is the right call. Set innodb_buffer_pool_size to 50-60% of the container limit to leave headroom for SST operations. The operator auto-tunes to 75%, which leaves too little for xtrabackup receive + decompress.
  • Consider autoRecovery: false in environments prone to cascading failures. See Operator crash recovery docs.
  • Upgrade PXC to 8.0.43+. Fixes the SST idle timeout bug (PXC-4392) where sst-idle-timeout could prematurely abort large SSTs.
  • Always verify seqno before bootstrap. Run mysqld --wsrep-recover on every node. Never trust grastate.dat alone after a crash. See PXC bootstrap procedure.
  • Verify data integrity after recovery. Run pt-table-checksum across all nodes. See SST internals.
  • Plan migration to PXC 8.4 LTS. MySQL 8.0 Extended Support ends April 30, 2026.
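The buffer-pool sizing advice above is simple arithmetic; a sketch for sanity-checking the numbers (the 55% midpoint and the helper name are mine, not an operator setting):

```shell
# Rule of thumb from this thread: size innodb_buffer_pool_size at roughly
# 50-60% of the container memory limit (not the operator's 75% auto-tune),
# leaving headroom for SST receive + decompress.
buffer_pool_gib() {
  limit_gib=$1
  pct=${2:-55}   # midpoint of the 50-60% guidance
  echo $(( limit_gib * pct / 100 ))
}
```

With a 24Gi limit this yields a 13Gi buffer pool, leaving roughly 11Gi of headroom for SST operations.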


Hi @anderson.nogueira, thanks a lot for your response!

I’m pasting more logs around the SEGFAULT in libgalera_smm.so.

The sequence from the joiner’s (pxc-0) error log:

2026-03-03T07:42:31.113304Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2026-03-03T07:42:31.113329Z 0 [ERROR] [MY-000000] [WSREP-SST] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
2026-03-03T07:42:31.113335Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 2470
2026-03-03T07:42:31.113352Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2026-03-03T07:42:31.113421Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:2
2026-03-03T07:42:31.268707Z 0 [Warning] [MY-000000] [Galera] 1.0 (mysqlcluster-pxc-1): State transfer to 0.0 (mysqlcluster-pxc-0) failed: Invalid argument
2026-03-03T07:42:31.268755Z 0 [ERROR] [MY-000000] [Galera]
../../../../percona-xtradb-cluster-galera/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1334: Will never receive state. Need to abort.
2026-03-03T07:42:31.269426Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2026-03-03T07:42:31.269437Z 0 [Note] [MY-000000] [WSREP] Terminating SST process
2026-03-03T07:42:31Z UTC - mysqld got signal 11 ;
BuildID[sha1]=fd50c479d409fa6f6df793b901773f54c51aed34
Server Version: 8.0.42-33.1 Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e, WSREP version 26.1.4.3, wsrep_26.1.4.3

stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x219b71e]
/usr/sbin/mysqld(print_fatal_signal(int)+0x37f) [0x12688af]
/usr/sbin/mysqld(handle_fatal_signal+0xd0) [0x1268990]
/lib64/libc.so.6(+0x3fc30) [0x7449a8e11c30]
/lib64/libc.so.6(abort+0x178) [0x7449a8dfb918]
/usr/lib64/galera4/libgalera_smm.so(+0x45b09) [0x74499c39cb09]
/usr/lib64/galera4/libgalera_smm.so(+0xdb936) [0x74499c432936]
/usr/lib64/galera4/libgalera_smm.so(+0xb638a) [0x74499c40d38a]
/lib64/libc.so.6(+0x8b2ea) [0x7449a8e5d2ea]
/lib64/libc.so.6(+0x1103c0) [0x7449a8ee23c0]

Donor’s logs:

2026-03-03T07:48:04.302280Z 0 [ERROR] [WSREP-SST] xtrabackup finished with error: 2. Check /var/lib/mysql//innobackup.backup.log
2026-03-03T07:48:04.318432Z 0 [Note] [Galera] SST sending failed: -22

innobackup.backup.log:

2026-03-03T07:48:02.660985-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql --server-id=16892861 --innodb_flush_log_at_trx_commit=2 --innodb_flush_method=O_DIRECT --innodb_file_per_table=1 --innodb_buffer_pool_size=12893290496 --innodb_flush_method=O_DIRECT --innodb_flush_log_at_trx_commit=1 --defaults_group=mysqld
2026-03-03T07:48:02.661141-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --socket=/tmp/mysql.sock --compress=lz4 --no-version-check=1 --parallel=4 --user=mysql.pxc.sst.user --password=* --socket=/tmp/mysql.sock --lock-ddl=1 --backup=1 --galera-info=1 --stream=xbstream --xtrabackup-plugin-dir=/usr/bin/pxc_extra/pxb-8.0/lib/plugin --target-dir=/tmp/pxc_sst_glRh/donor_xb_rzCB
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup version 8.0.35-31 based on MySQL server 8.0.35 Linux (x86_64) (revision id: 55ec21d7)
2026-03-03T07:48:02.661173-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: mysql.pxc.sst.user, password: set, port: not set, socket: /tmp/mysql.sock
2026-03-03T07:48:02.667068-00:00 0 [Note] [MY-011825] [Xtrabackup] Using server version 8.0.42-33.1
2026-03-03T07:48:02.767586-00:00 0 [Note] [MY-011825] [Xtrabackup] Executing LOCK TABLES FOR BACKUP ...
2026-03-03T07:48:02.769201-00:00 0 [Note] [MY-011825] [Xtrabackup] uses posix_fadvise().
2026-03-03T07:48:02.769246-00:00 0 [Note] [MY-011825] [Xtrabackup] cd to /var/lib/mysql
2026-03-03T07:48:02.769257-00:00 0 [Note] [MY-011825] [Xtrabackup] open files limit requested 0, set to 1000000
2026-03-03T07:48:03.921703-00:00 0 [Note] [MY-011825] [Xtrabackup] using the following InnoDB configuration:
2026-03-03T07:48:03.921751-00:00 0 [Note] [MY-011825] [Xtrabackup] innodb_data_home_dir = .
2026-03-03T07:48:03.921759-00:00 0 [Note] [MY-011825] [Xtrabackup] innodb_data_file_path = ibdata1:12M:autoextend
2026-03-03T07:48:03.921790-00:00 0 [Note] [MY-011825] [Xtrabackup] innodb_log_group_home_dir = ./
2026-03-03T07:48:03.921798-00:00 0 [Note] [MY-011825] [Xtrabackup] innodb_log_files_in_group = 2
2026-03-03T07:48:03.921807-00:00 0 [Note] [MY-011825] [Xtrabackup] innodb_log_file_size = 50331648
2026-03-03T07:48:03.921818-00:00 0 [Note] [MY-011825] [Xtrabackup] using O_DIRECT
2026-03-03T07:48:03.923606-00:00 0 [Note] [MY-011825] [Xtrabackup] inititialize_service_handles suceeded
2026-03-03T07:48:04.092486-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: mysql.pxc.sst.user, password: set, port: not set, socket: /tmp/mysql.sock
2026-03-03T07:48:04.097848-00:00 0 [Note] [MY-011825] [Xtrabackup] Redo Log Archiving is not set up.
2026-03-03T07:48:04.196156-00:00 0 [Note] [MY-011825] [Xtrabackup] Starting to parse redo log at lsn = 38240263354
2026-03-03T07:48:04.198557-00:00 0 [Note] [MY-012564] [InnoDB] Recovery parsing buffer extended to 4194304.
2026-03-03T07:48:04.199858-00:00 0 [Note] [MY-012564] [InnoDB] Recovery parsing buffer extended to 8388608.
2026-03-03T07:48:04Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x18df6f1]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(print_fatal_signal(int)+0x3bc) [0xddf5bc]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(handle_fatal_signal+0x95) [0xddf665]
/lib64/libc.so.6(+0x3fc30) [0x7dbb7cc73c30]
/lib64/libc.so.6(+0x8d02c) [0x7dbb7ccc102c]
/lib64/libc.so.6(raise+0x16) [0x7dbb7cc73b86]
/lib64/libc.so.6(abort+0xd3) [0x7dbb7cc5d873]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup() [0x8a1786]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup() [0x8a9ba7]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(Redo_Log_Writer::write_buffer(unsigned char*, unsigned long)+0x1a4) [0x905c74]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(Redo_Log_Data_Manager::start()+0xcd) [0x916dfd]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(xtrabackup_backup_func()+0x958) [0x8cf408]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(main+0x14ac) [0x871eec]
/lib64/libc.so.6(+0x2a610) [0x7dbb7cc5e610]
/lib64/libc.so.6(__libc_start_main+0x80) [0x7dbb7cc5e6c0]
/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup(_start+0x25) [0x89e195]
Please report a bug at https://jira.percona.com/projects/PXB

Do you think this could be related to a known Jira issue?

Folks, I don’t have anything to add to the recent investigation, but I just wanted to mention that I created K8SPXC-1828 for improving auto recovery in environments where PXC pods are killed often.


Hi @reddy_nishanth, @Yash_Daga,

Thanks for the additional logs from the donor and joiner. They reveal that you actually hit two separate bugs on consecutive days, both leading to the same Galera SEGFAULT downstream.

Mar 3 (Yash’s logs): xtrabackup LZ4 crash on the donor

The donor’s innobackup.backup.log shows xtrabackup 8.0.35-31 crashed with signal 6 in Redo_Log_Writer::write_buffer right after parsing the redo log. The backup was using --compress=lz4. This is an exact match for PXB-3568 (Open, Urgent): signal 6 in the same stack trace (Redo_Log_Writer::write_buffer → Redo_Log_Data_Manager::start → xtrabackup_backup_func), same xtrabackup version, same LZ4 compression. PXB-3568 is linked to PKG-842 (“PXC 8.0.42 docker image does not work with compress=lz4”, fixed). The joiner then got “xtrabackup_checkpoints missing” because the donor never finished the backup, followed by signal 11 in libgalera_smm.so during SST cancellation, which is the same PXC-4821 race condition.

So yes, Yash, PXB-3568 is the right ticket for the donor crash. Upgrading PXC to 8.0.45-36 would pull in a newer xtrabackup that includes the fix from PKG-842.

Mar 4 (reddy_nishanth’s timeline): liveness probe kills donor during SST

This is the incident I analyzed in my previous post. The failure chain is different: the liveness probe’s MySQL query timed out under xtrabackup I/O load, kubelet killed the donor after 5 consecutive failures, and the joiners crashed with the same PXC-4821 SEGFAULT. To correct something from my earlier reply (#21): I said the donor reports non-Primary, but wsrep_cluster_status is a cluster-level variable and a donor stays in the Primary component during SST. The probe failure is primarily from the query timing out under I/O pressure, not from a non-Primary status. The sst_in_progress sentinel file that suppresses liveness checks is only created on the joiner side, not the donor, so the donor has no protection.

Both incidents share the same downstream crash (PXC-4821): when the donor dies mid-SST for any reason, the joiner’s Galera library hits a race condition in the SST cancellation path and SEGFAULTs. This is still open and unfixed in 8.0.45.

What to fix (updated from my previous recommendations):

  1. Widen liveness probes to prevent the Mar 4 failure path:
spec:
  pxc:
    livenessProbes:
      initialDelaySeconds: 300
      timeoutSeconds: 30
      periodSeconds: 30
      failureThreshold: 10
  2. Cap the buffer pool and SST memory explicitly (prevents OOM on joiners):
spec:
  pxc:
    configuration: |
      [mysqld]
      innodb_buffer_pool_size=12G

      [sst]
      inno-apply-opts="--use-memory=1G"
    resources:
      requests:
        memory: "24Gi"
      limits:
        memory: "24Gi"

The inno-apply-opts line caps the memory xtrabackup uses during SST prepare on the joiner. Setting requests equal to limits gives Guaranteed QoS class. A safe rule of thumb is container limit >= buffer pool + 6Gi. If you have thousands of tables, consider 32Gi instead, as the InnoDB dictionary cache can consume several extra GiB during tablespace import.

  3. Upgrade PXC to 8.0.45-36. This addresses the Mar 3 xtrabackup crash (PXB-3568/PKG-842), plus PXC-4631 (8.0.43, stops SST after failed IST), PXC-4756 (8.0.44, SST timeout with large buffer pools), and PXC-4845 (8.0.45, IST failure handling). The Galera SEGFAULT (PXC-4821) is still open, but the upgrade eliminates two of the three triggers you hit.

Recovery for the stuck cluster: since all nodes show seqno: -1, run wsrep_recover on each pod to find the actual last committed position. The manual crash recovery procedure covers this. Bootstrap from the node with the highest recovered position. If liveness kills pods during recovery, touch /var/lib/mysql/sst_in_progress on the affected pod to suppress checks; remove it once the cluster is back.

@Ege_Gunes thanks for creating K8SPXC-1828 for the auto-recovery improvement. That would help prevent the 108-cycle cascade from the January incident.

Can you share kubectl describe pod for PXC-1 around 08:53 UTC on Mar 4? The Events section would confirm the kill reason. Also, what is your gcache.size? If IST can succeed instead of requiring full SST, most of these failure chains never start. And if you have the grastate.dat contents from each pod, those would help confirm the recovery path.


Thanks a lot for the suggestions @anderson.nogueira .

I couldn’t find the events from Mar 4, but the gcache size is 2GB, and here are the grastate.dat files from disk.
Also, can we avoid the LZ4 bug by switching to zstd compression?

=== mysqlcluster-pxc-0 ===

version: 2.1
uuid:    7e6d8fe7-13cc-11f1-8cbd-d3929876ba7c
seqno:   -1
safe_to_bootstrap: 0
=== mysqlcluster-pxc-1 ===

version: 2.1
uuid:    7e6d8fe7-13cc-11f1-8cbd-d3929876ba7c
seqno:   -1
safe_to_bootstrap: 0
=== mysqlcluster-pxc-2 ===

version: 2.1
uuid:    7e6d8fe7-13cc-11f1-8cbd-d3929876ba7c
seqno:   -1
safe_to_bootstrap: 0

Hi @Yash_Daga, @reddy_nishanth,

Switching to zstd compression

Yes, but it depends on which compression mechanism you change. There are two separate options:

  1. xtrabackup internal compression ([xtrabackup] section, compress option): Your donor is currently using --compress=lz4. You can switch to compress=zstd in the [xtrabackup] section of your custom my.cnf (via pxc.configuration in the CR). This uses libzstd internally, which is already present in the container, so no external binary is needed. Since PXB-3568 is specific to the LZ4 code path (Redo_Log_Writer::write_buffer crash with signal 6), switching to zstd should avoid it. In fact, xtrabackup 8.0.35-31 already defaults to zstd when compress is specified without an explicit algorithm.

  2. SST stream compression ([sst] section, compressor option): This invokes an external binary to compress the xbstream. Setting compressor='zstd' will fail because the zstd command-line tool is not included in PXC container images (I verified on 8.0.45: libzstd is present but the zstd binary is not). This is tracked in PXC-4812 (Pending Release).

So the quickest workaround is changing the xtrabackup compression:

pxc:
  configuration: |
    [xtrabackup]
    compress=zstd

If that still causes issues, compress=none disables compression entirely as a safe fallback.

Upgrading to PXC 8.0.45-36 is also recommended: PKG-842 (Done) fixed the Docker-specific LZ4 issue, though PXB-3568 itself remains Open.

Liveness probe tuning (clarification on #21)

Setting pxc.livenessProbes.timeoutSeconds in the CR is sufficient. The operator propagates this value as the LIVENESS_CHECK_TIMEOUT env var to the PXC container automatically. The liveness script computes its internal timeout as LIVENESS_CHECK_TIMEOUT - 1, so with timeoutSeconds: 30, the mysql client gets 29 seconds. No separate env var configuration is needed.

pxc:
  livenessProbes:
    timeoutSeconds: 30
    periodSeconds: 30
    failureThreshold: 10

With these values, the donor gets a 300-second window (10 failures x 30s period) before kubelet kills it. The sst_in_progress guard in liveness-check.sh (line 12) only protects the joiner (the file is created in the joiner branch of wsrep_sst_xtrabackup-v2.sh, line 2152). The donor has no equivalent protection, which is why the probe times out under heavy backup I/O.
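The timing claims above reduce to two small computations; a sketch for sanity-checking CR values (the helper names are mine, mirroring the behavior described above, not the script source itself):

```shell
# How long the mysql client inside the liveness script actually gets:
# per the description above, the script subtracts one second from
# LIVENESS_CHECK_TIMEOUT.
script_mysql_timeout() {
  echo $(( $1 - 1 ))
}

# How long a donor survives consecutive probe failures before kubelet
# kills it: periodSeconds multiplied by failureThreshold.
kill_window_seconds() {
  period=$1
  failure_threshold=$2
  echo $(( period * failure_threshold ))
}
```

With timeoutSeconds: 30, periodSeconds: 30, failureThreshold: 10 this gives a 29-second query timeout and a 300-second kill window.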

Recovery

All three nodes show seqno: -1 with the same UUID (7e6d8fe7...), so the cluster identity is intact. Follow the mysqld --wsrep-recover procedure from post #18 on each PV to find the node with the highest recovered seqno, then bootstrap from that node.
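Before bootstrapping, it is cheap to double-check that every grastate.dat really does carry one cluster UUID; a sketch (the helper name is mine) that reads the concatenated file contents on stdin:

```shell
# Print "yes" if exactly one distinct uuid appears across all the
# grastate.dat contents piped in, "no" otherwise (a mismatch would mean
# a node diverged to a different cluster identity).
same_cluster_uuid() {
  count=$(sed -n 's/^uuid:[[:space:]]*//p' | sort -u | wc -l)
  if [ "$count" -eq 1 ]; then echo yes; else echo no; fi
}
```

Usage against copies pulled from each PV: `cat pxc-*/grastate.dat | same_cluster_uuid`.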


Hi @anderson.nogueira

I upgraded to 8.0.45, but I was still able to reproduce the SST error with LZ4 compression. Can you confirm whether that’s fixed in 8.0.45?

Unfortunately, no. PXC 8.0.45 does not fix the LZ4 crash. Sorry for the confusion in my earlier reply.

PKG-842 (the fix in 8.0.45) only addressed the SST script version check error (Cannot determine the xtrabackup 2.x version). The actual LZ4 crash (PXB-3568, Signal 6 in XtraBackup) is still open with no fix version. Both 8.0.42 and 8.0.45 ship the same XtraBackup (8.0.35-35), so the crash behavior is identical.

The compress=zstd workaround from post #29 is the recommended path:

[xtrabackup]
compress=zstd

If zstd causes issues, removing the compress line entirely disables compression.

One correction to post #29: you do not need [sst] xbstream-opts=--decompress for this. The SST script handles decompression automatically on the joiner side. That option is only relevant if you use [sst] stream compression separately.

Before applying this change in production, please test on a non-production cluster or a copy of your database first. Force a full SST of one node under representative write load and verify:

  1. The joining node reaches Synced state
  2. Data integrity is intact (run a checksum or spot-check key tables)
  3. No errors in the donor or joiner xtrabackup logs

Could you try compress=zstd and let us know if SST completes successfully? And if you still have the error output from your 8.0.45 LZ4 reproduction, sharing the donor-side backtrace lines would help confirm it matches PXB-3568.