Full Percona XtraDB Cluster crash after OOM

Description:

Why did my cluster crash and become non-recoverable after OOM?

Steps to Reproduce:

PXC cluster pods were OOM-killed, and there were many restarts:

polaris@RKLAB-RVMHM255S006481:~$ k get all -n percona-operator
NAME                                                   READY   STATUS                       RESTARTS           AGE
pod/mysqlcluster-haproxy-0                             1/2     Running                      1508 (11s ago)     22d
pod/mysqlcluster-haproxy-1                             1/2     CrashLoopBackOff             1512 (2m1s ago)    22d
pod/mysqlcluster-haproxy-2                             1/2     Running                      1435 (2m35s ago)   22d
pod/mysqlcluster-pxc-0                                 1/1     Running                      545 (4d14h ago)    22d
pod/mysqlcluster-pxc-1                                 1/1     Running                      661 (4d14h ago)    22d
pod/mysqlcluster-pxc-2                                 1/1     Running                      212 (4d14h ago)    22d
pod/percona-xtradb-cluster-operator-584685d9df-sl7mv   1/1     Running                      1 (31h ago)        22d
pod/xb-backup-260122-1134-44wmj                        0/1     CreateContainerConfigError   0                  12d

NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                          AGE
service/mysqlcluster-haproxy              ClusterIP   100.96.148.158   <none>        3306/TCP,3309/TCP,33062/TCP,33060/TCP,8404/TCP   22d
service/mysqlcluster-haproxy-replicas     ClusterIP   100.96.64.20     <none>        3306/TCP                                         22d
service/mysqlcluster-pxc                  ClusterIP   None             <none>        3306/TCP,33062/TCP,33060/TCP                     22d
service/mysqlcluster-pxc-unready          ClusterIP   None             <none>        3306/TCP,33062/TCP,33060/TCP                     22d
service/percona-xtradb-cluster-operator   ClusterIP   100.96.204.93    <none>        443/TCP                                          22d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/percona-xtradb-cluster-operator   1/1     1            1           22d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/percona-xtradb-cluster-operator-584685d9df   1         1         1       22d

NAME                                    READY   AGE
statefulset.apps/mysqlcluster-haproxy   0/3     22d
statefulset.apps/mysqlcluster-pxc       3/3     22d

The reason was that the pods were OOM-killed.

After that, all of them started reporting seqno -1:

polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-1 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-2 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
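The per-pod check above can be scripted. A minimal sketch (the awk filter is the only logic; a sample grastate.dat stands in for the kubectl exec output, and the pod/namespace names are the ones from this cluster):

```shell
# Extract the seqno field from grastate.dat content on stdin.
grastate_seqno() { awk '$1 == "seqno:" { print $2 }'; }

# In the cluster this would be fed per pod, e.g.:
#   kubectl -n percona-operator exec mysqlcluster-pxc-0 -c pxc -- \
#     cat /var/lib/mysql/grastate.dat | grastate_seqno
# A sample file stands in for that output here:
printf '# GALERA saved state\nversion: 2.1\nseqno:   -1\nsafe_to_bootstrap: 0\n' | grastate_seqno
# prints: -1
```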

And the logs from pxc-1 show that the highest sequence number is -1:

#####################################################FULL_PXC_CLUSTER_CRASH:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local node with sequence number (seqno): -1
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local:-1:#####################################################
polaris@RKLAB-RVMHM255S006481:~$

Version:

Operator version 1.17.0

Logs:

Attached above. In additional logs I saw an sst_in_progress file (0 bytes, dated Jan 30 01:16) from Jan 30, when this all happened:

polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- ls -la /var/lib/mysql/
total 2274548
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_0.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_1.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_10.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_11.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_12.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_13.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_14.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_15.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_2.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_3.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_4.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_5.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_6.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_7.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_8.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_9.dblwr
drwxr-s--- 2 mysql  mysql        4096 Feb  3 16:30 #innodb_redo
drwxr-s--- 2 mysql  mysql         187 Feb  3 16:30 #innodb_temp
drwxrwsrwx 7 root   mysql       12288 Feb  3 16:37 .
drwxr-xr-x 1 root   root           19 May 18  2023 ..
-rw-rw---- 1 mysql  mysql         399 Jan 27 13:47 GRA_10_3101111_v2.log
-rw-rw---- 1 mysql  mysql         386 Jan 27 13:47 GRA_10_3102292_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_10_3102569_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_10_3102601_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:03 GRA_11_3159148_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:03 GRA_11_3163950_v2.log
-rw-rw---- 1 mysql  mysql         483 Jan 27 14:04 GRA_11_3170913_v2.log
-rw-rw---- 1 mysql  mysql         456 Jan 27 14:04 GRA_11_3170917_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 14:03 GRA_1_3159556_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:04 GRA_1_3168943_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 14:04 GRA_1_3169716_v2.log
-rw-rw---- 1 mysql  mysql         371 Jan 27 14:04 GRA_1_3170914_v2.log
-rw-rw---- 1 mysql  mysql         371 Jan 27 14:04 GRA_1_3175329_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 13:46 GRA_2_3097310_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_2_3100880_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 13:47 GRA_2_3102064_v2.log
-rw-rw---- 1 mysql  mysql         412 Jan 27 13:47 GRA_2_3102293_v2.log
-rw-rw-r-- 1 mysql  mysql          22 Feb  3 16:29 auth_plugin
-rw-r----- 1 mysql  mysql          56 Feb  3 16:29 auto.cnf
-rw-r----- 1 mysql  mysql         180 Feb  3 16:30 binlog.000001
-rw-r----- 1 mysql  mysql         201 Feb  3 16:31 binlog.000002
-rw-r----- 1 mysql  mysql         157 Feb  3 16:31 binlog.000003
-rw-r----- 1 mysql  mysql          48 Feb  3 16:31 binlog.index
-rw-rw---- 1 mysql  mysql  2147484952 Feb  3 16:37 galera.cache
-rwxr-xr-x 1 daemon daemon       1138 Feb  3 16:29 get-pxc-state
-rw-rw---- 1 mysql  mysql         118 Feb  3 16:37 grastate.dat
-rw-r----- 1 mysql  mysql         264 Feb  3 16:37 gvwstate.dat
-rw-r----- 1 mysql  mysql        6638 Feb  3 16:30 ib_buffer_pool
-rw-r----- 1 mysql  mysql    12582912 Feb  3 16:37 ibdata1
-rw-r----- 1 mysql  mysql    12582912 Feb  3 16:30 ibtmp1
-rw-rw---- 1 mysql  mysql       44185 Feb  3 16:31 innobackup.backup.log
-rw-rw---- 1 mysql  mysql    76004458 Jan 27 14:02 innobackup.move.log
-rw-rw---- 1 mysql  mysql       33821 Jan 27 14:02 innobackup.prepare.log
-rwxr-xr-x 1 daemon daemon       1708 Feb  3 16:29 liveness-check.sh
drwxr-s--- 2 mysql  mysql         232 Feb  3 16:29 mysql
-rwxr-xr-x 1 daemon daemon    1740952 Feb  3 16:29 mysql-state-monitor
-rw-rw-r-- 1 mysql  mysql        2046 Feb  3 16:30 mysql-state-monitor.log
-rw-r----- 1 mysql  mysql    31457280 Feb  3 16:37 mysql.ibd
-rw-rw-r-- 1 mysql  mysql         147 Feb  3 16:30 mysql.state
-rw-r----- 1 mysql  mysql         792 Feb  3 16:30 mysqlcluster-pxc-0-slow.log
-rw-r----- 1 mysql  mysql           2 Feb  3 16:30 mysqlcluster-pxc-0.pid
-rw-rw---- 1 mysql  mysql        4489 Jan 27 14:02 mysqld.post.processing.log
srwxrwxrwx 1 mysql  mysql           0 Feb  3 16:30 mysqlx.sock
-rw------- 1 mysql  mysql           2 Feb  3 16:30 mysqlx.sock.lock
srwxr-xr-x 1 mysql  mysql           0 Feb  3 16:29 notify.sock
-rwxr-xr-x 1 daemon daemon    3852289 Feb  3 16:29 peer-list
drwxr-s--- 2 mysql  mysql        8192 Feb  3 16:29 performance_schema
-rwxr-xr-x 1 daemon daemon        830 Feb  3 16:29 pmm-prerun.sh
-rw------- 1 mysql  mysql        1680 Feb  3 16:29 private_key.pem
-rw-r--r-- 1 mysql  mysql         452 Feb  3 16:29 public_key.pem
-rwxr-xr-x 1 daemon daemon       5471 Feb  3 16:29 pxc-configure-pxc.sh
-rwxr-xr-x 1 daemon daemon      25159 Feb  3 16:29 pxc-entrypoint.sh
-rwxr-xr-x 1 daemon daemon       1470 Feb  3 16:29 readiness-check.sh
-rw-rw---- 1 mysql  mysql           0 Jan 30 01:16 sst_in_progress
drwxr-s--- 2 mysql  mysql          28 Feb  3 16:29 sys
-rw-r----- 1 mysql  mysql    16777216 Feb  3 16:39 undo_001
-rw-r----- 1 mysql  mysql    16777216 Feb  3 16:39 undo_002
-rwxr-xr-x 1 daemon daemon        543 Feb  3 16:29 unsafe-bootstrap.sh
-rw-rw-r-- 1 mysql  mysql         142 Feb  3 16:30 version_info
-rwxr-xr-x 1 daemon daemon        618 Feb  3 16:29 wsrep_cmd_notify_handler.sh
-rw-rw-r-- 1 mysql  mysql       98344 Jan 27 14:14 wsrep_recovery_verbose_history.log

Expected Result:

Expected that after an OOM kill, the data would still be present.

Actual Result:

Tables and other data were deleted.

Additional Information:

How can I debug this and determine what caused the failure?

Even if pods are killed by OOM during SST, the data should be intact on the donor node. I wonder whether the operator bootstrapped the cluster from the wrong node. Do you have operator logs from that time?

Hi, looking at the operator logs, here is a summary:

Jan 11, 2026 15:58: Operator started successfully, cluster was healthy (PXC version 8.0.32-24.2)
Jan 11, 2026 18:14: First connection issues appear (pxc-2 connection refused)
Jan 12-14: Intermittent connection issues continue (connection refused, connection reset by peer)
Jan 25, 2026 20:28: First "cluster is not ready" message appears
Jan 27, 2026 13:39: WSREP errors begin: "WSREP has not yet prepared node for application use"
Jan 27-28: Continuous WSREP errors, cluster completely down
Jan 28, 2026 00:01: Connection refused errors to all nodes (pxc-0, pxc-1)
Jan 28 onwards: Cluster remains down with continuous reconcile errors

The exact log line is:

2026-01-27T13:39:18+00:00 percona-xtradb-cluster-operator {"log":"2026-01-27T13:39:18.495Z\tINFO\treconcile replication error\t{"controller": "pxc-controller", "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "8aa6d2ee-b277-4561-96c3-be636f271208", "err": "remove outdated replication channels: get current replication channels: select current replication channels: Error 1047 (08S01): WSREP has not yet prepared node for application use"}

Let me know if you want me to search for some specific errors; the log file is huge.

Do you have "Results of scanning sequences" log lines?

Summary

Field                 Value
Timestamp             2026-01-29T07:44:49.550Z
Level                 INFO
Message               Results of scanning sequences
Controller            pxc-controller
Namespace             percona-operator
Cluster Name          mysqlcluster
ReconcileID           a4d06dd7-22fe-4d60-88e5-6582b4423b01
Selected Pod          mysqlcluster-pxc-2
Max Sequence Number   1,630,159

Timestamp (UTC)       maxSeq
2026-01-27 13:41:04   3,092,897
2026-01-27 14:13:40   3,218,826
2026-01-27 14:46:44   71,769  :warning: (dropped dramatically!)
2026-01-27 15:23:19   133,999
2026-01-27 16:25:14   214,356
(steadily increasing)
2026-01-29 20:37:09   2,090,666

It was using pxc-2 before.

Adding more logs

2026-01-27T14:13:40.274Z INFO We are in full cluster crash, starting recovery
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: 5546e2de-0ce7-41aa-aa00-20af48772fa4

2026-01-27T14:13:40.274Z INFO Results of scanning sequences
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: 5546e2de-0ce7-41aa-aa00-20af48772fa4
pod: mysqlcluster-pxc-2
maxSeq: 3218826

Replication errors

2026-01-27T14:14:14.396Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: dial tcp 100.96.148.158:3306: connect: connection refused"

2026-01-27T14:15:22.939Z INFO reconcile replication error
err: "failed to ensure cluster readonly status: connect to pod mysqlcluster-pxc-1: dial tcp 100.100.156.9:33062: connect: connection refused"

2026-01-27T14:30:29.684Z INFO reconcile replication error
err: "failed to ensure cluster readonly status: connect to pod mysqlcluster-pxc-0: dial tcp 100.100.140.20:33062: connect: connection refused"

2026-01-27T14:39:17.679Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: invalid connection"

2026-01-27T14:42:33.238Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: invalid connection"

and then

2026-01-27T14:46:44.526Z INFO We are in full cluster crash, starting recovery
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: f03c5d80-59d6-4e1a-b121-3cd5380f9c16

2026-01-27T14:46:44.526Z INFO Results of scanning sequences
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: f03c5d80-59d6-4e1a-b121-3cd5380f9c16
pod: mysqlcluster-pxc-2
maxSeq: 71769

and

2026-01-27T15:23:19.753Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 133999

2026-01-27T16:25:14.631Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 214356

2026-01-27T17:13:06.349Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 272598

So it persistently used mysqlcluster-pxc-2 for recovery, right?

Yes, but I'm trying to understand why the sequence number decreased, what finally caused the seqno to become -1, and how I can avoid this in the future.

Also, I saw from the logs that my cluster crashed 108 times from the 27th to the 29th, mostly due to OOM issues. I've fixed the memory, but is there anything else I should fix as well?

As far as I understand, the pods were crashing for days and the operator performed a lot of full cluster crash recoveries. Can you tell me the number of "Results of scanning sequences" lines in the log?

"Results of scanning sequences" lines: 108

This matches exactly the 108 "full cluster crash" occurrences.
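For reference, both numbers (the count of recovery attempts and the maxSeq chosen each time) can be pulled from the operator log with grep. A sketch, with two sample lines standing in for the real log and the JSON field layout assumed from the excerpts above:

```shell
# Sample lines stand in for the operator log, which in practice would come from:
#   kubectl -n percona-operator logs deploy/percona-xtradb-cluster-operator > /tmp/operator.log
cat > /tmp/operator.log <<'EOF'
{"msg":"Results of scanning sequences","pod":"mysqlcluster-pxc-2","maxSeq":3218826}
{"msg":"Results of scanning sequences","pod":"mysqlcluster-pxc-2","maxSeq":71769}
EOF

grep -c 'Results of scanning sequences' /tmp/operator.log     # recovery attempts (here: 2)
grep -o '"maxSeq":[0-9-]*' /tmp/operator.log | cut -d: -f2    # seqno chosen each time
```

A regression (a later value lower than an earlier one) marks the point where the operator bootstrapped from a node with stale data.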

OK, a few more questions so I can raise this internally.

  1. You mentioned pod-0 had sst_in_progress file. Can you see who was the donor from pod-0 logs?
  2. Operator was trying to bootstrap cluster from pod-2. Was pod-2 performing SST to join the cluster at any point?

I actually set safe_to_bootstrap on pxc-0 and then restarted the pods after increasing the memory, so now the logs only show pxc-0 as the donor, because I set safe_to_bootstrap there.

Operator was trying to bootstrap cluster from pod-2. Was pod-2 performing SST to join the cluster at any point?

Regarding this, I'll check whether I can find it. Thanks for the quick responses @Ege_Gunes

Found this information on pxc-2, with an error:

2026-01-30T01:21:26.163222-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql --server-id=16892862 --innodb_flush_log_at_trx_commit=0 --innodb_flush_method=O_DIRECT --innodb_file_per_table=1 --innodb_buffer_pool_size=6450839552 --innodb_flush_method=O_DIRECT --innodb_flush_log_at_trx_commit=1 --defaults_group=mysqld

2026-01-30T01:21:26.163451-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --socket=/tmp/mysql.sock --compress=lz4 --no-version-check=1 --parallel=4 --user=mysql.pxc.sst.user --password=* --socket=/tmp/mysql.sock --lock-ddl=1 --backup=1 --galera-info=1 --stream=xbstream --xtrabackup-plugin-dir=/usr/bin/pxc_extra/pxb-8.0/lib/plugin --target-dir=/tmp/pxc_sst_Mp5f/donor_xb_QArw

/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup version 8.0.32-26 based on MySQL server 8.0.32 Linux (x86_64) (revision id: 34cf2908)

2026-01-30T01:21:26.163481-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: mysql.pxc.sst.user, password: set, port: not set, socket: /tmp/mysql.sock

2026-01-30T01:21:26.172045-00:00 0 [Note] [MY-011825] [Xtrabackup] Using server version 8.0.32-24.2

2026-01-30T01:21:26.333460-00:00 0 [Note] [MY-011825] [Xtrabackup] Executing LOCK TABLES FOR BACKUP …

xtrabackup: Unknown error 1158

2026-02-03T16:35:47.609851-00:00 0 [ERROR] [MY-011825] [Xtrabackup] failed to execute query 'LOCK TABLES FOR BACKUP' : 2013 (HY000) Lost connection to MySQL server during query

Hey @reddy_nishanth,
I wanted to chime in and answer one of your questions above. When PXC is operating normally, the sequence number (seqno) is tracked in memory, and written to InnoDB’s redo log on transaction commit. The grastate.dat file will show -1 as this file is not written to during normal operations. Only on clean shutdown does grastate.dat get written with a proper sequence number. Since your file has -1, that indicates mysql crashed and needs to perform wsrep-recovery to retrieve the latest seqno from InnoDB’s redo log. The operator should do this automatically.

Make sure that you always do this on the node with the highest seqno. If you set safe_to_bootstrap: 1 on a node with a lower sequence number, you are telling the software that the seqno on this node is correct and the golden source of truth. Any node with a higher seqno will then have its data erased and replaced via SST from this node. That can cause data loss, so be careful there.
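The per-node check described here boils down to running wsrep recovery and reading the reported position. A sketch (the "Recovered position: <uuid>:<seqno>" line is standard Galera recovery output; a sample line stands in for a real run):

```shell
# On each node you would run something like:
#   mysqld --wsrep-recover --user=mysql 2>&1 | grep 'Recovered position'
# and bootstrap only from the node reporting the highest seqno.
line='[Note] [Galera] Recovered position: 103db3ca-fb8f-11f0-8b2e-87a5524eb103:2221259'
echo "$line" | sed -n 's/.*Recovered position: .*:\(-\{0,1\}[0-9][0-9]*\)$/\1/p'
# prints: 2221259
```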


Hi @reddy_nishanth, do you have the full log from the PXC pods? We have logic in the entrypoint (percona-xtradb-cluster-operator/build/pxc-entrypoint.sh at main · percona/percona-xtradb-cluster-operator · GitHub) that should perform wsrep-recovery, and I want to understand why it was skipped or not performed.

Hi @Slava_Sarzhan, the logs are over 100 MB. Let me know if you are looking for some specific patterns and I can search and share, or let me know how I can upload them for you.

Here is the sequence of events I've analysed with AI:

Time Node Event
00:00:24 pxc-1 :cross_mark: XtraBackup KILLED (exit 137) during SST prepare
00:00:28 pxc-1 Crashed with null UUID (00000000-0000-0000-0000-000000000000:-1)
00:00:28 pxc-0, pxc-2 Cluster continues with 2 members
00:07:52 pxc-0 :cross_mark: ALSO XtraBackup KILLED (exit 137) during SST prepare!
00:07:55 pxc-0 Crashed with "Application state transfer failed"
00:08:27 pxc-0 Found saved state: 00000000-0000-0000-0000-000000000000:-1
00:17:19 pxc-0 FULL_PXC_CLUSTER_CRASH detected
00:27:17 pxc-2 FULL_PXC_CLUSTER_CRASH detected
00:27:48 pxc-2 Still had valid state: 103db3ca-fb8f-11f0-8b2e-87a5524eb103:2221259
01:15:09 pxc-2 :cross_mark: pxc-2 ALSO lost its state: 00000000-0000-0000-0000-000000000000:-1

I can see logs like forgetting:

percona-xtradb-cluster {"log":"2026-01-30T01:23:16.079639Z 0 [Note] [MY-000000] [Galera] forgetting f349b16b-9387 (ssl://100.100.156.9:4567)\n","kubernetes":{"pod_name":"mysqlcluster-pxc-0","namespace_name":"percona-operator","pod_id":"11bd429e-5215-4c52-872e-e5ceff6ca448","labels":{"app":"percona-xtradb-cluster","app.kubernetes.io/component":"pxc","app.kubernetes.io/instance":"mysqlcluster","app.kubernetes.io/managed-by":"percona-xtradb-cluster-operator","app.kubernetes.io/name":"percona-xtradb-cluster","app.kubernetes.io/part-of":"percona-xtradb-cluster","apps.kubernetes.io/pod-index":"0","controller-revision-hash":"mysqlcluster-pxc-794bd46cf","polaris_team":"not_available","statefulset.kubernetes.io/pod-name":"mysqlcluster-pxc-0"},

and also killed during prepare

2026-01-30T00:10:28+00:00 percona-xtradb-cluster {"log":"2026-01-30T00:10:28.296717Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 191: 1226 Killed /usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup --no-version-check --use-memory=6450839552 --prepare $rebuildcmd $keyringapplyopt $encrypt_prepare_options --rollback-prepared-trx --xtrabackup-plugin-dir=/usr/bin/pxc_extra/pxb-8.0/lib/plugin --target-dir=${DATA} &> ${DATA}/innobackup.prepare.log\n","kubernetes":{"pod_name":"mysqlcluster-pxc-1","namespace_name":"percona-operator","pod_id":"7bab02ce-fbd1-4f3f-ae77-cdf4440ea4f3","labels":{"app":"percona-xtradb-cluster","app.kubernetes.io/component":"pxc","app.kubernetes.io/instance":"m

and

2026-01-30T00:30:17+00:00 percona-xtradb-cluster {"log":"2026-01-30T00:30:17.471147Z 0 [Warning] [MY-000000] [Galera] Member 2.0 (mysqlcluster-pxc-1) requested state transfer from 'mysqlcluster-pxc-1,', but it is impossible to select State Transfer donor: Resource temporarily unavailable\n","kubernetes":{"pod_name":"mysqlcluster-pxc-2","namespace_name":"percona-operator","pod_id":"8aee6284-430e-4d39-9c6b-73a8d9b27a0f","labels":{"app":"percona-xtradb-cluster","app.kubernetes.io/component":"pxc","app.kubernetes.io/instance":"mysqlcluster","app.kubernetes.io/managed-by":"percona-xtradb-cluster-operator","app.kubernetes.io/name":"percona-xtradb-cluster","app.kubernetes.io/part-of":"percona-xtradb-cluster

Looks like the operator restarted the only valid node, pxc-2, and after that it rejoined as a joiner:

00:26:26 - "Received SHUTDOWN from user "

00:26:41 - "Shutdown complete"

00:27:48 - pxc-2 restarted

00:27:50 - pxc-2 became a JOINER: "Server status change connected → joiner"

It would've failed because of the liveness probe or something; I couldn't find any other reason for the restart, and the kubectl logs are lost.

Hi @reddy_nishanth,

Your log analysis here is thorough. I wanted to add some context to what @matthewb and @Ege_Gunes have covered about why the seqno dropped so dramatically.

I ran some experiments with a 3-node PXC 8.0 cluster to understand the failure mechanism. The key finding: when the operator selects a node whose seqno has regressed (likely because an interrupted SST overwrote its data directory), all other nodes receive that stale state via SST, causing seqno regression across the cluster. In your case, 108 OOM cycles over 48 hours compounded this. Each cycle killed nodes mid-SST, leaving joiners with partial data directories. Eventually the operator selected one of these nodes (with seqno 71,769 instead of the original 3,218,826) to bootstrap from, and the regression cascaded to all nodes.

The XtraBackup error you found on pxc-2 (Lost connection to MySQL server during query) is consistent with the SST being interrupted by OOM. The sst_in_progress file on pxc-0 tells the same story. The seqno -1 in grastate.dat is expected after unclean shutdowns, while the all-zero UUID indicates grastate.dat was missing or overwritten during an interrupted SST rather than simply surviving a crash.

This is tracked in K8SPXC-824.

For prevention: right-size your memory limits (the operator auto-tunes innodb_buffer_pool_size to 75% of containerMemoryLimit, but lowering to 50-70% leaves more headroom for SST and OS overhead in OOM-prone environments).

Also, consider setting autoRecovery: false in production so the operator waits for human intervention instead of looping through crash cycles. As @matthewb noted, always verify seqno with mysqld --wsrep-recover on each node before setting safe_to_bootstrap.
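A sketch of how that setting could be applied to this cluster's CR (the field name spec.pxc.autoRecovery comes from the thread above; the kubectl patch is shown commented out, so verify it against your crVersion before running):

```shell
# Hypothetical: make the operator stop and wait for a human on a full cluster
# crash instead of looping through automatic recoveries.
patch='{"spec":{"pxc":{"autoRecovery":false}}}'
# kubectl -n percona-operator patch pxc mysqlcluster --type merge -p "$patch"
echo "$patch"
```

Manual recovery then uses the USR1 signal command shown in the FULL_PXC_CLUSTER_CRASH banner earlier in this thread.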


Thank you @anderson.nogueira, @Ege_Gunes, and @matthewb for looking into this issue and providing prompt responses.

You guys are amazing with all the help, great work :folded_hands:


Hi Team, I've again run into a similar issue where the cluster is stuck in the initializing state.

I’ve summarised the logs here with the help of AI


Environment

  • Percona XtraDB Cluster: 8.0.42-33.1
  • WSREP version: 26.1.4.3
  • Galera library: libgalera_smm.so (Galera 4)
  • Percona Operator: crVersion 1.19.0
  • Cluster: 3 PXC nodes + 3 HAProxy nodes
  • Update strategy: SmartUpdate
  • SST method: xtrabackup-v2
  • PXC memory limit: 16Gi (requests: 8Gi)

Timeline
────────────────────────────────────────
Time (UTC): Mar 3, 12:10
Event: Cluster status → ready.
Evidence: PXC CR status conditions
────────────────────────────────────────
Time (UTC): Mar 3, 19:03-19:04
Event: Brief error (invalid connection), auto-recovered. Cluster back to ready at 19:04:36.
Evidence: PXC CR status conditions
────────────────────────────────────────
Time (UTC): Mar 4, 08:52:10
Event: Cluster hits error: manage sys users: is old password discarded: select User_attributes field: invalid connection
Evidence: PXC CR status conditions
────────────────────────────────────────
Time (UTC): Mar 4, 08:52:11
Event: Cluster status → initializing (never recovers after this point).
Evidence: PXC CR status conditions
────────────────────────────────────────
Time (UTC): Mar 4, ~08:53
Event: PXC-1 liveness probe starts failing. Two failure modes observed: (1) command timed out:
"/var/lib/mysql/liveness-check.sh" timed out after 15s - script timed out under SST I/O load.
(2) [[ non-Primary == Primary ]] → exit 1 - liveness script detected wsrep_cluster_status as
non-Primary (Donor/Desynced state). No OOM on PXC-1's node - confirmed via dmesg on
10.0.213.227 showing zero memory events.
Evidence: describe pod mysqlcluster-pxc-1 Events; dmesg on PXC-1 node clean
────────────────────────────────────────
Time (UTC): Mar 4, 08:53:15
Event: Kubelet kills PXC-1 after 5 consecutive liveness failures. PXC-1 receives SHUTDOWN from
user <via user signal>.
Evidence: PXC-1 previous logs: Received SHUTDOWN from user
────────────────────────────────────────
Time (UTC): Mar 4, 08:53-08:57
Event: PXC-0 loses its peer (PXC-1). Tries to reconnect 30+ times (Failed to establish connection:
Connection refused). PXC-0 was in JOINER state receiving SST from PXC-1. SST fails. PXC-0 crashes
with signal 11 (SEGFAULT) in libgalera_smm.so during SST cancellation.
Evidence: PXC-0 previous logs: mysqld got signal 11, backtrace in /usr/lib64/galera4/libgalera_smm.so
────────────────────────────────────────
Time (UTC): Mar 4, 08:58:00
Event: PXC-0 OOM-killed by kernel. mysqld hit the 16Gi container memory limit during SST joiner
processing. Kernel log: memory: usage 16777216kB, limit 16777216kB, failcnt 18719. Process stats:
total-vm:30853816kB (~29.4GB), anon-rss:16727232kB (~16GB).
Evidence: dmesg on PXC-0 node (10.0.213.224): oom-kill: task=mysqld, pid=3290821
────────────────────────────────────────
Time (UTC): Mar 4, 08:57-08:59
Event: PXC-2 also trying SST as joiner. SST fails with Broken pipe. Error: State transfer request
failed unrecoverably: 32 (Broken pipe). PXC-2 also crashes with signal 11 (SEGFAULT) in the same
Galera library.
Evidence: PXC-2 previous logs: SST script aborted with error 32 (Broken pipe), mysqld got signal 11
────────────────────────────────────────
Time (UTC): Mar 4, 08:54:35
Event: Operator detects: "We are in full cluster crash, starting recovery" (second time).
Evidence: Operator logs
────────────────────────────────────────
Time (UTC): Mar 4, 08:55-08:57
Event: PXC-0 bootstraps as primary (view 738), requests SST as JOINER. Gets error: Missing version
string in comparison and xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. SST fails
again.
Evidence: PXC-0 previous logs: FATAL ERROR: xtrabackup_checkpoints missing
────────────────────────────────────────
Time (UTC): Mar 4, ~09:00
Event: All 3 pods restart. PXC-0 and PXC-2 detect FULL_PXC_CLUSTER_CRASH with seqno: -1 and wait. PXC-1
enters SST donor mode but SST stream is stuck at 0 bytes/sec.
Evidence: PXC-1 current logs: donor: => Rate:[0.00 B/s] for 5+ hours (Elapsed: 5:41:00)
────────────────────────────────────────
Time (UTC): Now
Event: PXC pods show 2/2 Running but MySQL is NOT running inside them. PXC-0 & PXC-2 are sitting at the
crash recovery prompt. PXC-1 is stuck in SST donor mode sending 0 bytes. Cluster state: initializing,
ready: 1/3.
Evidence: mysqladmin ping on PXC-0: Can't connect to local MySQL server through socket


Root Causes

  1. Primary trigger: PXC-0 needed a StatefulSet revision update (SmartUpdate strategy). The operator
    deleted and recreated PXC-0, which then required a full SST to rejoin the cluster (its data was stale
    after recreation).
  2. Cascading failure: PXC-1 was selected as SST donor. During the heavy xtrabackup I/O, its liveness
    probe failed (no OOM — confirmed clean dmesg on PXC-1’s node) because:
    • The liveness check script (/var/lib/mysql/liveness-check.sh) timed out (>15s) under I/O load
    • The script checks wsrep_cluster_status — during SST donor mode, PXC-1 reports non-Primary
      (Donor/Desynced), which the script treats as unhealthy
    • After 5 consecutive failures, kubelet killed PXC-1
  3. OOM on PXC-0: When PXC-0 was receiving SST as a joiner, its mysqld process was OOM-killed by the
    kernel at 08:58 UTC. It hit the 16Gi container memory limit with failcnt: 18719 (18,719 failed memory
    allocations). The process had grown to ~29.4 GB virtual memory. The 16Gi limit appears insufficient for
    SST joiner operations (xtrabackup receive + decompress + InnoDB tablespace import) at this dataset size.
  4. Galera bug: When PXC-1 was killed mid-SST, both PXC-0 and PXC-2 crashed with signal 11 (SEGFAULT) in
    libgalera_smm.so during SST cancellation. Backtrace:
    /usr/lib64/galera4/libgalera_smm.so(+0x45b09)
    /usr/lib64/galera4/libgalera_smm.so(+0x62491)
    /usr/lib64/galera4/libgalera_smm.so(+0x82aba)
    /usr/lib64/galera4/libgalera_smm.so(+0x7683d)
    /usr/lib64/galera4/libgalera_smm.so(+0x77283)
    /usr/lib64/galera4/libgalera_smm.so(+0x778db)
    /usr/lib64/galera4/libgalera_smm.so(+0x98fc6)
  5. Recovery deadlock (current state): After the full cluster crash, auto-recovery started but is stuck:
    • All nodes have seqno: -1
    • PXC-1 entered SST donor mode but the stream is hung (0 bytes transferred for 5+ hours)
    • PXC-0 and PXC-2 are waiting at the recovery prompt
    • No MySQL instance is serving queries

Liveness Probe Configuration

Liveness: exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=15s period=10s #success=1
#failure=5
Readiness: exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1
#failure=5


I'm going to increase the memory of PXC from 16 GB to 24 GB, but can you also let me know if something is wrong with the liveness probe that's causing these failures and leaving the cluster stuck?
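One mitigation worth testing, sketched under the assumption that your operator version exposes probe overrides under spec.pxc.livenessProbes (check the CRD for crVersion 1.19.0 before applying): give the liveness check more slack so a busy SST donor is not killed mid-transfer.

```shell
# Hypothetical patch: longer probe timeout and a higher failure threshold.
# Field names are assumed from the operator CRD and must be verified.
patch='{"spec":{"pxc":{"livenessProbes":{"timeoutSeconds":30,"failureThreshold":8}}}}'
# kubectl -n percona-operator patch pxc mysqlcluster --type merge -p "$patch"
echo "$patch"
```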