Full Percona cluster crash after OOM

Description:

Why did my cluster crash and become non-recoverable after OOM?

Steps to Reproduce:

The PXC cluster pods were OOM-killed, and there were many restarts:

polaris@RKLAB-RVMHM255S006481:~$ k get all -n percona-operator
NAME                                                   READY   STATUS                       RESTARTS           AGE
pod/mysqlcluster-haproxy-0                             1/2     Running                      1508 (11s ago)     22d
pod/mysqlcluster-haproxy-1                             1/2     CrashLoopBackOff             1512 (2m1s ago)    22d
pod/mysqlcluster-haproxy-2                             1/2     Running                      1435 (2m35s ago)   22d
pod/mysqlcluster-pxc-0                                 1/1     Running                      545 (4d14h ago)    22d
pod/mysqlcluster-pxc-1                                 1/1     Running                      661 (4d14h ago)    22d
pod/mysqlcluster-pxc-2                                 1/1     Running                      212 (4d14h ago)    22d
pod/percona-xtradb-cluster-operator-584685d9df-sl7mv   1/1     Running                      1 (31h ago)        22d
pod/xb-backup-260122-1134-44wmj                        0/1     CreateContainerConfigError   0                  12d

NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                          AGE
service/mysqlcluster-haproxy              ClusterIP   100.96.148.158   <none>        3306/TCP,3309/TCP,33062/TCP,33060/TCP,8404/TCP   22d
service/mysqlcluster-haproxy-replicas     ClusterIP   100.96.64.20     <none>        3306/TCP                                         22d
service/mysqlcluster-pxc                  ClusterIP   None             <none>        3306/TCP,33062/TCP,33060/TCP                     22d
service/mysqlcluster-pxc-unready          ClusterIP   None             <none>        3306/TCP,33062/TCP,33060/TCP                     22d
service/percona-xtradb-cluster-operator   ClusterIP   100.96.204.93    <none>        443/TCP                                          22d

NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/percona-xtradb-cluster-operator   1/1     1            1           22d

NAME                                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/percona-xtradb-cluster-operator-584685d9df   1         1         1       22d

NAME                                    READY   AGE
statefulset.apps/mysqlcluster-haproxy   0/3     22d
statefulset.apps/mysqlcluster-pxc       3/3     22d

The reason was OOMKilled.
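
For reference, a quick way to confirm the OOM kills from the container status (a minimal sketch; pod and container names are taken from the output above):

# Show the last termination reason of the pxc container (expected: OOMKilled)
kubectl -n percona-operator get pod mysqlcluster-pxc-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="pxc")].lastState.terminated.reason}'

# Or look at the full termination details and restart counts
kubectl -n percona-operator describe pod mysqlcluster-pxc-0 | grep -A5 'Last State'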

After that, all nodes started reporting seqno -1:

polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-1 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-2 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
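
For convenience, the same check can be scripted across all three nodes at once (a sketch; it assumes the pod naming shown above):

for i in 0 1 2; do
  echo "--- mysqlcluster-pxc-$i ---"
  kubectl -n percona-operator exec mysqlcluster-pxc-$i -c pxc -- \
    cat /var/lib/mysql/grastate.dat
done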

And the logs from pxc-1 show that the highest sequence number is -1:

#####################################################FULL_PXC_CLUSTER_CRASH:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local node with sequence number (seqno): -1
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local:-1:#####################################################
polaris@RKLAB-RVMHM255S006481:~$
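
As the banner notes, recovery is automatic unless spec.pxc.autoRecovery is set to false. A sketch of how that field can be checked (and flipped back on if needed), assuming the CRD short name pxc and the cluster name mysqlcluster from above:

# Check whether automatic full-cluster-crash recovery is enabled
kubectl -n percona-operator get pxc mysqlcluster -o jsonpath='{.spec.pxc.autoRecovery}'

# Enable it if it was turned off
kubectl -n percona-operator patch pxc mysqlcluster --type=merge \
  -p '{"spec":{"pxc":{"autoRecovery":true}}}'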

Version:

Operator version 1.17.0

Logs:

Attached above. In additional logs, I saw that an empty sst_in_progress file (-rw-rw---- 1 mysql mysql 0 Jan 30 01:16 sst_in_progress) was created on Jan 30, when this all happened:

polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- ls -la /var/lib/mysql/
total 2274548
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_0.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_1.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_10.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_11.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_12.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_13.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_14.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_15.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:37 #ib_16384_2.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_3.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_4.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_5.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_6.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_7.dblwr
-rw-r----- 1 mysql  mysql       65536 Feb  3 16:39 #ib_16384_8.dblwr
-rw-r----- 1 mysql  mysql     1114112 Feb  3 16:29 #ib_16384_9.dblwr
drwxr-s--- 2 mysql  mysql        4096 Feb  3 16:30 #innodb_redo
drwxr-s--- 2 mysql  mysql         187 Feb  3 16:30 #innodb_temp
drwxrwsrwx 7 root   mysql       12288 Feb  3 16:37 .
drwxr-xr-x 1 root   root           19 May 18  2023 ..
-rw-rw---- 1 mysql  mysql         399 Jan 27 13:47 GRA_10_3101111_v2.log
-rw-rw---- 1 mysql  mysql         386 Jan 27 13:47 GRA_10_3102292_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_10_3102569_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_10_3102601_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:03 GRA_11_3159148_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:03 GRA_11_3163950_v2.log
-rw-rw---- 1 mysql  mysql         483 Jan 27 14:04 GRA_11_3170913_v2.log
-rw-rw---- 1 mysql  mysql         456 Jan 27 14:04 GRA_11_3170917_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 14:03 GRA_1_3159556_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 14:04 GRA_1_3168943_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 14:04 GRA_1_3169716_v2.log
-rw-rw---- 1 mysql  mysql         371 Jan 27 14:04 GRA_1_3170914_v2.log
-rw-rw---- 1 mysql  mysql         371 Jan 27 14:04 GRA_1_3175329_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 13:46 GRA_2_3097310_v2.log
-rw-rw---- 1 mysql  mysql         376 Jan 27 13:47 GRA_2_3100880_v2.log
-rw-rw---- 1 mysql  mysql         492 Jan 27 13:47 GRA_2_3102064_v2.log
-rw-rw---- 1 mysql  mysql         412 Jan 27 13:47 GRA_2_3102293_v2.log
-rw-rw-r-- 1 mysql  mysql          22 Feb  3 16:29 auth_plugin
-rw-r----- 1 mysql  mysql          56 Feb  3 16:29 auto.cnf
-rw-r----- 1 mysql  mysql         180 Feb  3 16:30 binlog.000001
-rw-r----- 1 mysql  mysql         201 Feb  3 16:31 binlog.000002
-rw-r----- 1 mysql  mysql         157 Feb  3 16:31 binlog.000003
-rw-r----- 1 mysql  mysql          48 Feb  3 16:31 binlog.index
-rw-rw---- 1 mysql  mysql  2147484952 Feb  3 16:37 galera.cache
-rwxr-xr-x 1 daemon daemon       1138 Feb  3 16:29 get-pxc-state
-rw-rw---- 1 mysql  mysql         118 Feb  3 16:37 grastate.dat
-rw-r----- 1 mysql  mysql         264 Feb  3 16:37 gvwstate.dat
-rw-r----- 1 mysql  mysql        6638 Feb  3 16:30 ib_buffer_pool
-rw-r----- 1 mysql  mysql    12582912 Feb  3 16:37 ibdata1
-rw-r----- 1 mysql  mysql    12582912 Feb  3 16:30 ibtmp1
-rw-rw---- 1 mysql  mysql       44185 Feb  3 16:31 innobackup.backup.log
-rw-rw---- 1 mysql  mysql    76004458 Jan 27 14:02 innobackup.move.log
-rw-rw---- 1 mysql  mysql       33821 Jan 27 14:02 innobackup.prepare.log
-rwxr-xr-x 1 daemon daemon       1708 Feb  3 16:29 liveness-check.sh
drwxr-s--- 2 mysql  mysql         232 Feb  3 16:29 mysql
-rwxr-xr-x 1 daemon daemon    1740952 Feb  3 16:29 mysql-state-monitor
-rw-rw-r-- 1 mysql  mysql        2046 Feb  3 16:30 mysql-state-monitor.log
-rw-r----- 1 mysql  mysql    31457280 Feb  3 16:37 mysql.ibd
-rw-rw-r-- 1 mysql  mysql         147 Feb  3 16:30 mysql.state
-rw-r----- 1 mysql  mysql         792 Feb  3 16:30 mysqlcluster-pxc-0-slow.log
-rw-r----- 1 mysql  mysql           2 Feb  3 16:30 mysqlcluster-pxc-0.pid
-rw-rw---- 1 mysql  mysql        4489 Jan 27 14:02 mysqld.post.processing.log
srwxrwxrwx 1 mysql  mysql           0 Feb  3 16:30 mysqlx.sock
-rw------- 1 mysql  mysql           2 Feb  3 16:30 mysqlx.sock.lock
srwxr-xr-x 1 mysql  mysql           0 Feb  3 16:29 notify.sock
-rwxr-xr-x 1 daemon daemon    3852289 Feb  3 16:29 peer-list
drwxr-s--- 2 mysql  mysql        8192 Feb  3 16:29 performance_schema
-rwxr-xr-x 1 daemon daemon        830 Feb  3 16:29 pmm-prerun.sh
-rw------- 1 mysql  mysql        1680 Feb  3 16:29 private_key.pem
-rw-r--r-- 1 mysql  mysql         452 Feb  3 16:29 public_key.pem
-rwxr-xr-x 1 daemon daemon       5471 Feb  3 16:29 pxc-configure-pxc.sh
-rwxr-xr-x 1 daemon daemon      25159 Feb  3 16:29 pxc-entrypoint.sh
-rwxr-xr-x 1 daemon daemon       1470 Feb  3 16:29 readiness-check.sh
-rw-rw---- 1 mysql  mysql           0 Jan 30 01:16 sst_in_progress
drwxr-s--- 2 mysql  mysql          28 Feb  3 16:29 sys
-rw-r----- 1 mysql  mysql    16777216 Feb  3 16:39 undo_001
-rw-r----- 1 mysql  mysql    16777216 Feb  3 16:39 undo_002
-rwxr-xr-x 1 daemon daemon        543 Feb  3 16:29 unsafe-bootstrap.sh
-rw-rw-r-- 1 mysql  mysql         142 Feb  3 16:30 version_info
-rwxr-xr-x 1 daemon daemon        618 Feb  3 16:29 wsrep_cmd_notify_handler.sh
-rw-rw-r-- 1 mysql  mysql       98344 Jan 27 14:14 wsrep_recovery_verbose_history.log
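
The empty sst_in_progress file suggests an SST was started on this node and never completed. One way to check whether an SST is actually still running, rather than the file simply being stale, is to look at the node's wsrep state; a sketch, where the secret name mysqlcluster-secrets is an assumption based on the operator's default naming:

# Fetch the root password from the cluster secret (secret name is assumed)
ROOT_PASS=$(kubectl -n percona-operator get secret mysqlcluster-secrets \
  -o jsonpath='{.data.root}' | base64 -d)

# Donor/Desynced or Joining would indicate an SST in flight; Synced suggests the file is stale
kubectl -n percona-operator exec mysqlcluster-pxc-0 -c pxc -- \
  mysql -uroot -p"$ROOT_PASS" -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"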

Expected Result:

Expected that after the OOM kills, the data would still be present.

Actual Result:

Tables and other data were deleted.

Additional Information:

How can I debug this and determine what caused the failure?

Even if pods are killed by OOM during SST, the data should be intact on the donor node. I wonder whether the operator bootstrapped the cluster from the wrong node. Do you have operator logs from that time?
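
In case it helps, a sketch of how the relevant operator logs could be pulled for that window (the --since-time value is just an example covering the incident; note that kubectl logs only returns what the current container still holds, so older lines may need to come from log aggregation):

# Dump operator logs from the incident window into a local file
kubectl -n percona-operator logs deploy/percona-xtradb-cluster-operator \
  --since-time='2026-01-27T13:00:00Z' > operator.log

# Pull out the lines relevant to the crash/recovery timeline
grep -E 'full cluster crash|Results of scanning sequences|WSREP has not yet prepared|cluster is not ready' operator.log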

Hi, looking at the operator logs, here is a summary:

Jan 11, 2026 15:58 - Operator started successfully, cluster was healthy (PXC version 8.0.32-24.2)
Jan 11, 2026 18:14 - First connection issues appear (pxc-2 connection refused)
Jan 12-14 - Intermittent connection issues continue (connection refused, connection reset by peer)
Jan 25, 2026 20:28 - First "cluster is not ready" message appears
Jan 27, 2026 13:39 - WSREP errors begin: "WSREP has not yet prepared node for application use"
Jan 27-28 - Continuous WSREP errors, cluster completely down
Jan 28, 2026 00:01 - Connection refused errors to all nodes (pxc-0, pxc-1)
Jan 28 onwards - Cluster remains down with continuous reconcile errors

The exact log line is:

2026-01-27T13:39:18+00:00 percona-xtradb-cluster-operator {"log":"2026-01-27T13:39:18.495Z\tINFO\treconcile replication error\t{"controller": "pxc-controller", "namespace": "percona-operator", "name": "mysqlcluster", "reconcileID": "8aa6d2ee-b277-4561-96c3-be636f271208", "err": "remove outdated replication channels: get current replication channels: select current replication channels: Error 1047 (08S01): WSREP has not yet prepared node for application use"}

Let me know if you want me to search for some specific errors; the log file is huge.

Do you have the "Results of scanning sequences" logs?

Summary

Field                 Value
Timestamp             2026-01-29T07:44:49.550Z
Level                 INFO
Message               Results of scanning sequences
Controller            pxc-controller
Namespace             percona-operator
Cluster Name          mysqlcluster
ReconcileID           a4d06dd7-22fe-4d60-88e5-6582b4423b01
Selected Pod          mysqlcluster-pxc-2
Max Sequence Number   1,630,159

Timestamp (UTC)       maxSeq
2026-01-27 13:41:04   3,092,897
2026-01-27 14:13:40   3,218,826
2026-01-27 14:46:44   71,769      (dropped dramatically!)
2026-01-27 15:23:19   133,999
2026-01-27 16:25:14   214,356
(steadily increasing)
2026-01-29 20:37:09   2,090,666

It was using pxc-2 before.

Adding more logs

2026-01-27T14:13:40.274Z INFO We are in full cluster crash, starting recovery
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: 5546e2de-0ce7-41aa-aa00-20af48772fa4

2026-01-27T14:13:40.274Z INFO Results of scanning sequences
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: 5546e2de-0ce7-41aa-aa00-20af48772fa4
pod: mysqlcluster-pxc-2
maxSeq: 3218826

Replication errors

2026-01-27T14:14:14.396Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: dial tcp 100.96.148.158:3306: connect: connection refused"

2026-01-27T14:15:22.939Z INFO reconcile replication error
err: "failed to ensure cluster readonly status: connect to pod mysqlcluster-pxc-1: dial tcp 100.100.156.9:33062: connect: connection refused"

2026-01-27T14:30:29.684Z INFO reconcile replication error
err: "failed to ensure cluster readonly status: connect to pod mysqlcluster-pxc-0: dial tcp 100.100.140.20:33062: connect: connection refused"

2026-01-27T14:39:17.679Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: invalid connection"

2026-01-27T14:42:33.238Z INFO reconcile replication error
err: "get primary pxc pod: failed to get proxy connection: invalid connection"

and then

2026-01-27T14:46:44.526Z INFO We are in full cluster crash, starting recovery
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: f03c5d80-59d6-4e1a-b121-3cd5380f9c16

2026-01-27T14:46:44.526Z INFO Results of scanning sequences
controller: pxc-controller
namespace: percona-operator
name: mysqlcluster
reconcileID: f03c5d80-59d6-4e1a-b121-3cd5380f9c16
pod: mysqlcluster-pxc-2
maxSeq: 71769

and

2026-01-27T15:23:19.753Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 133999

2026-01-27T16:25:14.631Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 214356

2026-01-27T17:13:06.349Z INFO Results of scanning sequences
pod: mysqlcluster-pxc-2
maxSeq: 272598

So it persistently used mysqlcluster-pxc-2 for recovery, right?

Yes, but I'm trying to understand why the sequence number dropped, what finally caused the seqno to become -1, and how I can avoid this in the future.

I also saw from the logs that my cluster crashed 108 times from the 27th to the 29th, mostly due to OOM issues. I've fixed the memory, but is there anything else I should fix as well?

As far as I understand, the pods were crashing for days and the operator performed a lot of full cluster crash recoveries. Can you tell me the number of "Results of scanning sequences" lines in the log?

"Results of scanning sequences" lines: 108

This matches exactly the 108 "full cluster crash" occurrences.
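
For reference, counts like these can be produced with simple greps over the saved operator log (operator.log is a hypothetical local file name, as in the earlier sketch):

# Both counts should match the 108 reported above
grep -c 'Results of scanning sequences' operator.log
grep -c 'full cluster crash' operator.log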

OK, a few more questions so I can raise this internally.

  1. You mentioned pod-0 had an sst_in_progress file. Can you see who the donor was from the pod-0 logs?
  2. Operator was trying to bootstrap cluster from pod-2. Was pod-2 performing SST to join the cluster at any point?

I actually set safe_to_bootstrap on pxc-0 and then restarted the pods after increasing the memory, so now the logs only show pxc-0 as the donor, because I set safe_to_bootstrap there.
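
For context, a sketch of the kind of commands involved in that manual step (Galera's guidance is to do this only on the node with the most recent data):

# Mark pxc-0 as safe to bootstrap (only on the node chosen as most advanced)
kubectl -n percona-operator exec mysqlcluster-pxc-0 -c pxc -- \
  sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# Verify the change
kubectl -n percona-operator exec mysqlcluster-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat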

Operator was trying to bootstrap cluster from pod-2. Was pod-2 performing SST to join the cluster at any point?

Regarding this, I'll check if I can find it. Thanks for the quick responses @Ege_Gunes.
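
One way to look for SST activity on pxc-2 (a sketch; --previous only works if the prior container's logs are still available):

# Search current and previous pxc-2 container logs for SST / donor / joiner messages
kubectl -n percona-operator logs mysqlcluster-pxc-2 -c pxc | grep -iE 'sst|donor|joiner'
kubectl -n percona-operator logs mysqlcluster-pxc-2 -c pxc --previous | grep -iE 'sst|donor|joiner'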

Found this information from pxc-2, with an error:

2026-01-30T01:21:26.163222-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql --server-id=16892862 --innodb_flush_log_at_trx_commit=0 --innodb_flush_method=O_DIRECT --innodb_file_per_table=1 --innodb_buffer_pool_size=6450839552 --innodb_flush_method=O_DIRECT --innodb_flush_log_at_trx_commit=1 --defaults_group=mysqld

2026-01-30T01:21:26.163451-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --socket=/tmp/mysql.sock --compress=lz4 --no-version-check=1 --parallel=4 --user=mysql.pxc.sst.user --password=* --socket=/tmp/mysql.sock --lock-ddl=1 --backup=1 --galera-info=1 --stream=xbstream --xtrabackup-plugin-dir=/usr/bin/pxc_extra/pxb-8.0/lib/plugin --target-dir=/tmp/pxc_sst_Mp5f/donor_xb_QArw

/usr/bin/pxc_extra/pxb-8.0/bin/xtrabackup version 8.0.32-26 based on MySQL server 8.0.32 Linux (x86_64) (revision id: 34cf2908)

2026-01-30T01:21:26.163481-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: mysql.pxc.sst.user, password: set, port: not set, socket: /tmp/mysql.sock

2026-01-30T01:21:26.172045-00:00 0 [Note] [MY-011825] [Xtrabackup] Using server version 8.0.32-24.2

2026-01-30T01:21:26.333460-00:00 0 [Note] [MY-011825] [Xtrabackup] Executing LOCK TABLES FOR BACKUP …

xtrabackup: Unknown error 1158

2026-02-03T16:35:47.609851-00:00 0 [ERROR] [MY-011825] [Xtrabackup] failed to execute query 'LOCK TABLES FOR BACKUP' : 2013 (HY000) Lost connection to MySQL server during query