Description:
Why did my cluster crash and become non-recoverable after OOM?
Steps to Reproduce:
The PXC cluster pods were OOM-killed, and there were many restarts:
polaris@RKLAB-RVMHM255S006481:~$ k get all -n percona-operator
NAME READY STATUS RESTARTS AGE
pod/mysqlcluster-haproxy-0 1/2 Running 1508 (11s ago) 22d
pod/mysqlcluster-haproxy-1 1/2 CrashLoopBackOff 1512 (2m1s ago) 22d
pod/mysqlcluster-haproxy-2 1/2 Running 1435 (2m35s ago) 22d
pod/mysqlcluster-pxc-0 1/1 Running 545 (4d14h ago) 22d
pod/mysqlcluster-pxc-1 1/1 Running 661 (4d14h ago) 22d
pod/mysqlcluster-pxc-2 1/1 Running 212 (4d14h ago) 22d
pod/percona-xtradb-cluster-operator-584685d9df-sl7mv 1/1 Running 1 (31h ago) 22d
pod/xb-backup-260122-1134-44wmj 0/1 CreateContainerConfigError 0 12d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mysqlcluster-haproxy ClusterIP 100.96.148.158 <none> 3306/TCP,3309/TCP,33062/TCP,33060/TCP,8404/TCP 22d
service/mysqlcluster-haproxy-replicas ClusterIP 100.96.64.20 <none> 3306/TCP 22d
service/mysqlcluster-pxc ClusterIP None <none> 3306/TCP,33062/TCP,33060/TCP 22d
service/mysqlcluster-pxc-unready ClusterIP None <none> 3306/TCP,33062/TCP,33060/TCP 22d
service/percona-xtradb-cluster-operator ClusterIP 100.96.204.93 <none> 443/TCP 22d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/percona-xtradb-cluster-operator 1/1 1 1 22d
NAME DESIRED CURRENT READY AGE
replicaset.apps/percona-xtradb-cluster-operator-584685d9df 1 1 1 22d
NAME READY AGE
statefulset.apps/mysqlcluster-haproxy 0/3 22d
statefulset.apps/mysqlcluster-pxc 3/3 22d
The reason was OOMKilled.
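For reference, the OOMKilled reason can be confirmed from the containers' last terminated state, for example:
kubectl -n percona-operator get pod mysqlcluster-pxc-1 -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl -n percona-operator describe pod mysqlcluster-pxc-1 | grep -A5 'Last State'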
After that, all three PXC nodes started reporting seqno: -1 in grastate.dat:
polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-1 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
polaris@RKLAB-RVMHM255S006481:~$ kubectl exec -n percona-operator mysqlcluster-pxc-2 -c pxc -- cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
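Since grastate.dat shows a zeroed UUID and seqno: -1 on every node, the last committed position would normally have to be recovered from InnoDB (wsrep recovery). If the wsrep recovery log shown in the listing further below (wsrep_recovery_verbose_history.log) still contains it, something like this should print the last recorded position:
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- sh -c 'grep -i "recovered position" /var/lib/mysql/wsrep_recovery_verbose_history.log | tail -n 1'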
The logs from pxc-1 also show that the highest sequence number found is -1:
#####################################################FULL_PXC_CLUSTER_CRASH:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local node with sequence number (seqno): -1
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona-operator exec mysqlcluster-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:mysqlcluster-pxc-1.mysqlcluster-pxc.percona-operator.svc.cluster.local:-1:#####################################################
polaris@RKLAB-RVMHM255S006481:~$
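The operator log above refers to spec.pxc.autoRecovery. Its current value can be checked with something like this (assuming the custom resource is named mysqlcluster, matching the pod name prefix):
kubectl -n percona-operator get pxc mysqlcluster -o jsonpath='{.spec.pxc.autoRecovery}'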
Version:
Operator version 1.17.0
Logs:
Logs are attached above. Additionally, I noticed an empty sst_in_progress marker file (-rw-rw---- 1 mysql mysql 0 Jan 30 01:16 sst_in_progress) dated Jan 30, when this all happened:
polaris@RKLAB-RVMHM255S006481:~$ sudo kubectl exec -n percona-operator mysqlcluster-pxc-0 -c pxc -- ls -la /var/lib/mysql/
total 2274548
-rw-r----- 1 mysql mysql 65536 Feb 3 16:37 #ib_16384_0.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_1.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:39 #ib_16384_10.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_11.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:37 #ib_16384_12.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_13.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:39 #ib_16384_14.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_15.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:37 #ib_16384_2.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_3.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:39 #ib_16384_4.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_5.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:39 #ib_16384_6.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_7.dblwr
-rw-r----- 1 mysql mysql 65536 Feb 3 16:39 #ib_16384_8.dblwr
-rw-r----- 1 mysql mysql 1114112 Feb 3 16:29 #ib_16384_9.dblwr
drwxr-s--- 2 mysql mysql 4096 Feb 3 16:30 #innodb_redo
drwxr-s--- 2 mysql mysql 187 Feb 3 16:30 #innodb_temp
drwxrwsrwx 7 root mysql 12288 Feb 3 16:37 .
drwxr-xr-x 1 root root 19 May 18 2023 ..
-rw-rw---- 1 mysql mysql 399 Jan 27 13:47 GRA_10_3101111_v2.log
-rw-rw---- 1 mysql mysql 386 Jan 27 13:47 GRA_10_3102292_v2.log
-rw-rw---- 1 mysql mysql 376 Jan 27 13:47 GRA_10_3102569_v2.log
-rw-rw---- 1 mysql mysql 376 Jan 27 13:47 GRA_10_3102601_v2.log
-rw-rw---- 1 mysql mysql 492 Jan 27 14:03 GRA_11_3159148_v2.log
-rw-rw---- 1 mysql mysql 492 Jan 27 14:03 GRA_11_3163950_v2.log
-rw-rw---- 1 mysql mysql 483 Jan 27 14:04 GRA_11_3170913_v2.log
-rw-rw---- 1 mysql mysql 456 Jan 27 14:04 GRA_11_3170917_v2.log
-rw-rw---- 1 mysql mysql 376 Jan 27 14:03 GRA_1_3159556_v2.log
-rw-rw---- 1 mysql mysql 492 Jan 27 14:04 GRA_1_3168943_v2.log
-rw-rw---- 1 mysql mysql 376 Jan 27 14:04 GRA_1_3169716_v2.log
-rw-rw---- 1 mysql mysql 371 Jan 27 14:04 GRA_1_3170914_v2.log
-rw-rw---- 1 mysql mysql 371 Jan 27 14:04 GRA_1_3175329_v2.log
-rw-rw---- 1 mysql mysql 492 Jan 27 13:46 GRA_2_3097310_v2.log
-rw-rw---- 1 mysql mysql 376 Jan 27 13:47 GRA_2_3100880_v2.log
-rw-rw---- 1 mysql mysql 492 Jan 27 13:47 GRA_2_3102064_v2.log
-rw-rw---- 1 mysql mysql 412 Jan 27 13:47 GRA_2_3102293_v2.log
-rw-rw-r-- 1 mysql mysql 22 Feb 3 16:29 auth_plugin
-rw-r----- 1 mysql mysql 56 Feb 3 16:29 auto.cnf
-rw-r----- 1 mysql mysql 180 Feb 3 16:30 binlog.000001
-rw-r----- 1 mysql mysql 201 Feb 3 16:31 binlog.000002
-rw-r----- 1 mysql mysql 157 Feb 3 16:31 binlog.000003
-rw-r----- 1 mysql mysql 48 Feb 3 16:31 binlog.index
-rw-rw---- 1 mysql mysql 2147484952 Feb 3 16:37 galera.cache
-rwxr-xr-x 1 daemon daemon 1138 Feb 3 16:29 get-pxc-state
-rw-rw---- 1 mysql mysql 118 Feb 3 16:37 grastate.dat
-rw-r----- 1 mysql mysql 264 Feb 3 16:37 gvwstate.dat
-rw-r----- 1 mysql mysql 6638 Feb 3 16:30 ib_buffer_pool
-rw-r----- 1 mysql mysql 12582912 Feb 3 16:37 ibdata1
-rw-r----- 1 mysql mysql 12582912 Feb 3 16:30 ibtmp1
-rw-rw---- 1 mysql mysql 44185 Feb 3 16:31 innobackup.backup.log
-rw-rw---- 1 mysql mysql 76004458 Jan 27 14:02 innobackup.move.log
-rw-rw---- 1 mysql mysql 33821 Jan 27 14:02 innobackup.prepare.log
-rwxr-xr-x 1 daemon daemon 1708 Feb 3 16:29 liveness-check.sh
drwxr-s--- 2 mysql mysql 232 Feb 3 16:29 mysql
-rwxr-xr-x 1 daemon daemon 1740952 Feb 3 16:29 mysql-state-monitor
-rw-rw-r-- 1 mysql mysql 2046 Feb 3 16:30 mysql-state-monitor.log
-rw-r----- 1 mysql mysql 31457280 Feb 3 16:37 mysql.ibd
-rw-rw-r-- 1 mysql mysql 147 Feb 3 16:30 mysql.state
-rw-r----- 1 mysql mysql 792 Feb 3 16:30 mysqlcluster-pxc-0-slow.log
-rw-r----- 1 mysql mysql 2 Feb 3 16:30 mysqlcluster-pxc-0.pid
-rw-rw---- 1 mysql mysql 4489 Jan 27 14:02 mysqld.post.processing.log
srwxrwxrwx 1 mysql mysql 0 Feb 3 16:30 mysqlx.sock
-rw------- 1 mysql mysql 2 Feb 3 16:30 mysqlx.sock.lock
srwxr-xr-x 1 mysql mysql 0 Feb 3 16:29 notify.sock
-rwxr-xr-x 1 daemon daemon 3852289 Feb 3 16:29 peer-list
drwxr-s--- 2 mysql mysql 8192 Feb 3 16:29 performance_schema
-rwxr-xr-x 1 daemon daemon 830 Feb 3 16:29 pmm-prerun.sh
-rw------- 1 mysql mysql 1680 Feb 3 16:29 private_key.pem
-rw-r--r-- 1 mysql mysql 452 Feb 3 16:29 public_key.pem
-rwxr-xr-x 1 daemon daemon 5471 Feb 3 16:29 pxc-configure-pxc.sh
-rwxr-xr-x 1 daemon daemon 25159 Feb 3 16:29 pxc-entrypoint.sh
-rwxr-xr-x 1 daemon daemon 1470 Feb 3 16:29 readiness-check.sh
-rw-rw---- 1 mysql mysql 0 Jan 30 01:16 sst_in_progress
drwxr-s--- 2 mysql mysql 28 Feb 3 16:29 sys
-rw-r----- 1 mysql mysql 16777216 Feb 3 16:39 undo_001
-rw-r----- 1 mysql mysql 16777216 Feb 3 16:39 undo_002
-rwxr-xr-x 1 daemon daemon 543 Feb 3 16:29 unsafe-bootstrap.sh
-rw-rw-r-- 1 mysql mysql 142 Feb 3 16:30 version_info
-rwxr-xr-x 1 daemon daemon 618 Feb 3 16:29 wsrep_cmd_notify_handler.sh
-rw-rw-r-- 1 mysql mysql 98344 Jan 27 14:14 wsrep_recovery_verbose_history.log
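For reference, the SST/backup-related logs visible in the same directory can be inspected directly, e.g.:
kubectl -n percona-operator exec mysqlcluster-pxc-0 -c pxc -- sh -c 'tail -n 50 /var/lib/mysql/innobackup.backup.log /var/lib/mysql/innobackup.move.log'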
Expected Result:
Expected that the data would still be present after the OOM kills.
Actual Result:
Tables and other data were deleted.
Additional Information:
How can I debug this and determine what caused the failure?
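For reference, commands that should help with the post-mortem (note that --previous only works while the prior container's logs are still retained on the node):
kubectl -n percona-operator logs mysqlcluster-pxc-0 -c pxc --previous
kubectl -n percona-operator logs deployment/percona-xtradb-cluster-operator
kubectl -n percona-operator get events --sort-by=.lastTimestamp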

