Hi,
We are using the Percona Operator on a self-managed Kubernetes cluster to deploy XtraDB. The cluster consists of 3 nodes (pods) and was running just fine until it ran into a memory limitation and one pod got OOMKilled. The remaining two nodes are still fine. I have since increased the pxc memory limits and redeployed with the increased resources. Unfortunately, the pod was still not able to start up and got OOMKilled again. I came across One of the pxc pods is getting OOM Killed on startup (and I assume recovery) - #4 by mtye and after deleting the galera.cache file, the pod finally starts.
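For reference, the memory bump was applied through the PerconaXtraDBCluster custom resource, roughly like the sketch below (the cluster name and sizes are placeholders, not our exact values):

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: cluster1             # placeholder cluster name
spec:
  pxc:
    size: 3
    resources:
      requests:
        memory: 8G           # placeholder values, raised from the previous settings
        cpu: "2"
      limits:
        memory: 16G
        cpu: "4"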
However, as expected, it now needs to rebuild its data, and I can see SST starting.
I can see data being copied from the donor: looking at the df -h output, disk usage is growing, and the pod logs show the transfer progress as well.
{"log":"2024-03-04T11:48:26.869657Z 0 [Note] [MY-000000] [WSREP-SST] \r joiner: => Rate:[ 121MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:08 Bytes: 80.7GiB \r joiner: => Rate:[ 116MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:18 Bytes: 81.8GiB \r joiner: => Rate:[ 117MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:28 Bytes: 83.0GiB \r joiner: => Rate:[ 116MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:38 Bytes: 84.1GiB \r joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:48 Bytes: 85.2GiB \r joiner: => Rate:[ 115MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:58 Bytes: 86.4GiB \r joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:08 Bytes: 87.5GiB \r joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:18 Bytes: 88.6GiB \r joiner: => Rate:[ 105MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:28 Bytes: 89.6GiB \r joiner: => Rate:[96.6MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:38 Bytes: 90.5GiB \r joiner: => Rate:[90.7MiB/s] Avg:[ 112MiB/s] Elapsed:0:13:48 Bytes: 91.4GiB \r joiner: => Rate:[87.2MiB/s] Avg:[ 112MiB/s] Elapsed:0:13:58 Bytes: 92.3GiB \r joiner: => Rate:[85.0MiB/s] Avg:[ 112MiB/s] Elapsed:0:14:0\n","file":"/var/lib/mysql/mysqld-error.log"}
However, after about 17 minutes the pod is killed and the recovery has to start from the beginning again, which ends up in an endless loop: SST runs, but after roughly 17 minutes the pod is killed. This is the error from the logs just before it is killed:
{"log":"2024-03-04T11:51:19.341323Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1216: Will never receive state. Need to abort.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341343Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341358Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341481Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345186Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview (view_id(NON_PRIM,21fafc24-940d,50)\nmemb {\n\t21fafc24-940d,0\n\t}\njoined {\n\t}\nleft {\n\t}\npartitioned {\n\t371ae06b-b41e,0\n\t}\n)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345245Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345253Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345368Z 0 [Note] [MY-000000] [Galera] gcomm: closed\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345383Z 0 [Note] [MY-000000] [Galera] mysqld: Terminated.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345389Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345393Z 0 [Note] [MY-000000] [WSREP] Terminating SST process\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20Z UTC - mysqld got signal 11 ;\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.\nBuildID[sha1]=df9f6877fc91c9a71d439f27569eabdef408f622\nServer Version: 8.0.32-24.2 Percona XtraDB Cluster (GPL), Release rel24, Revision 2119e75, WSREP version 26.1.4.3, wsrep_26.1.4.3\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"Thread pointer: 0x0\nAttempting backtrace. You can use the following information to find out\nwhere mysqld died. If you see no messages after this, something went\nterribly wrong...\nstack_bottom = 0 thread_stack 0x100000\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345524Z 0 [Note] [MY-000000] [WSREP-SST] 8 Bytes: 93.1GiB \r joiner: => Rate:[76.7MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:18 Bytes: 93.9GiB \r joiner: => Rate:[77.1MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:28 Bytes: 94.6GiB \r joiner: => Rate:[69.9MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:38 Bytes: 95.3GiB \r joiner: => Rate:[73.3MiB/s] Avg:[ 110MiB/s] Elapsed:0:14:48 Bytes: 96.0GiB \r joiner: => Rate:[53.6MiB/s] Avg:[ 109MiB/s] Elapsed:0:15:00 Bytes: 96.6GiB \r joiner: => Rate:[84.7KiB/s] Avg:[96.8MiB/s] Elapsed:0:17:02 Bytes: 96.6GiB \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345969Z 0 [ERROR] [MY-000000] [WSREP-SST] Removing /var/lib/mysql//sst-xb-tmpdir/xtrabackup_galera_info file due to signal\n/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2253a31]\n/usr/sbin/mysqld(print_fatal_signal(int)+0x39f) [0x1262d0f]\n/usr/sbin/mysqld(handle_fatal_signal+0xd8) [0x1262df8]\n/lib64/libpthread.so.0(+0x12ce0) [0x7f1994fe7ce0]\n/lib64/libc.so.6(abort+0x203) [0x7f1993369ee1]\n/usr/lib64/galera4/libgalera_smm.so(+0x17935) [0x7f19866c7935]\n/usr/lib64/galera4/libgalera_smm.so(+0x1ae20b) [0x7f198685e20b]\n/usr/lib64/galera4/libgalera_smm.so(+0x1b5f3d) [0x7f1986865f3d]\n/lib64/libpthread.so.0(+0x81cf) [0x7f1994fdd1cf]\n/lib64/libc.so.6(clone+0x43) [0x7f1993381dd3]\nYou may download the Percona XtraDB Cluster operations manual by visiting\nhttp://www.percona.com/software/percona-xtradb-cluster/. You may find information\nin the manual which will help you identify the cause of the crash.\nWriting a core file using lib coredumper\nPATH: (null)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.347545Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:143\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.671040Z 0 [ERROR] [MY-000000] [WSREP] Process was aborted.\n","file":"/var/lib/mysql/mysqld-error.log"}
In total, it has about 410GB to copy; however, since the pod keeps getting restarted, the SST never gets to complete.
Any advice or recommended actions to take here to allow the node to recover?
From what I can tell, the pod is not getting killed due to the liveness or readiness probes, as they are passing. The container also shows an exit code of 2 in Kubernetes:
Last State: Terminated
Reason: Error
Exit Code: 2
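(The Last State above comes from a kubectl describe of the crashing pxc pod, along the lines of the following; the pod name is a placeholder:)

kubectl describe pod cluster1-pxc-2    # Exit Code shown under the pxc container's Last State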