Percona XtraDB Node not able to start up - k8s


We are using the Percona Operator on a self-managed Kubernetes cluster to deploy XtraDB. The cluster consists of 3 nodes (pods) and was running just fine until it ran into a memory limitation and one pod got OOMKilled. The remaining two nodes are still fine. I have since increased the pxc memory limits and redeployed with the increased resources. Unfortunately, the pod was still not able to start up and again got OOMKilled. I came across One of the pxc pods is getting OOM Killed on startup (and I assume recovery) - #4 by mtye, and after deleting the galera.cache file, the pod finally starts.
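For reference, this is roughly the command I used to remove the file (namespace and pod names below are examples from my environment; galera.cache is recreated on the next start, sized by the gcache.size wsrep provider option):

```shell
# Example names -- adjust namespace/pod to your environment.
NS=pxc
POD=cluster1-pxc-0

# Remove the stale galera.cache so mysqld does not map the old ring buffer
# on startup. The command is echoed so the snippet is safe to copy;
# drop the leading "echo" to actually run it.
echo kubectl -n "$NS" exec "$POD" -c pxc -- rm -f /var/lib/mysql/galera.cache
```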

However, as expected, it now needs to rebuild, and I can see SST starting.

I can see data being copied from the donor node: df -h shows disk usage growing, and the pod logs confirm it as well.
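This is roughly how I watch the transfer (example names; /var/lib/mysql is the default datadir mount):

```shell
# Example names -- adjust to your environment.
NS=pxc
POD=cluster1-pxc-0

# Watch disk usage of the datadir grow while SST copies data in.
# Drop the leading "echo" to actually run the command.
echo kubectl -n "$NS" exec "$POD" -c pxc -- df -h /var/lib/mysql
```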

{"log":"2024-03-04T11:48:26.869657Z 0 [Note] [MY-000000] [WSREP-SST]  \r   joiner: => Rate:[ 121MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:08  Bytes: 80.7GiB \r   joiner: => Rate:[ 116MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:18  Bytes: 81.8GiB \r   joiner: => Rate:[ 117MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:28  Bytes: 83.0GiB \r   joiner: => Rate:[ 116MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:38  Bytes: 84.1GiB \r   joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:48  Bytes: 85.2GiB \r   joiner: => Rate:[ 115MiB/s] Avg:[ 113MiB/s] Elapsed:0:12:58  Bytes: 86.4GiB \r   joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:08  Bytes: 87.5GiB \r   joiner: => Rate:[ 113MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:18  Bytes: 88.6GiB \r   joiner: => Rate:[ 105MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:28  Bytes: 89.6GiB \r   joiner: => Rate:[96.6MiB/s] Avg:[ 113MiB/s] Elapsed:0:13:38  Bytes: 90.5GiB \r   joiner: => Rate:[90.7MiB/s] Avg:[ 112MiB/s] Elapsed:0:13:48  Bytes: 91.4GiB \r   joiner: => Rate:[87.2MiB/s] Avg:[ 112MiB/s] Elapsed:0:13:58  Bytes: 92.3GiB \r   joiner: => Rate:[85.0MiB/s] Avg:[ 112MiB/s] Elapsed:0:14:0\n","file":"/var/lib/mysql/mysqld-error.log"}

However, after about 17 minutes the pod is killed and the recovery has to start from the beginning, which ends up in an endless loop: SST runs, but after 17 minutes the pod is killed again. This is the error from the logs just before it is killed:

{"log":"2024-03-04T11:51:19.341323Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1216: Will never receive state. Need to abort.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341343Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341358Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:19.341481Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345186Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview (view_id(NON_PRIM,21fafc24-940d,50)\nmemb {\n\t21fafc24-940d,0\n\t}\njoined {\n\t}\nleft {\n\t}\npartitioned {\n\t371ae06b-b41e,0\n\t}\n)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345245Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345253Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node\nview ((empty))\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345368Z 0 [Note] [MY-000000] [Galera] gcomm: closed\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345383Z 0 [Note] [MY-000000] [Galera] mysqld: Terminated.\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345389Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345393Z 0 [Note] [MY-000000] [WSREP] Terminating SST process\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20Z UTC - mysqld got signal 11 ;\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.\nBuildID[sha1]=df9f6877fc91c9a71d439f27569eabdef408f622\nServer Version: 8.0.32-24.2 Percona XtraDB Cluster (GPL), Release rel24, Revision 2119e75, WSREP version, wsrep_26.1.4.3\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"Thread pointer: 0x0\nAttempting backtrace. You can use the following information to find out\nwhere mysqld died. If you see no messages after this, something went\nterribly wrong...\nstack_bottom = 0 thread_stack 0x100000\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345524Z 0 [Note] [MY-000000] [WSREP-SST] 8  Bytes: 93.1GiB \r   joiner: => Rate:[76.7MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:18  Bytes: 93.9GiB \r   joiner: => Rate:[77.1MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:28  Bytes: 94.6GiB \r   joiner: => Rate:[69.9MiB/s] Avg:[ 111MiB/s] Elapsed:0:14:38  Bytes: 95.3GiB \r   joiner: => Rate:[73.3MiB/s] Avg:[ 110MiB/s] Elapsed:0:14:48  Bytes: 96.0GiB \r   joiner: => Rate:[53.6MiB/s] Avg:[ 109MiB/s] Elapsed:0:15:00  Bytes: 96.6GiB \r   joiner: => Rate:[84.7KiB/s] Avg:[96.8MiB/s] Elapsed:0:17:02  Bytes: 96.6GiB \n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.345969Z 0 [ERROR] [MY-000000] [WSREP-SST] Removing /var/lib/mysql//sst-xb-tmpdir/xtrabackup_galera_info file due to signal\n/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2253a31]\n/usr/sbin/mysqld(print_fatal_signal(int)+0x39f) [0x1262d0f]\n/usr/sbin/mysqld(handle_fatal_signal+0xd8) [0x1262df8]\n/lib64/ [0x7f1994fe7ce0]\n/lib64/ [0x7f1993369ee1]\n/usr/lib64/galera4/ [0x7f19866c7935]\n/usr/lib64/galera4/ [0x7f198685e20b]\n/usr/lib64/galera4/ [0x7f1986865f3d]\n/lib64/ [0x7f1994fdd1cf]\n/lib64/ [0x7f1993381dd3]\nYou may download the Percona XtraDB Cluster operations manual by visiting\n You may find information\nin the manual which will help you identify the cause of the crash.\nWriting a core file using lib coredumper\nPATH: (null)\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.347545Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:143\n","file":"/var/lib/mysql/mysqld-error.log"}
{"log":"2024-03-04T11:51:20.671040Z 0 [ERROR] [MY-000000] [WSREP] Process was aborted.\n","file":"/var/lib/mysql/mysqld-error.log"}

In total there is about 410 GB to copy, but since the pod is always restarted, the transfer never gets to complete.

Any advice or recommended actions to take here to allow the node to recover?

From what I can tell, the pod is not getting killed by liveness or readiness probes, as they are passing. The pod also shows an exit code of 2 in Kubernetes.

Last State:     Terminated
      Reason:       Error
      Exit Code:    2
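The termination state above can also be queried directly with jsonpath (example names; the filter assumes the container is named pxc, as the operator names it by default):

```shell
# Example names -- adjust to your environment.
NS=pxc
POD=cluster1-pxc-0

# Print the last termination record of the pxc container.
# Drop the leading "echo" to actually run the command.
echo kubectl -n "$NS" get pod "$POD" -o "jsonpath={.status.containerStatuses[?(@.name=='pxc')].lastState.terminated}"
```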


Did you also try the force SST option?

kubectl -n <your namespace> delete pod <the pod to resynchronize>

The node will then come back and resynchronize via SST automatically.
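If deleting the pod alone does not help (the PVC, including the partially transferred data, survives the restart), a full clean SST can be forced by deleting the pod's PVC together with the pod, so the operator re-provisions the volume. This is a sketch assuming the operator's default datadir-&lt;pod&gt; PVC naming; verify the actual name with kubectl get pvc before deleting anything:

```shell
# Example names -- verify with "kubectl get pvc -n <namespace>" first.
NS=pxc
POD=cluster1-pxc-0
PVC=datadir-$POD

# Commands are echoed for safety; drop the leading "echo" to run them.
# The PVC is only removed once the pod releases it, then recreated empty.
echo kubectl -n "$NS" delete pvc "$PVC"
echo kubectl -n "$NS" delete pod "$POD"
```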

If you have not tried this yet, please do, and let us know whether it works.

Meanwhile, can you please gather the details below for review?

kubectl logs cluster1-pxc-2 --prefix --all-containers > pxc2.log
kubectl describe pod cluster1-pxc-2 > pxc2_describe.log

Note: please change the pod and namespace details to match your environment.

If feasible, please also provide the innobackup.backup.log from the donor and the innobackup.prepare.log/innobackup.move.log from the joiner node (they are inside the containers).

I would also like to review your deployment [cr.yaml] file. Please provide this as well.
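If the original cr.yaml is not at hand, the applied custom resource can be exported from the cluster (pxc is the short name for the perconaxtradbclusters resource; the cluster name cluster1 is an example):

```shell
# Example names -- adjust to your environment.
NS=pxc
CLUSTER=cluster1

# Export the currently applied custom resource.
# Drop the leading "echo" to actually run the command.
echo kubectl -n "$NS" get pxc "$CLUSTER" -o yaml
```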