MySQL cluster goes into a full crash every 5-10 minutes

Hi everyone,

I’ve been testing Percona Everest with Percona XtraDB Cluster on our private (non-production) Kubernetes (k3s) environment, hosted on a large vSphere cluster. The goal is to validate and prepare deployment workflows before pushing anything to production.

Initially, I ran into a number of issues: SSO problems, missing annotation support on services, and a few quirks in the setup flow. I’ve submitted issues and PRs where appropriate and kept going, hoping things would improve.

Unfortunately, over the past few days, things have gone from shaky to downright unusable. The MySQL cluster now crashes repeatedly, seemingly at random. No data access, no connections, just a complete full crash. For example, just now it restarted itself 3 times within 30 minutes while I was running very basic queries and inserts. It acts as if it just gives up and shuts down completely.

This is happening on a relatively clean k3s install, and I’m struggling to understand why such basic functionality is so unstable. I’ve tried other MySQL operators/setups on Kubernetes, and honestly, it’s surprising how far many of them feel from being production-grade.

If needed, I can attach relevant logs; we have full monitoring in place, so crash traces and events are all there. But at this point, it’s becoming extremely frustrating, and I’m not sure what’s breaking or why.

Has anyone else experienced similar issues with Everest + XtraDB? Any insights would be appreciated, especially if you’ve managed to make it stable for production environments.

Thanks,
Aurimas

Console log of 1 PXC Pod

gist:3d30311237dc4a91521f44fe81e2fea8 · GitHub

@Aurimas_Niekis can you check if it’s the PMM sidecar container that’s getting OOM killed as reported here?
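
For example, the last termination reason of each container in a PXC pod will show OOMKilled if that’s what happened (replace the placeholders with your DB name and namespace):

# Print each container's name and the reason it last terminated
kubectl get pod <your-db-name>-pxc-0 -n <your-namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Or scan the pod description for OOM-related lines
kubectl describe pod <your-db-name>-pxc-0 -n <your-namespace> | grep -i -B 2 -A 2 oom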

Hi @Aurimas_Niekis ,

In addition to what Diogo asked, can you please also share the following information:

  1. The configuration of your DatabaseCluster
    You can get this by running:
kubectl get databasecluster -n <your-namespace> <your-db-name> -oyaml
  2. An estimate of the total data size on your database cluster (example commands for this and item 5 are sketched after this list)
  3. Details about your database usage patterns – approx. reads/writes per second, number of concurrent users, etc.
  4. Whether you have a monitoring instance configured for your DB
  5. The number and size of your Kubernetes nodes
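
For items 2 and 5, something along these lines should work (assuming the DB container is named pxc, as it is elsewhere in this thread; the root password comes from your cluster’s secrets):

# Rough data size per schema (run from inside the pxc container)
kubectl exec -it -n <your-namespace> <your-db-name>-pxc-0 -c pxc -- \
  mysql -uroot -p -e "SELECT table_schema, ROUND(SUM(data_length + index_length)/1024/1024, 2) AS size_mb FROM information_schema.tables GROUP BY table_schema;"

# Node count and capacity
kubectl get nodes -o wide
kubectl describe nodes | grep -A 5 'Capacity:'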

Hi @Diogo_Recharte,

Yes, I actually ran into issues with the PMM sidecar on day one. It was causing full node crashes (not just the pod) due to OOM. Because of that, I disabled monitoring entirely right away.

So currently, PMM is not running in the cluster.
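
For completeness, this is how the container list of one of the PXC pods can be checked to confirm the sidecar is gone (pod name and namespace as used later in this thread):

kubectl get pod quee-pxc-0 -n everest -o jsonpath='{.spec.containers[*].name}'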

Hi @Mayank_Shah,

  1. The cluster was created using the Everest UI with the default “mid-sized” configuration, 3 MySQL nodes.
  2. The data volume is minimal, just a single DB with 2 tables and around 100 rows total. So basically an empty cluster.
  3. Usage pattern is idle most of the time. Today I was testing some simple queries and inserts, and that’s when the crashing started happening more frequently.
  4. As I mentioned earlier, I had to disable PMM monitoring on day one: the PMM sidecar was causing OOM kills not just at the pod level but was crashing the entire node, which was a serious issue. So currently no monitoring is enabled.
  5. The cluster runs on a 6-node k3s setup (vSphere), each node with 16 CPUs and 16GB RAM.

Thanks @Aurimas_Niekis ,

When you say that the MySQL cluster crashes, do you mean that the Pods go into the CrashLoopBackOff state, or do they just restart (i.e., go into the Terminating state and new pods are created)?
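
For example, something like this makes the distinction visible (adjust the placeholders to your DB name and namespace):

# CrashLoopBackOff shows up in the STATUS column; a plain restart only bumps the RESTARTS counter
kubectl get pods -n <your-namespace> -w | grep <your-db-name>-pxc

# Exit code, reason, and restart count for one of the pods
kubectl describe pod <your-db-name>-pxc-0 -n <your-namespace> | grep -E -A 5 'Last State|Restart Count'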

Since I don’t have a VMware vSphere environment handy, I tried to reproduce this on a k3s installation on top of VMs created using GCP Compute Engine. Here’s my setup:

  • 6 node cluster (GCP Compute Engine) - each 4 vCPU, 16GB memory, running x86/64 Ubuntu
  • Completely disabled firewall
  • Percona Everest 1.6.0
  • Created a single PXC (MySQL) instance from the UI - 3 nodes, and the default “medium” settings
  • Ran a basic load test (20 concurrent clients, avg 50 queries per client); a sketch of a similar test follows this list
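
If you want to generate similar load on your side, mysqlslap can do it; this is only an illustrative sketch (the -haproxy service name and the credentials are assumptions, so adjust them to your setup):

# ~20 concurrent clients running auto-generated queries against the HAProxy endpoint
mysqlslap --host=<your-db-name>-haproxy.<your-namespace>.svc --user=root --password \
  --concurrency=20 --iterations=5 \
  --auto-generate-sql --auto-generate-sql-load-type=mixed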

I cannot see any crashes or restarts of the pods. So far I’ve had it running for a little over an hour. I’m not sure if this issue is specific to your VMware environment, but when I get the chance, I will try to set this up on VMware nodes.

Can you help me by providing some more information:

  • Logs from the percona-xtradb-cluster-operator
  • Logs from the DB pods before the crash takes place (maybe you can use the kubectl logs --previous command?)
  • Any events from k8s that might suggest a failing liveness/readiness probe? (example commands for gathering all three are sketched below)
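
Something along these lines should collect all three (the operator deployment name and namespace depend on how Everest installed it, so adjust as needed):

# Operator logs
kubectl logs deployment/percona-xtradb-cluster-operator -n <operator-namespace> --tail=500

# Previous (pre-crash) logs from a DB pod
kubectl logs <your-db-name>-pxc-0 -c pxc -n <your-namespace> --previous

# Recent events, newest last
kubectl get events -n <your-namespace> --sort-by=.metadata.creationTimestamp | tail -n 50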

I mean the pods exit and everything stops and tries to recover:

#####################################################
FULL_PXC_CLUSTER_CRASH:quee-pxc-1.queepxc.everest.svc.cluster.local
#####################################################
  1. Explore-logs-logs-data-2025-05-20 17_29_32.csv · GitHub
  2. I posted logs from a single pod, from before the crash until the full crash
  3. Yes, IIRC the HAProxy liveness probe starts failing, then the PXC pods start failing

The PXC logs seem to suggest that the PXC node is not able to join the cluster. This could happen if the inter-pod network connectivity is broken. Can you please check this? Perhaps you have some firewall rules, a NetworkPolicy, or some DNS configuration that might be blocking it? Each pod should be able to access the following ports on the other pods: 3306, 4444, 4567, 4568, 33060, 33062
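
A quick way to see whether any NetworkPolicies exist at all, and that cluster DNS is healthy:

kubectl get networkpolicies -A
kubectl get pods -n kube-system | grep -i coredns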

Thanks for the response. Just to clarify, everything else in the cluster is running fine, and this cluster was working properly earlier as well. No changes were made to the configuration, network policies, or DNS settings before the issues started. Other workloads are still functioning normally…

Here are ~4k lines from one of the pods, captured just now:

Thanks for confirming @Aurimas_Niekis

Just so that we can actually rule out pod connectivity, can you please help us validate this by running the commands below on your cluster?

  1. Get the Pod IP address of the -pxc-0 pod
DBNAME=<your DB name>
DBNAMESPACE=<your DB namespace>
TARGET_IP=$(kubectl get pod -n $DBNAMESPACE $DBNAME-pxc-0 -ojsonpath='{.status.podIP}')
  2. Attach this debug container to the -pxc-1 pod:
kubectl debug -it --container=netshoot --image=nicolaka/netshoot --target=pxc $DBNAME-pxc-1 -n $DBNAMESPACE --env TARGET_IP=$TARGET_IP
  3. In the interactive shell that opens, run the following commands:
ping $TARGET_IP

nc -zv $TARGET_IP 3306
nc -zv $TARGET_IP 33060

If these commands succeed, then you don’t have an issue with the pod connectivity …
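
If you also want to cover the Galera/SST ports listed earlier, the same check can be extended from inside the debug shell:

# Still inside the netshoot shell: SST (4444), replication (4567), IST (4568) and admin (33062) ports
for port in 4444 4567 4568 33062; do nc -zv $TARGET_IP $port; done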

> DBNAME=quee
> DBNAMESPACE=everest
> TARGET_IP=$(kubectl get pod -n $DBNAMESPACE $DBNAME-pxc-0 -ojsonpath='{.status.podIP}')
> kubectl debug -it --container=netshoot --image=nicolaka/netshoot --target=pxc $DBNAME-pxc-1 -n $DBNAMESPACE --env TARGET_IP=$TARGET_IP
Targeting container "pxc". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
If you don't see a command prompt, try pressing enter.
                    dP            dP                           dP   
                    88            88                           88   
88d888b. .d8888b. d8888P .d8888b. 88d888b. .d8888b. .d8888b. d8888P 
88'  `88 88ooood8   88   Y8ooooo. 88'  `88 88'  `88 88'  `88   88   
88    88 88.  ...   88         88 88    88 88.  .88 88.  .88   88   
dP    dP `88888P'   dP   `88888P' dP    dP `88888P' `88888P'   dP   
                                                                    
Welcome to Netshoot! (github.com/nicolaka/netshoot)
Version: 0.13

                                         



 quee-pxc-1  ~  ping $TARGET_IP

PING 10.42.1.39 (10.42.1.39) 56(84) bytes of data.
64 bytes from 10.42.1.39: icmp_seq=1 ttl=62 time=0.238 ms
64 bytes from 10.42.1.39: icmp_seq=2 ttl=62 time=0.173 ms
64 bytes from 10.42.1.39: icmp_seq=3 ttl=62 time=0.199 ms
^C
--- 10.42.1.39 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2027ms
rtt min/avg/max/mdev = 0.173/0.203/0.238/0.026 ms

 quee-pxc-1  ~  nc -zv $TARGET_IP 3306
Connection to 10.42.1.39 3306 port [tcp/mysql] succeeded!

 quee-pxc-1  ~  nc -zv $TARGET_IP 33060

Connection to 10.42.1.39 33060 port [tcp/*] succeeded!

 quee-pxc-1  ~  

Unfortunately, over the past few days, things have gone from shaky to downright unusable. The MySQL cluster now crashes repeatedly, seemingly at random. No data access, no connections, just a complete full crash. For example, just now it restarted itself 3 times within 30 minutes while I was running very basic queries and inserts. It acts as if it just gives up and shuts down completely.

Do you see any MySQL errors when you try to insert data? We need to understand why the pods are being restarted altogether, so please also check the events and so on. From what I can see, you had a full crash and the cluster was able to recover, but after that there was a new crash.
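
For example, the events tied to one of the PXC pods should show whether a failed probe or something else triggered the restart (pod name and namespace taken from earlier in this thread):

kubectl get events -n everest --field-selector involvedObject.name=quee-pxc-1 \
  --sort-by=.metadata.creationTimestamp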

Thanks for the follow-up. We didn’t see any MySQL errors during inserts, only timeouts and “connection refused” messages. We’ve since switched to a basic MySQL instance on bare metal, as the current setup has been quite unstable for us.

The cluster crashes either immediately after a restart or 10-15 minutes later, even with no active usage. No one is using it right now, and it still crashed about 30 minutes ago. From the events, it looks like the liveness probes failed, which triggered the pod restarts…

From what I can see, it looks like either the MySQL server is freezing and stops responding to connections, or something else is causing it to hang. The liveness probe script seems to be timing out consistently. Based on the events, I’m seeing messages like:

Liveness probe failed: command timed out: "/var/lib/mysql/liveness-check.sh" timed out after 7m30s

It seems that the liveness probe can’t connect to MySQL at all, while at the same time you don’t see any MySQL errors. Maybe something is wrong with the storage… As you can see from percona-xtradb-cluster-operator/build/liveness-check.sh at main · percona/percona-xtradb-cluster-operator · GitHub, you can manually create a ‘/var/lib/mysql/sleep-forever’ file to disable the liveness and readiness probes, so that you can connect and do some debugging manually.
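
A rough sketch of that debug flow (pod name and namespace taken from this thread; the root credentials come from your cluster secrets):

# Disable the liveness/readiness probes for this pod
kubectl exec -it quee-pxc-1 -c pxc -n everest -- touch /var/lib/mysql/sleep-forever

# Then connect and poke around manually, e.g. time the same check the probe runs
kubectl exec -it quee-pxc-1 -c pxc -n everest -- bash -c 'time /var/lib/mysql/liveness-check.sh'
kubectl exec -it quee-pxc-1 -c pxc -n everest -- mysql -uroot -p

# Remove the file again when done, so the probes are active again
kubectl exec -it quee-pxc-1 -c pxc -n everest -- rm /var/lib/mysql/sleep-forever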