MySQL cluster goes into a full crash every 5-10 minutes

Hi everyone,

I’ve been testing Percona Everest with Percona XtraDB Cluster on our private (non-production) Kubernetes (k3s) environment, hosted on a large vSphere cluster. The goal is to validate and prepare deployment workflows before pushing anything to production.

Initially, I ran into a number of issues: SSO problems, missing annotation support on services, and a few quirks in the setup flow. I’ve submitted issues and PRs where appropriate and kept going, hoping things would improve.

Unfortunately, over the past few days, things have gone from shaky to downright unusable. The MySQL cluster now crashes repeatedly, seemingly at random. No data access, no connections, just a complete, full cluster crash. For example, just now, it restarted itself 3 times within 30 minutes while I was running very basic queries and inserts. It acts as if it just gives up and shuts down completely.

This is happening on a relatively clean k3s install, and I’m struggling to understand why such basic functionality is so unstable. I’ve tried other MySQL operators/setups on Kubernetes, and honestly, it’s surprising how far many of them feel from being production-grade.

If needed, I can attach relevant logs; we have full monitoring in place, so crash traces and events are all there. But at this point, it’s becoming extremely frustrating, and I’m not sure what’s breaking or why.

Has anyone else experienced similar issues with Everest + XtraDB? Any insights would be appreciated, especially if you’ve managed to make it stable for production environments.

Thanks,
Aurimas

Console log of one PXC pod:

gist:3d30311237dc4a91521f44fe81e2fea8 · GitHub

@Aurimas_Niekis can you check if it’s the PMM sidecar container that’s getting OOM killed as reported here?
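
If it helps, one way to check (a generic sketch; the sidecar container name is assumed to be pmm-client and may differ in your setup) is to look at the last termination reason of the containers in a DB pod:

kubectl describe pod <your-db-name>-pxc-0 -n <your-namespace> | grep -B2 -A5 'Last State'
kubectl get pod <your-db-name>-pxc-0 -n <your-namespace> -o jsonpath='{.status.containerStatuses[?(@.name=="pmm-client")].lastState.terminated.reason}'

An OOM-killed container reports OOMKilled there.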

Hi @Aurimas_Niekis ,

In addition to what Diogo asked, can you please also share the following information:

  1. The configuration of your DatabaseCluster
    You can get this by running:
kubectl get databasecluster -n <your-namespace> <your-db-name> -oyaml
  2. An estimate of the total data size on your database cluster
  3. Details about your database usage patterns – approx. reads/writes per second, number of concurrent users, etc.
  4. Whether you have a Monitoring instance configured for your DB
  5. The number and size of your Kubernetes nodes (a command sketch follows below)
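
For point 5, a generic way to capture the node count and capacity (just a sketch, nothing Everest-specific):

kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory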

Hi @Diogo_Recharte,

Yes, I actually ran into issues with the PMM sidecar on day one. It was causing full node crashes (not just pod crashes) due to OOM. Because of that, I disabled monitoring entirely right away.

So currently, PMM is not running in the cluster.

Hi @Mayank_Shah,

  1. The cluster was created using the Everest UI with the default “mid-sized” configuration, 3 MySQL nodes.
  2. The data volume is minimal, just a single DB with 2 tables and around 100 rows total. So basically an empty cluster.
  3. Usage pattern is idle most of the time. Today I was testing some simple queries and inserts, and that’s when the crashing started happening more frequently.
  4. As I mentioned earlier, I had to disable PMM monitoring on day one; the PMM sidecar was causing OOM kills not just at the pod level but was crashing the entire node, which was a serious issue. So currently no monitoring is enabled.
  5. The cluster runs on a 6-node k3s setup (vSphere), each node with 16 CPUs and 16GB RAM.

Thanks @Aurimas_Niekis ,

When you say that the MySQL cluster crashes, do you mean that the Pods go into the CrashLoopBackOff state, or do they just restart (i.e., go into the Terminating state and new pods are created)?
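
In case it helps to tell the two apart, watching the pods makes the distinction visible (a generic sketch; adjust the label selector to whatever labels your DB pods carry):

kubectl get pods -n <your-namespace> -l app.kubernetes.io/instance=<your-db-name> -w

A CrashLoopBackOff shows up in the STATUS column with the restart count climbing, while recreation shows pods going into Terminating and new ones appearing.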

Since I don’t have a VMware vSphere environment handy, I tried to reproduce this on a k3s installation on top of VMs created using GCP Compute Engine. Here’s my setup:

  • 6-node cluster (GCP Compute Engine) - each with 4 vCPUs and 16GB memory, running x86_64 Ubuntu
  • Completely disabled firewall
  • Percona Everest 1.6.0
  • Created a single PXC (MySQL) instance from the UI - 3 nodes, and the default “medium” settings
  • Ran a basic load test (20 concurrent clients, avg 50 queries per client; a comparable command sketch is below)
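
For reference, a roughly comparable load can be generated with mysqlslap (an illustrative sketch only; the host and credentials are placeholders, and the actual test may have used a different tool):

mysqlslap --host=<haproxy-service> --user=root --password=<root-password> --concurrency=20 --number-of-queries=1000 --auto-generate-sql --verbose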

I cannot see any crashes or restarts of the pods. So far I’ve had it running for a little over an hour. I’m not sure if this issue is specific to your VMware environment, but when I get the chance, I will try to set this up on VMware nodes.

Can you help me by providing some more information (a few example commands are sketched after the list):

  • Logs from the percona-xtradb-cluster-operator
  • Logs from the DB pods before the crash takes place (maybe you can use the kubectl logs --previous command?)
  • Any events from k8s that might suggest a failing liveness/readiness probe?
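
For example (a sketch with placeholder names; the operator deployment name and the pxc container name may differ slightly in your install):

kubectl logs -n <operator-namespace> deploy/percona-xtradb-cluster-operator --tail=500
kubectl logs -n <your-namespace> <your-db-name>-pxc-0 -c pxc --previous
kubectl get events -n <your-namespace> --sort-by=.lastTimestamp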

I mean the pods exit and everything stops and tries to recover:

#####################################################
FULL_PXC_CLUSTER_CRASH:quee-pxc-1.queepxc.everest.svc.cluster.local
#####################################################
  1. Explore-logs-logs-data-2025-05-20 17_29_32.csv · GitHub
  2. I posted logs from a single pod from before the crash until the full crash
  3. Yes, IIRC the HAProxy liveness probe starts failing first, then the PXC pods start failing

The PXC logs seem to suggest that the PXC node is not able to join the cluster. This could happen if the inter-pod network connectivity is broken. Can you please check this? Perhaps you have some firewall rules, a NetworkPolicy, or some DNS configuration that might be blocking it? Each pod should be able to reach the following ports on the other pods: 3306, 4444, 4567, 4568, 33060, and 33062.
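
For instance, something along these lines could quickly rule out NetworkPolicies and in-cluster DNS (placeholder names; this assumes the operator’s usual <cluster>-pxc headless service naming):

kubectl get networkpolicy -A
kubectl run -n <your-namespace> -it --rm dns-test --image=busybox:1.36 --restart=Never -- nslookup <your-db-name>-pxc-0.<your-db-name>-pxc.<your-namespace>.svc.cluster.local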

Thanks for the response. Just to clarify, everything else in the cluster is running fine, and this cluster was working properly earlier as well. No changes were made to the configuration, network policies, or DNS settings before the issues started. Other workloads are still functioning normally…

Here are ~4k lines from one of the pods, from just now:

Thanks for confirming @Aurimas_Niekis

Just so that we can actually rule out pod connectivity, can you please help us validate this by running the commands below on your cluster?

  1. Get the Pod IP address of the -pxc-0 pod
DBNAME=<your DB name>
DBNAMESPACE=<your DB namespace>
TARGET_IP=$(kubectl get pod -n $DBNAMESPACE $DBNAME-pxc-0 -ojsonpath='{.status.podIP}')
  2. Attach this debug container to -pxc-1 pod:
kubectl debug -it --container=netshoot --image=nicolaka/netshoot --target=pxc $DBNAME-pxc-1 -n $DBNAMESPACE --env TARGET_IP=$TARGET_IP
  3. In the interactive shell that opens, run the following commands:
ping $TARGET_IP

nc -zv $TARGET_IP 3306
nc -zv $TARGET_IP 33060

If these commands succeed, then you don’t have an issue with the pod connectivity …
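
To be thorough, the same netshoot session can also probe the Galera-related ports mentioned earlier:

nc -zv $TARGET_IP 4444   # SST
nc -zv $TARGET_IP 4567   # Galera replication traffic
nc -zv $TARGET_IP 4568   # IST
nc -zv $TARGET_IP 33062  # MySQL admin port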