MySQL cluster goes into a full crash every 5-10 minutes

Hi everyone,

I’ve been testing Percona Everest with Percona XtraDB Cluster on our private (non-production) Kubernetes (k3s) environment, hosted on a large vSphere cluster. The goal is to validate and prepare deployment workflows before pushing anything to production.

Initially, I ran into a number of issues: SSO problems, missing annotation support on services, and a few quirks in the setup flow. I’ve submitted issues and PRs where appropriate and kept going, hoping things would improve.

Unfortunately, over the past few days, things have gone from shaky to downright unusable. The MySQL cluster now crashes repeatedly, seemingly at random. No data access, no connections, just a complete full crash. For example, just now it restarted itself 3 times within 30 minutes while I was running very basic queries and inserts. It acts as if it just gives up and shuts down completely.

This is happening on a relatively clean k3s install, and I’m struggling to understand why such basic functionality is so unstable. I’ve tried other MySQL operators/setups on Kubernetes, and honestly, it’s surprising how far many of them feel from being production-grade.

If needed, I can attach relevant logs; we have full monitoring in place, so crash traces and events are all there. But at this point, it’s becoming extremely frustrating, and I’m not sure what’s breaking or why.

Has anyone else experienced similar issues with Everest + XtraDB? Any insights would be appreciated, especially if you’ve managed to make it stable for production environments.

Thanks,
Aurimas

Console log of 1 PXC Pod

gist:3d30311237dc4a91521f44fe81e2fea8 · GitHub

@Aurimas_Niekis can you check if it’s the PMM sidecar container that’s getting OOM killed as reported here?
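
For example, the last termination reason of each container in a PXC pod will show OOMKilled if that’s what happened (replace the placeholders with your DB name and namespace):

# Print each container's name and the reason it last terminated
kubectl get pod <your-db-name>-pxc-0 -n <your-namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Or scan the pod description for OOM-related lines
kubectl describe pod <your-db-name>-pxc-0 -n <your-namespace> | grep -i -B 2 -A 2 oom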

Hi @Aurimas_Niekis ,

In addition to what Diogo asked, can you please also share the following information:

  1. The configuration of your DatabaseCluster
    You can get this by running:
kubectl get databasecluster -n <your-namespace> <your-db-name> -oyaml
  2. An estimate of the total data size on your database cluster (example commands for this and item 5 are sketched after this list)
  3. Details about your database usage patterns – approx. reads/writes per second, number of concurrent users, etc.
  4. Whether you have a monitoring instance configured for your DB
  5. The number and size of your Kubernetes nodes
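
For items 2 and 5, something along these lines should work (assuming the DB container is named pxc, as it is elsewhere in this thread; the root password comes from your cluster’s secrets):

# Rough data size per schema (run from inside the pxc container)
kubectl exec -it -n <your-namespace> <your-db-name>-pxc-0 -c pxc -- \
  mysql -uroot -p -e "SELECT table_schema, ROUND(SUM(data_length + index_length)/1024/1024, 2) AS size_mb FROM information_schema.tables GROUP BY table_schema;"

# Node count and capacity
kubectl get nodes -o wide
kubectl describe nodes | grep -A 5 'Capacity:'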

Hi @Diogo_Recharte,

Yes, I actually ran into issues with the PMM sidecar on day one. It was causing full node crashes (not just the pod) due to OOM. Because of that, I disabled monitoring entirely right away.

So currently, PMM is not running in the cluster.
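
For completeness, this is how the container list of one of the PXC pods can be checked to confirm the sidecar is gone (pod name and namespace as used later in this thread):

kubectl get pod quee-pxc-0 -n everest -o jsonpath='{.spec.containers[*].name}'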

Hi @Mayank_Shah,

  1. The cluster was created using the Everest UI with the default “mid-sized” configuration, 3 MySQL nodes.
  2. The data volume is minimal, just a single DB with 2 tables and around 100 rows total. So basically an empty cluster.
  3. Usage pattern is idle most of the time. Today I was testing some simple queries and inserts, and that’s when the crashing started happening more frequently.
  4. As I mentioned earlier, I had to disable PMM monitoring on day one: the PMM sidecar was causing OOM kills not just at the pod level but was crashing the entire node, which was a serious issue. So currently no monitoring is enabled.
  5. The cluster runs on a 6-node k3s setup (vSphere), each node with 16 CPUs and 16GB RAM.

Thanks @Aurimas_Niekis ,

When you say that the MySQL cluster crashes, do you mean that the Pods go into the CrashLoopBackOff state, or do they just restart (i.e., go into the Terminating state and new pods are created)?
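
For example, something like this makes the distinction visible (adjust the placeholders to your DB name and namespace):

# CrashLoopBackOff shows up in the STATUS column; a plain restart only bumps the RESTARTS counter
kubectl get pods -n <your-namespace> -w | grep <your-db-name>-pxc

# Exit code, reason, and restart count for one of the pods
kubectl describe pod <your-db-name>-pxc-0 -n <your-namespace> | grep -E -A 5 'Last State|Restart Count'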

Since I don’t have a VMware vSphere environment handy, I tried to reproduce this on a k3s installation on top of VMs created using GCP Compute Engine. Here’s my setup:

  • 6 node cluster (GCP Compute Engine) - each 4 vCPU, 16GB memory, running x86/64 Ubuntu
  • Completely disabled firewall
  • Percona Everest 1.6.0
  • Created a single PXC (MySQL) instance from the UI - 3 nodes, and the default “medium” settings
  • Ran a basic load test (20 concurrent clients, avg 50 queries per client); a sketch of a similar test follows this list
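
If you want to generate similar load on your side, mysqlslap can do it; this is only an illustrative sketch (the -haproxy service name and the credentials are assumptions, so adjust them to your setup):

# ~20 concurrent clients running auto-generated queries against the HAProxy endpoint
mysqlslap --host=<your-db-name>-haproxy.<your-namespace>.svc --user=root --password \
  --concurrency=20 --iterations=5 \
  --auto-generate-sql --auto-generate-sql-load-type=mixed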

I cannot see any crashes or restarts of the pods. So far I’ve had it running for a little over an hour. I’m not sure if this issue is specific to your VMware environment, but when I get the chance, I will try to set this up on VMware nodes.

Can you help me by providing some more information:

  • Logs from the percona-xtradb-cluster-operator
  • Logs from the DB pods before the crash takes place (maybe you can use the kubectl logs --previous command?)
  • Any events from k8s that might suggest a failing liveness/readiness probe? (example commands for gathering all three are sketched below)
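
Something along these lines should collect all three (the operator deployment name and namespace depend on how Everest installed it, so adjust as needed):

# Operator logs
kubectl logs deployment/percona-xtradb-cluster-operator -n <operator-namespace> --tail=500

# Previous (pre-crash) logs from a DB pod
kubectl logs <your-db-name>-pxc-0 -c pxc -n <your-namespace> --previous

# Recent events, newest last
kubectl get events -n <your-namespace> --sort-by=.metadata.creationTimestamp | tail -n 50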

I mean the pods exit and everything stops and tries to recover:

#####################################################
FULL_PXC_CLUSTER_CRASH:quee-pxc-1.queepxc.everest.svc.cluster.local
#####################################################
  1. Explore-logs-logs-data-2025-05-20 17_29_32.csv · GitHub
  2. I posted logs from a single pod, from before the crash until the full crash
  3. Yes, IIRC the HAProxy liveness probe starts failing, then the PXC pods start failing

The PXC logs seem to suggest that the PXC node is not able to join the cluster. This could happen if the inter-pod network connectivity is broken. Can you please check this? Perhaps you have some firewall rules, a NetworkPolicy, or some DNS configuration that might be blocking it? Each pod should be able to access the following ports on the other pods: 3306, 4444, 4567, 4568, 33060, 33062
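
A quick way to see whether any NetworkPolicies exist at all, and that cluster DNS is healthy:

kubectl get networkpolicies -A
kubectl get pods -n kube-system | grep -i coredns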

Thanks for the response. Just to clarify, everything else in the cluster is running fine, and this cluster was working properly earlier as well. No changes were made to the configuration, network policies, or DNS settings before the issues started. Other workloads are still functioning normally…

Here are ~4k lines from one of the pods, captured just now:

Thanks for confirming @Aurimas_Niekis

Just so that we can actually rule out pod connectivity, can you please help us validate this by running the commands below on your cluster?

  1. Get the Pod IP address of the -pxc-0 pod
DBNAME=<your DB name>
DBNAMESPACE=<your DB namespace>
TARGET_IP=$(kubectl get pod -n $DBNAMESPACE $DBNAME-pxc-0 -ojsonpath='{.status.podIP}')
  2. Attach this debug container to the -pxc-1 pod:
kubectl debug -it --container=netshoot --image=nicolaka/netshoot --target=pxc $DBNAME-pxc-1 -n $DBNAMESPACE --env TARGET_IP=$TARGET_IP
  3. In the interactive shell that opens, run the following commands:
ping $TARGET_IP

nc -zv $TARGET_IP 3306
nc -zv $TARGET_IP 33060

If these commands succeed, then you don’t have an issue with the pod connectivity …
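
If you also want to cover the Galera/SST ports listed earlier, the same check can be extended from inside the debug shell:

# Still inside the netshoot shell: SST (4444), replication (4567), IST (4568) and admin (33062) ports
for port in 4444 4567 4568 33062; do nc -zv $TARGET_IP $port; done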

> DBNAME=quee
> DBNAMESPACE=everest
> TARGET_IP=$(kubectl get pod -n $DBNAMESPACE $DBNAME-pxc-0 -ojsonpath='{.status.podIP}')
> kubectl debug -it --container=netshoot --image=nicolaka/netshoot --target=pxc $DBNAME-pxc-1 -n $DBNAMESPACE --env TARGET_IP=$TARGET_IP
Targeting container "pxc". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
If you don't see a command prompt, try pressing enter.
                    dP            dP                           dP   
                    88            88                           88   
88d888b. .d8888b. d8888P .d8888b. 88d888b. .d8888b. .d8888b. d8888P 
88'  `88 88ooood8   88   Y8ooooo. 88'  `88 88'  `88 88'  `88   88   
88    88 88.  ...   88         88 88    88 88.  .88 88.  .88   88   
dP    dP `88888P'   dP   `88888P' dP    dP `88888P' `88888P'   dP   
                                                                    
Welcome to Netshoot! (github.com/nicolaka/netshoot)
Version: 0.13

                                         



 quee-pxc-1  ~  ping $TARGET_IP

PING 10.42.1.39 (10.42.1.39) 56(84) bytes of data.
64 bytes from 10.42.1.39: icmp_seq=1 ttl=62 time=0.238 ms
64 bytes from 10.42.1.39: icmp_seq=2 ttl=62 time=0.173 ms
64 bytes from 10.42.1.39: icmp_seq=3 ttl=62 time=0.199 ms
^C
--- 10.42.1.39 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2027ms
rtt min/avg/max/mdev = 0.173/0.203/0.238/0.026 ms

 quee-pxc-1  ~  nc -zv $TARGET_IP 3306
Connection to 10.42.1.39 3306 port [tcp/mysql] succeeded!

 quee-pxc-1  ~  nc -zv $TARGET_IP 33060

Connection to 10.42.1.39 33060 port [tcp/*] succeeded!

 quee-pxc-1  ~  

Unfortunately, over the past few days, things have gone from shaky to downright unusable. The MySQL cluster now crashes repeatedly, seemingly at random. No data access, no connections, just a complete full crash. For example, just now it restarted itself 3 times within 30 minutes while I was running very basic queries and inserts. It acts as if it just gives up and shuts down completely.

Do you see any MySQL errors when you try to insert data? We need to understand why the pods are being restarted altogether, so please also check the events and so on. From what I can see, you had a full crash and the cluster was able to recover, but after that there was a new crash.
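
For example, the events tied to one of the PXC pods should show whether a failed probe or something else triggered the restart (pod name and namespace taken from earlier in this thread):

kubectl get events -n everest --field-selector involvedObject.name=quee-pxc-1 \
  --sort-by=.metadata.creationTimestamp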

Thanks for the follow-up. We didn’t see any MySQL errors during inserts, only timeouts and “connection refused” messages. We’ve since switched to a basic MySQL instance on bare metal, as the current setup has been quite unstable for us.

The cluster crashes either immediately after a restart or 10-15 minutes later, even with no active usage. No one is using it right now, and it still crashed about 30 minutes ago. From the events, it looks like the liveness probes failed, which triggered the pod restarts…

From what I can see, it looks like either the MySQL server is freezing and stops responding to connections, or something else is causing it to hang. The liveness probe script seems to be timing out consistently. Based on the events, I’m seeing messages like:

Liveness probe failed: command timed out: "/var/lib/mysql/liveness-check.sh" timed out after 7m30s

It seems that the liveness probe can’t connect to MySQL at all, while at the same time you don’t see any MySQL errors. Maybe something is wrong with the storage… As you can see from percona-xtradb-cluster-operator/build/liveness-check.sh at main · percona/percona-xtradb-cluster-operator · GitHub, you can manually create a ‘/var/lib/mysql/sleep-forever’ file to disable the liveness and readiness probes, so that you can connect and do some debugging manually.
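
A rough sketch of that debug flow (pod name and namespace taken from this thread; the root credentials come from your cluster secrets):

# Disable the liveness/readiness probes for this pod
kubectl exec -it quee-pxc-1 -c pxc -n everest -- touch /var/lib/mysql/sleep-forever

# Then connect and poke around manually, e.g. time the same check the probe runs
kubectl exec -it quee-pxc-1 -c pxc -n everest -- bash -c 'time /var/lib/mysql/liveness-check.sh'
kubectl exec -it quee-pxc-1 -c pxc -n everest -- mysql -uroot -p

# Remove the file again when done, so the probes are active again
kubectl exec -it quee-pxc-1 -c pxc -n everest -- rm /var/lib/mysql/sleep-forever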