Frequent sporadic "MySQL server has gone away" errors with Operator for MySQL (XtraDB Cluster) and HAProxy

Description:

We are running a Percona Operator for MySQL (XtraDB Cluster) deployment with 3 PXC replicas and 3 HAProxy replicas. The cluster serves a PHP application that connects via PDO to the <cluster-name>-haproxy Service.

The cluster appears to be healthy; however, every few minutes we see an error like this (as reported by PHP):

SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

Sometimes, this message is seen instead:

SQLSTATE[08S01]: Communication link failure: 1158 Got an error reading communication packets

Debugging attempted so far

We disabled HAProxy and connected directly to the <cluster-name>-pxc Service instead. For the whole time that change was live, we saw no further errors. As soon as we switched back to <cluster-name>-haproxy, the errors returned.

We are using the default HAProxy configuration provided by the operator, and I'm not sure where to start troubleshooting. As far as I can see, the HAProxy logs show a mixture of CD and SD termination_state codes, e.g.:

[pod/mysql-haproxy-0/haproxy] {"time":"16/May/2025:07:52:20.337", "client_ip": "10.244.6.140", "client_port":"35524", "backend_source_ip": "10.244.7.208", "backend_source_port": "34858", "frontend_name": "galera-in", "backend_name": "galera-nodes", "server_name":"mysql-pxc-0", "tw": "1", "tc": "1", "Tt": "2", "bytes_read": "83", "termination_state": "SD", "actconn": "233", "feconn" :"232", "beconn": "231", "srv_conn": "231", "retries": "0", "srv_queue": "0", "backend_queue": "0" }

[pod/mysql-haproxy-0/haproxy] {"time":"16/May/2025:07:47:07.227", "client_ip": "10.244.6.140", "client_port":"46498", "backend_source_ip": "10.244.7.208", "backend_source_port": "40846", "frontend_name": "galera-in", "backend_name": "galera-nodes", "server_name":"mysql-pxc-0", "tw": "1", "tc": "148", "Tt": "318405", "bytes_read": "1189188", "termination_state": "CD", "actconn": "245", "feconn" :"244", "beconn": "243", "srv_conn": "243", "retries": "0", "srv_queue": "0", "backend_queue": "0" }
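For what it's worth, if the eventual fix turns out to be a client/server timeout change, my understanding from the operator documentation is that a custom HAProxy configuration can be supplied through spec.haproxy.configuration in the cluster CR. A minimal, untested sketch of what I think that would look like (the field name follows my reading of the docs; the timeout values are placeholders, not what we currently run):

# Hypothetical sketch only, not applied to our cluster. Field name and layout
# are from my reading of the operator docs for spec.haproxy.configuration;
# timeout values are placeholders.
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: mysql                     # stands in for <cluster-name>
spec:
  haproxy:
    enabled: true
    size: 3
    configuration: |
      defaults
        log global
        retries 10
        timeout connect 10s
        timeout client  28800s    # assumption: align with MySQL wait_timeout
        timeout server  28800s

My understanding is that a supplied configuration replaces the operator's default HAProxy config rather than merging with it, so a real override would need to carry over the full global/defaults sections from the default config, not just this fragment. We have not tried this yet.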

Any guidance on what to change to resolve this would be very much appreciated, as these errors are arriving regularly.

Version:

  • Operator: 1.15.0
  • PXC: percona/percona-xtradb-cluster:8.0.35
  • HAProxy: percona/haproxy:2.8.5

Hi @jhwalker, please check this Jira task.
Do you have any errors or warnings in the PXC log?

Thanks for the reply; unfortunately, it looks like I can’t see this task:

(Logged into JIRA using my forum email address)

There were no errors or warnings in the PXC log.


Please let me know what other information would help. We are currently bypassing HAProxy again because this is a production service, with PHP handling up to 100 requests/second at peak. We typically see about 1-2 errors per minute, sometimes more: a very small percentage of overall requests, but enough to impact users and flood us with error reports.

We can temporarily switch back to HAProxy to collect debugging data. In the meantime, we are connecting to PXC via a custom Kubernetes Service that always targets the <cluster-name>-pxc-0 Pod; if that Pod fails and we need to switch to a replica, we will fail over manually by updating the Service's selector.
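For context, this is roughly what that interim Service looks like (the Service name is a placeholder; we pin it to the first Pod via the standard StatefulSet pod-name label):

# Interim Service, names are placeholders. Traffic is pinned to the first PXC
# Pod via the label the StatefulSet controller sets on each Pod; manual
# failover means editing this selector to point at the -1 or -2 Pod.
apiVersion: v1
kind: Service
metadata:
  name: mysql-pxc-primary                              # placeholder name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: mysql-pxc-0    # changed manually on failover
  ports:
    - name: mysql
      port: 3306
      targetPort: 3306

Failover is then just a manual edit (or kubectl patch) of that selector.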