[Frequent But Random] General error: 2013 Lost connection to MySQL server during query on simple select with limit

Hi,

I apologize for opening a new topic related to the “Lost connection issue” as there are many other similar topics.
My case is a bit different in that I frequently (once or twice a day) see a random “General error: 2013 Lost connection to MySQL server during query” on a simple select with a LIMIT clause against a 5-10 record table, according to our API log.
The strange thing is that the same API is called thousands of times per day without any issues.

Could someone give me directions on how to debug/find the root cause of this?
Thanks in advance.

Debug/Info
Our PXC runs on Kubernetes (Vultr Cloud).
Our error logs show nothing alarming.
There is no restart information shown.
There seems to be no issue with network communication.
Our monitoring shows normal CPU/Memory usage.
The connection count also never reached even half of the configured maximum.

Usual Error
SQLSTATE[HY000]: General error: 9001 Max connect timeout reached while reaching hostgroup 10 after 10457ms (SQL: select * from x limit 1)
SQLSTATE[HY000]: General error: 2013 Lost connection to MySQL server during query (SQL: select DISTINCT with join)

Have you looked at ProxySQL’s logs? Looks like there’s a timeout occurring between proxysql and the backend mysql.
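If you have access to the ProxySQL admin interface (by default on port 6032), the monitor tables record the recent connect and ping checks against each backend. A sketch of what to look at (table and column names are ProxySQL's standard monitor schema):

```sql
-- Run against the ProxySQL admin interface (port 6032 by default).

-- Most recent connect checks against the backends:
SELECT hostname, port, time_start_us, connect_error
FROM monitor.mysql_server_connect_log
ORDER BY time_start_us DESC LIMIT 10;

-- Most recent ping checks:
SELECT hostname, port, time_start_us, ping_error
FROM monitor.mysql_server_ping_log
ORDER BY time_start_us DESC LIMIT 10;

-- Current backend status as ProxySQL sees it:
SELECT hostgroup_id, hostname, port, status
FROM runtime_mysql_servers;
```

A non-NULL `connect_error` or `ping_error` around the time of the 2013 errors would point at the connection between ProxySQL and that backend.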

Hi @matthewb
Thank you for helping me.

I managed to find the following errors in the ProxySQL logs:

00:00:31 MySQL_Monitor.cpp:7780:monitor_galera_process_ready_tasks(): [ERROR] Timeout on Galera health check for cluster1-pxc-0.X.X:3306 after 1229ms. If the server is overload, increase mysql-monitor_galera_healthcheck_timeout.

00:00:35 MySQL_Monitor.cpp:2214:monitor_galera_thread(): [ERROR] Error on Galera check for cluster1-pxc-0.X.X:3306 after 1001ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout or error in creating new connection: Can't connect to MySQL server on 'Y.Y.Y.Y' (110)

00:00:36 MySQL_Session.cpp:1706:handler_again___status_PINGING_SERVER(): [ERROR] Ping timeout during ping on cluster1-pxc-0.X.X:3306 after 200245us (timeout 200ms)

Then, after 30 seconds or so, the cluster is online again.

What's strange is that, when I look at the ping error logs in ProxySQL, I see no errors at all.

select * from mysql_server_ping_log;
> No error log shown (null)

If I look at the error logs for the MySQL backend, the following is the only error I see:

select * from stats_mysql_errors limit 100;

> error: WSREP has not yet prepared node for application use
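For reference, that WSREP error means the node was not in the Synced state at that moment (for example, it was re-joining or catching up with the cluster). A quick way to check a node's sync state, run directly on the PXC node (these are standard Galera status variables):

```sql
-- Run on the PXC node itself:
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- should be 'Synced'
SHOW STATUS LIKE 'wsrep_ready';                -- should be 'ON'
SHOW STATUS LIKE 'wsrep_cluster_status';       -- should be 'Primary'
```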

If the error is caused by a network issue at certain moments, will increasing mysql-monitor_galera_healthcheck_timeout help in such a case?
Thanks in advance.

A ping timeout would be an indication of a network issue. When ProxySQL is having this issue, can you connect directly to the MySQL backend? Then try connecting manually from the ProxySQL server to the MySQL backend and see if that also has issues.

Hi @matthewb
Thank you for your help.
Because the issue only lasts around 30 seconds at a time, it is not really possible for me to debug it that way.
But it seems the issue is that our ProxySQL cannot reach the MySQL backend at those points in time.

In the case of the network loss for 30 seconds or so like this, will adjusting mysql-monitor_galera_healthcheck_timeout help?
I am sorry I am very new to this. I am not sure what might be the complications if I adjust the health check timeout.

Please kindly give me any suggestions.

That’s not normal network behavior. Do you have a faulty switch? Are you using hostnames/DNS anywhere? NAT/Firewall device?

Set up fping on proxysql server to run in the background, pinging several machines, local and remote. If a remote always has ping uptime, but you see loss to local network, then something else is wrong.

@matthewb Thank you for your help.

Sorry for not being able to get back to you earlier.
Actually, we host our PXC on a cloud provider (no hostname, only k8s SVC names).
It seems the issue occurs at times of network bottlenecks / high traffic load on the cloud provider's side.
That is why it appears at random.

If the issue only happens for 30-40 seconds at that random time, would it be possible to increase the health check timeout for this matter?

Thank you in advance for your help.

Yes, you can increase the health check timeout for this purpose.
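A sketch of how to do that from the ProxySQL admin interface (the 3000 ms values here are an assumption — tune them for your environment; the variable names are ProxySQL's standard monitor settings):

```sql
-- Values are in milliseconds.
UPDATE global_variables SET variable_value='3000'
 WHERE variable_name='mysql-monitor_galera_healthcheck_timeout';
UPDATE global_variables SET variable_value='3000'
 WHERE variable_name='mysql-monitor_connect_timeout';

-- Apply and persist the change:
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```

Note the trade-off: a longer health check timeout keeps nodes from being marked offline during brief network blips, but it also delays detection of a genuinely failed node, so failover takes correspondingly longer.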