The problem gets worse by the day. Today we had the first complaints from a client.
But the monitoring of server resources still doesn’t show anything noticeable that correlates with it.
Now we also had something in the mysql/error.log for the first time:
2024-05-24T11:38:24.927349Z 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we’re sending the information to the error-log instead: MY-001160 - Got an error writing communication packets
2024-05-24T11:38:24.927380Z 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we’re sending the information to the error-log instead: MY-001156 - Got packets out of order
2024-05-24T11:38:24.927484Z 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we’re sending the information to the error-log instead: MY-001160 - Got an error writing communication packets
2024-05-24T11:38:24.927501Z 0 [ERROR] [MY-013129] [Server] A message intended for a client cannot be sent there as no client-session is attached. Therefore, we’re sending the information to the error-log instead: MY-001156 - Got packets out of order
I didn’t count them, but there must be a few hundred within a single minute.
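To get an actual number instead of an estimate, the entries could be counted per minute. This is only a minimal sketch: it assumes the log keeps exactly the line layout shown above (timestamp, thread id, `[ERROR] [MY-013129]`, then the redirected inner code), and the function name `count_per_minute` is made up for illustration.

```python
import re
from collections import Counter

# Captures the timestamp truncated to the minute and the inner error code
# that the server redirected to the log ("... instead: MY-001160 - Got ...").
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\S+ \d+ \[ERROR\] \[MY-013129\] .* (MY-\d{6}) - "
)


def count_per_minute(lines):
    """Return a Counter keyed by (minute, inner error code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Possible usage (the path is an assumption, adjust it to the real log file):
# with open("/var/log/mysql/error.log", errors="replace") as log:
#     for (minute, code), n in sorted(count_per_minute(log).items()):
#         print(minute, code, n)
```

The per-minute buckets can then be put next to the CMS log timestamps to check whether the two really do not correlate.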
And again we have an accumulation of these errors in the CMS log. But they don’t correlate with the MySQL error log:
What has changed recently in your environment? As you said above, things were working just fine. Then suddenly things stopped working. That doesn’t simply happen without other factors. All of the errors we see are client related. Have you recently upgraded any libraries, or other components of your code, or infrastructure? Even something innocuous, like zlib, or openssl, could be causing issues, even for socket-based connections as sockets still use the mysql client library and underlying libs.
On this (physical, not virtual) server we are still using Debian 10 with LTS. We do the regular package updates with apt. All from official sources, including Percona repository.
So IMHO it’s nothing special about this server besides the fact that the Linux distribution is a bit old.
We use the same Percona MySQL version 8.0.36-28 on a few other production servers (virtual instances) with Debian 11. No issues there.
Then we have one Debian 10 VM for testing where we didn’t do an update for a while. It has Percona MySQL 8.0.34-26. No issues there either.
I compared the loaded libraries on both Debian 10 servers and there’s no difference:
I also checked some of the package versions. libssl is exactly the same. libc is 2.28-10+deb10u2 (VM) vs 2.28-10+deb10u3 (physical). libstdc++ is the same.
I will now update the Deb 10 VM to the same versions as the physical machine. Let’s see if we start to get the same issues afterwards.
I have now been running the two machines, one physical production server and one virtual test server, for over a month with exactly the same distribution and package versions installed.
The production server continues to have these communication packet issues at least once a day. On the virtual machine there were no such issues at all.
I really can’t imagine a reason why it would have issues on a physical server while on a VM it’s running totally fine…
IMHO there’s only one significant difference between both servers, and that’s the load.
It’s hard to reproduce the real-life load of a production server on a test machine…
But then I also discovered an old issue with FEDERATED engine, which we use on those servers.
I already reported and helped to fix an issue with this a while ago. Now it looks like it’s back!
I’m not sure yet if it’s the exact same problem and if the issue we have here is related.
But the FEDERATED engine opens a lot of remote connections (TCP) and leaves them open until the remote server closes them. With the default connection timeout of 8 hours this can lead to a lot of waiting TCP sockets.
Maybe this hits some internal limit of the MySQL daemon?
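One way to check whether FEDERATED connections really pile up would be to count TCP sockets per state on the machine. The following is a rough sketch, not a finished tool: it parses the Linux `/proc/net/tcp` table, and the default port 3306 for the remote server is an assumption, as are the function names.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp (Linux kernel).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp, remote_port=None):
    """Count sockets per TCP state, optionally only those to remote_port."""
    counts = Counter()
    for row in proc_net_tcp.splitlines()[1:]:  # the first line is the header
        fields = row.split()
        if len(fields) < 4:
            continue
        # Addresses are hex "IP:PORT"; the port follows the last colon.
        port = int(fields[2].rsplit(":", 1)[1], 16)
        if remote_port is not None and port != remote_port:
            continue
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


def snapshot(remote_port=3306, path="/proc/net/tcp"):
    """Read the live table; 3306 is only the assumed port of the remote server."""
    with open(path) as table:
        return count_states(table.read(), remote_port)
```

Calling `snapshot()` at regular intervals and logging the result would show whether the number of outgoing connections to the remote MySQL servers keeps growing. IPv6 connections live in `/proc/net/tcp6`, which can be passed as `path`.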
Unfortunately, I can’t just disable FEDERATED on the production server to see if the problem goes away.
Therefore I’m still trying to reproduce these errors on the test server.
Things didn’t get better.
The same error over and over again, in every DB client: CMS, phpMyAdmin, mysqldump… But only on this bare-metal server.
Until a few days ago, when I did two things:
I did a complete reinstall of the Percona packages, but with telemetry disabled.
And the other thing I did was disabling the Zabbix scripts for MySQL.
Since that day, the problem has disappeared.
The Zabbix monitoring has always been there. Also on other servers (VMs) which don’t have the issue.
Is it really the telemetry module that induces these problems?!
I will wait a few more days and then re-enable the Zabbix scripts.
It’s been quite a while… And things got even worse.
The DB is mostly unusable. Now I get those warnings almost every minute. So everything has to be done at least twice to be successful.
In the meantime I have upgraded Debian to Bullseye. But the issue remains.
But I finally figured out what the difference is between this physical server and the VMs with exactly the same OS and package versions:
The VMs are shut down once a week for backup. The bare-metal server, however, is rarely restarted.
And indeed: after restarting the MySQL daemon on the bare-metal server, the DB works for several days without those write errors.
In my opinion, this means that something is escalating over time. Some resources may not be released, in this case most likely socket handles.
For this reason I started monitoring open sockets and system call errors. I also looked in the Percona source code for the place where system calls are made to sockets.
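The socket monitoring could look roughly like the sketch below. It is only an illustration under assumptions: it relies on the Linux `/proc/<pid>/fd` directory, needs to run as root or as the mysql user, and the helper names `classify` and `fd_targets` are invented here.

```python
import os
from collections import Counter

# Link targets in /proc/<pid>/fd that are not regular paths start with one
# of these prefixes, e.g. "socket:[12345]" or "anon_inode:[eventpoll]".
SPECIAL_PREFIXES = ("socket:", "pipe:", "anon_inode:")


def classify(targets):
    """Group descriptor link targets by kind: socket, pipe, anon_inode, file."""
    counts = Counter()
    for target in targets:
        if target.startswith(SPECIAL_PREFIXES):
            counts[target.split(":", 1)[0]] += 1
        else:
            counts["file"] += 1
    return counts


def fd_targets(pid):
    """Yield the link target of every open descriptor of the given process."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            yield os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
```

Logging `classify(fd_targets(pid))` for the mysqld PID every few minutes would show whether the `socket` count climbs steadily between restarts, which is what a handle leak would look like.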
But I think this thread is already too long and too old and nobody reads it anymore. So I’ll either open an issue on GitHub or start a new thread. Or both.