MySQL 8.0.24 / 8.0.25 connection problems with `FEDERATED` storage engine (Communication link failure)

Since latest Percona Server for MySQL 8.0.25 upgrade, I am struggling with severe connection issues between two servers. I could track this down to FEDERATED storage engine.

On application/client level, the sporadic errors appear like this:

Communication link failure: 1160 Got an error writing communication packets
Communication link failure: 1156 Got packets out of order

It’s hard to debug as those errors don’t appear in mysql.err, not even with log_error_verbosity = 3. Looks like internal connections from FEDERATED tables to server are not getting logged.

Those errors only pop up after a couple of hours. Initially the workaround was to restart MySQL on the host with FEDERATED tables (see below, host backend) every 2-3 hours, which made the connection dropping disappear.

Here’s my setup (all on Percona Server for MySQL 8.0.25):

  • Host backend: PHP 8.0 application (using PDO/mysqlnd) with MySQL triggers that copy data from main application db to second mailsync database. mailsync database only contains tables of storage engine FEDERATED that write data to remote host mail
  • Host mail: REPLICATION MASTER for database mailsync
  • Host mx1: 1st REPLICATION SLAVE / REPLICA for database mailsync
  • Host mx2: 2nd REPLICATION SLAVE / REPLICA for database mailsync

I have that rather complex setup running in production for 7 months now on Percona Server for MySQL 8.0, without any issues until and including MySQL 8.0.23. Data in FEDERATED tables is accessed more or less frequently (5-10 times/hour). Only the latest upgrade to 8.0.25 broke it. The connection issues appear immediately without running into any timeout, and also affect simple SELECT queries accessing very little data.

I suspect this has something to do with the newly introduced connection management in MySQL 8.0.24:

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-24.html#mysqld-8-0-24-connection-management

Connection Management Notes

Previously, if a client did not use the connection to the server within the period specified by the wait_timeout system variable and the server closed the connection, the client received no notification of the reason. Typically, the client would see Lost connection to MySQL server during query (CR_SERVER_LOST) or MySQL server has gone away (CR_SERVER_GONE_ERROR).

In such cases, the server now writes the reason to the connection before closing it, and client receives a more informative error message, The client was disconnected by the server because of inactivity. See wait_timeout and interactive_timeout for configuring this behavior. (ER_CLIENT_INTERACTION_TIMEOUT).

The previous behavior still applies for client connections to older servers and connections to the server by older clients.

As I have that same setup of 4 related MySQL servers running in production in two different companies (my own webhosting company and a similar scaled hosting company I manage the infrastructure for), I now have a great way I can prove it is really MySQL > 8.0.23 and FEDERATED storage engine related. I have downgraded the 4 servers (by full data dump and fresh MySQL reinstall / reloading all data) of one company to Percona Server for MySQL 8.0.23 - the problem no longer pops up!

On MySQL 8.0.25 I have also found a workaround to make the problem (nearly… 2 days of testing is not enough yet) disappear:

[mysqld]
interactive_timeout     = 86400
wait_timeout            = 86400

Raising the wait_timeout on all 3 hosts (mainly mail as FEDERATED table “server”, but I also raised those values on MySQL replication slaves, which was probably not needed) from default 8h to 24h is currently my best workaround. Before I tried to lower wait_timeout to 1h, and indeed the connection issues got much more frequent, 1+ hrs after restarting MySQL. So it looks like MySQL internally keeps track of its FEDERATED connections to the remote server and once the server with drops inactive connections after wait_timeout, FEDERATED still tries to re-use them and struggles with the new MySQL 8.0.24+ Connection Management. In my eyes, FEDERATED should silently try to reconnect in case a previous server connection got dropped due to a timeout, as it was before.

Can you tell if this is a known bug that was possibly already fixed in latest MySQL 8.0.26 or if I am the first one reporting those issues?

Thanks, Philip

2 Likes

Hi, same issue here. Are you sure that this is connected to the MySLQ update to > 8.0.24?

On Debian there were very long timeout settings by default. After changing them we also started to get these error messages on federated tables.

1 Like

this bug was identified and fixe in 8.0.28 [PS-7999] FEDERATED engine not reconnecting on wait_timeout exceeded - Percona JIRA

2 Likes

very nice, thanks for the link! Didn’t occur for quite some time, so I can confirm it was fixed in MySQL 8.0.28.

I have now reverted connect_timeout and interactive_timeout back from 24h to its defaults (8h). That was my workaround which probably never had a real effect.

1 Like

While this bug never popped up in MySQL 8.0.28 and 8.0.29, it now starts again bugging us since having upgraded to Percona Server for MySQL 8.0.30 ( 8.0.30-22, tested on Debian Bullseye)

so, it looks like there is a regression for [PS-7999] FEDERATED engine not reconnecting on wait_timeout exceeded in MySQL 8.0.30.
@Kamil_Holubicki Can you please reopen that issue?

1 Like

Hello @onlime
Thank you for the info.
Could you please create a new Jira ticket https://jira.percona.com/projects/PXC?
Please provide detailed steps to allow us to reproduce the problem.

If possible, could you please try MySQL 8.0.31 (Oracle) and confirm you see the problem as well?

1 Like

Hi @Kamil_Holubicki @onlime
is there any update on this issue?

We’re using the latest version 8.0.32-2 on Debian 11 and the error still occurs.

Hey @bitone – Last time I tested this in production on Percona for MySQL 8.0.31 and the problem still occurred. Again, I had to revert back to my workaround which does the trick (potentially the problem could still occur, but didn’t during the last 2 months with that configuration), raising those timeouts from default 8h to 24h:

interactive_timeout: 86400
wait_timeout: 86400

so far, this also works on 8.0.32 for more than a month.

But yes, this issue should be reopened and addressed. @Arnaud ?

1 Like

@onlime Thank you for your quick update!

We tried this solution in the past. But with quite a lot of users we had 10+ connections more each day. So I wrote a script which reads a table of all sleeping threads and kills the ones who were waiting for more than a few hours… Not nice but at least it reduces the problem to an acceptable level.

I did the following test:
2 Percona Servers, both 8.0.32

Server 2:
config:

port=5000
wait_timeout=4
interactive_timeout=4
create database test;
create table test.f (id int primary key);
insert into test.f values(1);

Server 1:
config:

federated=1
create database test;
create server test foreign data wrapper test1 options(user 'root', password '', host '127.0.0.1', port 5000, database 'test');
create table test.f (id int primary key) engine=federated connection='test';

mysql> select * from test.f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0,02 sec)

//wait more than 4 seconds

mysql> select * from test.f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0,01 sec)

I debugged it a little and reconnection is done in client.cc::cli_advanced_command()

Looks like everything works as expected.

Could you please try the above and post the result?

Sorry I didn’t have time yet to test this.
This is how it looks on our server right now:
image

And that’s exactly how it looked before the fix from version 8.0.28… But maybe it’s another issue this time? I wonder why the Fed. engine keeps the connections open but doesn’t reuse them…

@Kamil_Holubicki
Finally I did as you said. And this is the result:

ERROR 1156 (08S01): Got packets out of order
mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.02 sec)

mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.01 sec)

mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.01 sec)

mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.01 sec)

mysql>  select * from f;
ERROR 1156 (08S01): Got packets out of order

So when I wait for > 4 seconds, it seems fine. But when I do it in random intervals, I get the “Got packets out of order” error quite frequently.

Something must have changed again to the bad in the past few releases…

It’s quite obvious in our server log from a CMS that uses some synchronized data of a remote federated DB. After the patch in 8.0.28 these errors did totally disappear. And now they came back and we didn’t change anything except doing the regular apt update + upgrade.

It seams that I did not do the test well. I probably waited to short.
I repeated the test and now the error appears each time after waiting 5 or 6 seconds:

mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.01 sec)

mysql>  select * from f;
ERROR 1156 (08S01): Got packets out of order
mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.02 sec)

mysql>  select * from f;
ERROR 1156 (08S01): Got packets out of order
mysql>  select * from f;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.02 sec)

mysql>  select * from f;
ERROR 1156 (08S01): Got packets out of order

It seems to me that the federated engine expects that the connection is still there but it isn’t.

Well, great :roll_eyes:

I did some digging in the Git history.

Here the error was fixed: PS-7999: FEDERATED engine not reconnecting on wait_timeout exceeded · percona/percona-server@9b9cfc9 · GitHub

And here the fix was reverted (on Nov 24, 2022): Revert "PS-7999: FEDERATED engine not reconnecting on wait_timeout ex… · percona/percona-server@8a44ea6 · GitHub

Why?
After doing some more reading: It say in Percona Git:

This reverts commit 9b9cfc9 in favour of upstream mysql/mysql-server@70ef8034a

But the this upstream fix was never applied to Percona ha_federated.cc

Mai 2nd: Opened bug report [PS-8747] FEDERATED engine not handling wait_timeout MySQL Bug #33500956 - Percona JIRA

Thank you for opening the ticket in Jira. Let’s continue the investigation there.

Good news: This will be fixed in the pending release 8.0.33-25!

Thank you @Kamil_Holubicki

1 Like