Pt-table-sync MySQL server has gone away

Using pt-table-sync I keep getting MySQL server has gone away [for Statement “SET NAMES ‘utf8mb4’”] at line 10755 while doing dbname.large_table_name on hostname.

Any ideas appreciated. Thanks!

I am using percona-toolkit version. percona-toolkit.x86_64 0:3.5.7-1.el7

The master server is mysql version 8.0.21-12
The slave server is mysql version 8.0.32-24
Do they need to be the same or a similar version?

The pt-table-sync statement being executed on the master is:
pt-table-sync --verbose --sync-to-master h=remote_server_name,u=username,p=passwd,P=port --tables large_table_name --execute

wait_timeout = 259200 seconds which is 3 days.

Replication is not behind.
Show Processlist displays “waiting for replica to advance” statement for a few seconds at a time, and then runs a select statement.

STATE: Waiting for the replica SQL thread to advance position
INFO: SELECT MASTER_POS_WAIT('bin-log.004755', 772425617, 60 )

INFO: SELECT /*server_name.large_table_name:32469/69782*/ 32468 AS chunk_num, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `field1`,`field2`,`field3`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `databasename`.`field1`,`field2`,`field3` FORCE INDEX (`PRIMARY`) WHERE (``field1_id` >= '65161270' AND `field2_id` < '65163277') LOCK IN SHARE MODE
1 row in set (0.00 sec)

The output is this info below. It appears it ran for about 9 hours before failing?

# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
#      0       0      0      0 Chunk     00:28:53 08:19:01 0    large_table_name
MySQL server has gone away [for Statement "SET NAMES 'utf8mb4'"] at line 10755 while doing dbname.large_table_name on servername

Hello @Ron

Please check mysql error log as well to see if it has any message for failing connection and mysql service was up and running at that time.

Hi @lalit.choudhary,
I did see an error in the mysql log which seems to be related to the issue.
I thought that the wait_timeout was set to a much larger value by the my.cnf config file, but it appears that pt-table-sync defaults to 10000 seconds so it eventually timed out because it was idle. I don’t understand why it detected mysql as being idle for over 2.5 hours though and then disconnected? Mysql seems to be “waiting for the replica sql thread to advance position" every few seconds and then runs a select statement? Maybe at some point it doesn’t advance its position for many hours?

I’m running pt-table-sync against a single table in this case instead of the whole database to try and narrow down where it’s failing. I’d assume that pt-table-sync would just process the table as usual?
Thanks!

The latest error when running pt-table-sync on the table was this. So even though the wait_timeout seemed to occur at 14:06 it wasn’t until 21:12 that pt-table-sync actually stopped.

# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
#      0       0      0      0 Chunk     13:20:52 21:12:03 0    db_name.table_name
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::mysql::db handle ;host=db_hostname;port=xxxx;mysql_read_default_group=client at /bin/pt-table-sync line 10598.

Here is the mysql error log with the wait_timeout exceeded error. The [Note]'s about multi-threaded slaves aren’t an issue as far as I know?

2024-01-16T16:06:12.219230-08:00 2021 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel '': seconds elapsed = 124; events assigned = 3129105409; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 544941225792100 waited (count) when Workers occupied = 26223 waited when Workers occupied = 367963900
2024-01-16T16:07:33.120618-08:00 504338 [Note] [MY-013730] [Server] 'wait_timeout' period of 10000 seconds was exceeded for `my_username`@`%`. The idle time since last command was too long.
2024-01-16T16:07:33.149949-08:00 504338 [Note] [MY-010914] [Server] Aborted connection 504338 to db: 'unconnected' user: 'my_username' host: 'my_hostname' (The client was disconnected by the server because of inactivity.).
2024-01-16T16:08:12.179434-08:00 2021 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 3129255937; work
er queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 544975976349100 waited (count) when Workers occupied = 26223 waited when Workers occupied = 367963900

Hello @Ron

Thank you for the details.

This is an expected behavior, as mentioned in pt-table-sync — Percona Toolkit Documentation

–timeout-ok

Keep going if --wait fails.

If you specify --wait and the slave doesn’t catch up to the master’s position before the wait times out, the default behavior is to abort. This option makes the tool keep going anyway. Warning: if you are trying to get a consistent comparison between the two servers, you probably don’t want to keep going after a timeout.

As the above option use comes with Warning , I think you check why replication lags in 1st place and fix that so timeout will not happen.
there could be multiple reasons for replication lag most common are,

  • Single threaded replication thread can’t catchup with Source load.
  • IO, OS resources saturation
  • Missing Primary/Unique Key
  • etc.

Some references that might help:

Thanks Lalit that makes sense. I’ll look into those links you posted to see if I can figure out why replication is slow.
I am using multi-threaded replication and almost all tables have unique keys.

On the replica process list it seems to be spending a lot of time executing this query.
SELECT MASTER_POS_WAIT(‘bin-log.054986’, 427572618, 60 ) Query | Waiting for the replica SQL thread to advance position.