Percona 5.7.17-11-1 - Binlog Dump GTID leak

Hi All,
I’m running a master/slave setup on Percona Server 5.7.17-11-1 with GTID and semi-sync replication enabled. Roughly once a week, the connections for the replication user on the master are overloaded by Binlog Dump GTID threads stuck in the “Waiting to finalize termination” or “starting” state. Unfortunately I still don’t have a repro, but this is what I see when the issue happens:

  • Semi-sync replication fails for a replica and the semi-sync binlog_dump is stopped on the master

2017-06-19T10:12:58.501501Z 864 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2017-06-19T10:12:58.501533Z 864 [Note] Stop semi-sync binlog_dump to slave (server_id: 1019420603)

  • The slave reconnects to the master, starting a new binlog_dump thread; while initializing the new dump thread for the slave, the master finds a zombie dump thread with the same UUID

2017-06-19T10:14:48.993726Z 84362 [Note] While initializing dump thread for slave with UUID <787d319f-33db-11e7-9c43-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(864).
2017-06-19T10:14:48.993888Z 84362 [Note] Start binlog_dump to master_thread_id(84362) slave_server(1019420603), pos(, 4)

  • The old Binlog Dump GTID thread hangs forever, keeping its connection open

Here’s the full processlist after several iterations of the above:


Id User Host db Command Time State Info Rows_sent Rows_examined
23738 vt_repl_a 10.3.55.166:43566 NULL Binlog Dump GTID 613992 Master has sent all binlog to slave; waiting for more updates NULL 0 0
84394 vt_repl_a 10.3.197.55:59470 NULL Binlog Dump GTID 17998 Waiting to finalize termination NULL 0 0
84441 vt_repl_a 10.3.197.55:60020 NULL Binlog Dump GTID 17718 Waiting to finalize termination NULL 0 0
84521 vt_repl_a 10.3.197.55:60938 NULL Binlog Dump GTID 17228 Waiting to finalize termination NULL 0 0
84592 vt_repl_a 10.3.197.55:61730 NULL Binlog Dump GTID 16807 Waiting to finalize termination NULL 0 0
84630 vt_repl_a 10.3.197.55:62160 NULL Binlog Dump GTID 16597 Waiting to finalize termination NULL 0 0
84641 vt_repl_a 10.3.197.55:62282 NULL Binlog Dump GTID 16527 Waiting to finalize termination NULL 0 0
84651 vt_repl_a 10.3.197.55:62406 NULL Binlog Dump GTID 16457 Waiting to finalize termination NULL 0 0
84663 vt_repl_a 10.3.197.55:62526 NULL Binlog Dump GTID 16387 Waiting to finalize termination NULL 0 0
84675 vt_repl_a 10.3.197.55:62654 NULL Binlog Dump GTID 16317 Waiting to finalize termination NULL 0 0
84685 vt_repl_a 10.3.197.55:62774 NULL Binlog Dump GTID 16247 Waiting to finalize termination NULL 0 0
84747 vt_repl_a 10.3.197.55:63450 NULL Binlog Dump GTID 15896 Waiting to finalize termination NULL 0 0
84784 vt_repl_a 10.3.197.55:63878 NULL Binlog Dump GTID 15686 Waiting to finalize termination NULL 0 0
84895 vt_repl_a 10.3.197.55:65166 NULL Binlog Dump GTID 14985 Waiting to finalize termination NULL 0 0
84911 vt_repl_a 10.3.197.55:65346 NULL Binlog Dump GTID 14915 Waiting to finalize termination NULL 0 0
85923 vt_repl_a 10.3.197.55:20114 NULL Binlog Dump GTID 10883 Waiting to finalize termination NULL 0 0
85934 vt_repl_a 10.3.197.55:20236 NULL Binlog Dump GTID 10813 Waiting to finalize termination NULL 0 0
86220 vt_repl_a 10.3.197.55:22682 NULL Binlog Dump GTID 9622 Waiting to finalize termination NULL 0 0
86231 vt_repl_a 10.3.197.55:22800 NULL Binlog Dump GTID 9552 Waiting to finalize termination NULL 0 0
87098 vt_repl_a 10.3.197.55:29746 NULL Binlog Dump GTID 6181 starting NULL 0 0
87109 vt_repl_a 10.3.197.55:29868 NULL Binlog Dump GTID 6111 starting NULL 0 0
87125 vt_repl_a 10.3.197.55:30046 NULL Binlog Dump GTID 6040 starting NULL 0 0
87135 vt_repl_a 10.3.197.55:30164 NULL Binlog Dump GTID 5970 starting NULL 0 0
87146 vt_repl_a 10.3.197.55:30288 NULL Binlog Dump GTID 5900 starting NULL 0 0
87158 vt_repl_a 10.3.197.55:30416 NULL Binlog Dump GTID 5830 starting NULL 0 0
87168 vt_repl_a 10.3.197.55:30538 NULL Binlog Dump GTID 5760 starting NULL 0 0
87179 vt_repl_a 10.3.197.55:30660 NULL Binlog Dump GTID 5690 starting NULL 0 0
87195 vt_repl_a 10.3.197.55:30838 NULL Binlog Dump GTID 5620 starting NULL 0 0
87205 vt_repl_a 10.3.197.55:30968 NULL Binlog Dump GTID 5550 starting NULL 0 0
87216 vt_repl_a 10.3.197.55:31088 NULL Binlog Dump GTID 5480 starting NULL 0 0
87227 vt_repl_a 10.3.197.55:31212 NULL Binlog Dump GTID 5410 starting NULL 0 0
87237 vt_repl_a 10.3.197.55:31332 NULL Binlog Dump GTID 5340 starting NULL 0 0
87248 vt_repl_a 10.3.197.55:31460 NULL Binlog Dump GTID 5270 starting NULL 0 0
87264 vt_repl_a 10.3.197.55:31640 NULL Binlog Dump GTID 5200 starting NULL 0 0
87274 vt_repl_a 10.3.197.55:31762 NULL Binlog Dump GTID 5130 Waiting to finalize termination NULL 0 0
87285 vt_repl_a 10.3.197.55:31884 NULL Binlog Dump GTID 5060 starting NULL 0 0
87296 vt_repl_a 10.3.197.55:32006 NULL Binlog Dump GTID 4990 starting NULL 0 0
87306 vt_repl_a 10.3.197.55:32134 NULL Binlog Dump GTID 4919 starting NULL 0 0
87317 vt_repl_a 10.3.197.55:32254 NULL Binlog Dump GTID 4849 starting NULL 0 0
87333 vt_repl_a 10.3.197.55:32434 NULL Binlog Dump GTID 4779 starting NULL 0 0
87343 vt_repl_a 10.3.197.55:32558 NULL Binlog Dump GTID 4709 Waiting to finalize termination NULL 0 0
87354 vt_repl_a 10.3.197.55:32682 NULL Binlog Dump GTID 4639 starting NULL 0 0
87365 vt_repl_a 10.3.197.55:32806 NULL Binlog Dump GTID 4569 starting NULL 0 0
87375 vt_repl_a 10.3.197.55:32932 NULL Binlog Dump GTID 4499 starting NULL 0 0
87386 vt_repl_a 10.3.197.55:33056 NULL Binlog Dump GTID 4429 starting NULL 0 0
87402 vt_repl_a 10.3.197.55:33240 NULL Binlog Dump GTID 4359 starting NULL 0 0
87413 vt_repl_a 10.3.197.55:33362 NULL Binlog Dump GTID 4289 starting NULL 0 0
87423 vt_repl_a 10.3.197.55:33484 NULL Binlog Dump GTID 4219 starting NULL 0 0
87434 vt_repl_a 10.3.197.55:33606 NULL Binlog Dump GTID 4149 starting NULL 0 0
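To quantify the leak, the stuck threads can be summarized straight from the processlist. This is a hypothetical diagnostic query (run on the master), not something from the original report:

```sql
-- Count Binlog Dump GTID threads per state and show the oldest one.
-- Healthy masters should have one such thread per replica, in the
-- "Master has sent all binlog to slave" state.
SELECT state,
       COUNT(*)  AS threads,
       MAX(time) AS oldest_seconds
FROM information_schema.processlist
WHERE command = 'Binlog Dump GTID'
GROUP BY state;
```

On an affected master this shows dozens of rows piling up in “Waiting to finalize termination” and “starting”, all from the same replica host.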

I forgot to mention that slave_net_timeout is set to 60s (the default since MySQL 5.7.7). Increasing the value might help avoid the issue, but it’s still not clear to me why so many threads in the “starting” or “Waiting to finalize termination” state are never cleaned up by MySQL.
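For reference, the timeout is changed on each replica like this (300 here is just an illustrative value, not a tested recommendation; the replica threads must be restarted for it to take effect):

```sql
-- On each replica: raise the network timeout from the 60s default
SET GLOBAL slave_net_timeout = 300;

-- Restart the I/O thread so the new timeout is picked up
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

-- Verify
SHOW VARIABLES LIKE 'slave_net_timeout';
```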

It seems that MySQL does not check whether there are multiple zombie dump threads for the same slave UUID: it keeps trying to kill (without success) only the first one it finds in the processlist:


2017-06-20T08:17:32.082995Z 100072 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
2017-06-20T08:27:12.382643Z 100214 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
2017-06-20T08:37:22.692025Z 100369 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
2017-06-20T08:57:33.244768Z 100683 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
2017-06-20T09:07:13.588993Z 100820 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
2017-06-20T09:22:44.021040Z 101063 [Note] While initializing dump thread for slave with UUID <fdf82f3e-550b-11e7-a20f-122ca6236436>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(99436).
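Since the server only ever targets that first zombie (864/99436 above), the remaining ones have to be cleaned up by hand. A sketch of a manual workaround (my own suggestion, review the generated statements before running them, as they drop replication connections):

```sql
-- Generate a KILL statement for every leaked dump thread.
-- Copy the output and execute it manually on the master.
SELECT CONCAT('KILL ', id, ';') AS kill_stmt
FROM information_schema.processlist
WHERE command = 'Binlog Dump GTID'
  AND state IN ('Waiting to finalize termination', 'starting');
```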

I am also hitting the same issue. Replication to one of the slaves is fully stuck because of this problem. I have also raised a bug on mysql: https://bugs.mysql.com/bug.php?id=90857
Have you found any workarounds or solutions so far?

Thanks!