Multi-threaded replication deadlock

Hi all;

Hoping someone out there can help me solve an increasingly pressing issue.

Issue: Multi-threaded replication threads deadlock, causing the sql_thread to stop doing any work (but the io_thread keeps running). Cannot stop the sql_thread; have to kill the server and force a restart to continue. Thread status: Waiting for preceding transaction to commit

Environment:
Version: 8.0.26-17 Percona Server
OS: Ubuntu 18.04.5 LTS

Relevant replication settings:
binlog_format = row
rpl_semi_sync_replica_enabled = on
replica_parallel_workers = 16
replica_parallel_type = logical_clock
replica_preserve_commit_order = on

All tables are using InnoDB.

There are a lot of similar bugs/issues out there related to multi-threaded replica deadlocks, but below is one I found where the comments seem to be what we’re running into:
https://bugs.mysql.com/bug.php?id=95863

This MySQL bug has an associated Percona bug (PS-5641), which is marked as fixed in 8.0.23-14, but we are using a newer version than that and still having the issue. So either it’s a different case, or a regression (likely a different case).

I have limited access to query information since we are using row-based replication, but the simplest case I’ve seen was two replication threads inserting into the same table and same database. However in all the cases I’ve looked at, the queries are all over the place so have yet to see a common theme. I have seen this happen in cases where every replication thread was processing queries from different databases (same server, but different database/schema).

Any thoughts on where to go from here? Due to our workload we can’t disable preserve_order_commit, so that limits our options.

Thanks in advance.
Scott

1 Like

Hi Scott,

I have checked the bug report you mentioned.
One possible fix is updating to Percona Server 8.0.28 (released the 20th of June a few days back) which contains a fix for some related bug MySQL Bugs: #103636: Slave hangs with slave_preserve_commit_order On plus many other fixes for replication that might directly or indirectly be affecting you.

if you still hit the issue after upgrading to 8.0.28 , then please check if “Exec_Source_Log_Pos” keeps increasing while sql_thread seems stuck. If “Exec_Source_Log_Pos” keeps increasing It might happen that you see the message frequently because replication performance is degraded/locked or constantly getting blocked but not completely halted .

Let us know if this worked for you or if you have any other questions and we will gladly assist

1 Like

Hi Carlos;

Thanks for the response. We try to avoid upgrading as a hail marry for fixing an issue, but understand that can be the fix sometimes.

As far as if the replica is processing still, I always check with the below query:
select worker_id,thread_id, applying_transaction from performance_schema.replication_applier_status_by_worker order by applying_transaction;

With that I can see that the stuck worker threads (we have 16, but usually somewhere around 8-4 of them are stuck) will have that same transaction listed indefinitely, while the remaining worker threads sit empty. And I can see that the retrieved GTID set keeps increasing, but the executed GTID set never moves once it gets deadlocked.

Thanks,
Scott

1 Like

Hi again,

if upgrading on the short term is not possible, then likely the only fix is switching replica_parallel_type from “LOGICAL_CLOCK” to “DATABASE”. This will greatly decrease parallel replication efficiency and might make the replica lag behind but should not provoke the deadlock.

Is it possible to setup an extra replica with the new version so that at least you can check if upgrading does fix the issue?

Regards

1 Like

Hi Carlos;

We have many hundreds of databases, so this is a large scale project regardless of what we end up doing. I know that to use DATABASE, we have to disable replica_preserve_commit_order as well. Can you confirm if that has any affect on transaction ordering when using DATABASE? I can’t get a good read from the docs on that, as there are many caveats related to this area. But to your point, replication lag is/was a major issue, so likely any change that reduces replica throughput will be untenable. I realize this isn’t leaving you much room, just being straightforward with what we’re working with and appreciate your responses. =)

Thanks,
Scott

1 Like

Hi again.

You are right. you need to disable replica_preserve_commit_order in order to use replica_parallel_type=DATABASE.

In case of a crash or issues, replicated events might be applied out of order (for different databases) and lastly applied dos not ensure that any previous event was also applied. If you have GTID enabled then it can be easily fixed by restarting replication, but without GTID it can be complicated to resume operations. More info here MySQL :: MySQL 8.0 Reference Manual :: 17.5.1.34 Replication and Transaction Inconsistencies

So, you are limited on options. Either use GTID and replication_parallel_type = DATABASE, or parallel_slave-workers = 1 or upgrading.

Regards!

1 Like

Hi Carlos;

Thanks again. I also saw that replica_parallel_type=DATABASE is going to get deprecated, so was trying to avoid going that route since we would be right back where we started as soon as we were required to stop using that. And we can’t disable parallel workers due to replication lag unfortunately, though that would make my life easier, hah!

So I will go down the path of testing an upgrade and see how that goes. Will be a while before I can fully test a new version and upgrade all of our production databases, but I will try to report back once that is done we we’ve had enough time to see if it might have fixed it.

Thanks,
Scott

1 Like

Last comment.

Sadly MySQL 8 is introducing incompatible changes for minor patches so extensive testing is required.
If you are using xtrabackup keep in mind that you need to upgrade xtrabackup to 8.0.28 version as well.

Regards

2 Likes

Hello again Carlos!

Any chance you can poke someone to look at this bug I submitted for 8.0.28? I’m currently testing the upgrade to see if it’ll solve this issue, but seeing this new error and need to know if it should be of concern or not.

https://jira.percona.com/browse/PS-8313

Thanks!
Scott

1 Like