Replication lag issue between two PXC clusters with status message "Waiting for dependent transaction to commit" on replica

We have a client running three three-node PXC clusters (nine nodes in total) in OpenShift, using operator version 8.0.29-21.1. In this configuration, the primary cluster replicates to the other two clusters simultaneously, similar to the diagram below.

This morning we received an alert that replication on one of the replicas was lagging behind the primary. I ran SHOW REPLICA STATUS on the replica in question and noted that Replica_SQL_Running_State has a value of “Waiting for dependent transaction to commit” (see below).

mysql> show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: a6bd61dd0a27847d58378a12ad3f02f5-6e281c2d79690133.elb.us-west-2.amazonaws.com
Source_User: replication
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: binlog.003030
Read_Source_Log_Pos: 13597329
Relay_Log_File: wordpress-db-pxc-0-relay-bin-awspxc1_to_vmwarepxc1.000016
Relay_Log_Pos: 7350063
Relay_Source_Log_File: binlog.003030
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 7349853
Relay_Log_Space: 13597868
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 20965
Source_SSL_Verify_Server_Cert: Yes
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 41819022
Source_UUID: cd443a8a-b942-11ed-8612-0a580a210614
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Source_Retry_Count: 5
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: 6235c032-b942-11ed-863e-865d3b297558:1870258-1907832
Executed_Gtid_Set: 148dd969-d3c3-11ed-815e-0a580a80026b:1-12,
16f0a566-d3c3-11ed-9642-ee92fdae9413:1-842,
6235c032-b942-11ed-863e-865d3b297558:1-1898094
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: awspxc1_to_vmwarepxc1
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:

All nodes in all clusters are using the same configuration.

I checked the replica on the other cluster and it’s not having any replication lag issues. I then checked the replica_parallel_type and replica_parallel_workers variables and noted the parallel workers variable is set to the default value of 4.
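For reference, a minimal sketch of how these settings can be checked on a replica. The variable names are the MySQL 8.0.26+ spellings; replica_preserve_commit_order is not one of the variables mentioned above and is included only on the assumption that it is worth looking at alongside the parallel applier settings.

```sql
-- Parallel applier settings: replica_parallel_type and replica_parallel_workers.
SHOW GLOBAL VARIABLES LIKE 'replica_parallel%';

-- Assumption: also worth checking, since it controls whether applier workers
-- must commit in the same order as on the source.
SELECT @@GLOBAL.replica_preserve_commit_order;
```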

This replica has been like this for a while now and it’s behaving like it’s in a deadlock situation. Therefore, I’m trying to determine how to fix this without losing any data.

Does anybody have any recommendations for resolving this issue?

Thanks!

What happens when you restart the replication thread?

stop replica;start replica;

Is the thread still in a deadlock situation, or do you see Exec_Source_Log_Pos moving?
I'm trying to understand whether it is stuck in a deadlock continuously, or whether it deadlocks intermittently and then resolves on its own.
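Since the status output shows a named channel, the restart can also be limited to that channel. This is only a sketch; the channel name is taken from the Channel_Name field in the output above and should be replaced if yours differs.

```sql
-- Restart replication for the one channel only.
STOP REPLICA FOR CHANNEL 'awspxc1_to_vmwarepxc1';
START REPLICA FOR CHANNEL 'awspxc1_to_vmwarepxc1';

-- Run this a few times and compare Exec_Source_Log_Pos and Executed_Gtid_Set
-- between runs to see whether the applier is making progress.
SHOW REPLICA STATUS FOR CHANNEL 'awspxc1_to_vmwarepxc1'\G
```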

On the original replica, the one this post was about, stopping the replica did nothing; the command just hung. We ended up rebuilding the replica.

Yesterday, I noticed the other replica is doing the same thing and has been in this state for at least a day. The Exec_Source_Log_Pos is not moving. The Retrieved_Gtid_Set keeps incrementing, but the Executed_Gtid_Set hasn’t changed. The Replica_SQL_Running_State remains “Waiting for dependent transaction to commit”.
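One way to see what each applier worker is doing in this state is the Performance Schema worker table. This is a sketch, not output from the affected system; the channel name below is the one from the first replica's status output and is an assumption for this replica, so substitute the real one.

```sql
-- One row per applier worker on the channel: which transaction (GTID) each
-- worker is applying, the last one it finished, and any recorded error.
SELECT WORKER_ID,
       THREAD_ID,
       SERVICE_STATE,
       APPLYING_TRANSACTION,
       LAST_APPLIED_TRANSACTION,
       LAST_ERROR_NUMBER,
       LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'awspxc1_to_vmwarepxc1';  -- assumed channel name
```

If APPLYING_TRANSACTION stays the same across repeated runs, the workers are not progressing, which would be consistent with the unchanged Executed_Gtid_Set.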

Hi,
The “Waiting for dependent transaction to commit” state means a transaction is waiting for other transactions in the same group commit to complete before it can commit. This is caused by “replica_preserve_commit_order = on”. Transactions within a group commit are independent and could, at least in theory, commit in a different order. With PXC, since each transaction must be acknowledged by all cluster nodes, removing that constraint may help.
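A minimal sketch of how the setting could be changed at runtime, assuming the applier can actually be stopped (earlier in this thread, stopping the replica hung on one node, so this may not always be possible). The applier threads need to be stopped before the variable is changed, and SET GLOBAL does not survive a server restart.

```sql
-- Stop only the applier (SQL) threads; the receiver keeps fetching events.
STOP REPLICA SQL_THREAD;

-- Allow applier workers to commit without preserving the source's commit order.
SET GLOBAL replica_preserve_commit_order = OFF;

START REPLICA SQL_THREAD;

-- Confirm the new value.
SELECT @@GLOBAL.replica_preserve_commit_order;
```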


Yes, setting “replica_preserve_commit_order = off” took care of the issue.

Thanks!