We have a client who is running three three-node PXC clusters (nine nodes in total) in OpenShift using operator version 8.0.29-21.1. In this configuration, the primary cluster replicates to the other two clusters simultaneously, similar to the diagram below.
An alert was received this morning that replication on one of the replicas was lagging behind the primary. I went to the replica in question, ran SHOW REPLICA STATUS, and noted that Replica_SQL_Running_State has a value of “Waiting for dependent transaction to commit” and Seconds_Behind_Source is 20965 (see below).
mysql> show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: a6bd61dd0a27847d58378a12ad3f02f5-6e281c2d79690133.elb.us-west-2.amazonaws.com
Source_User: replication
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: binlog.003030
Read_Source_Log_Pos: 13597329
Relay_Log_File: wordpress-db-pxc-0-relay-bin-awspxc1_to_vmwarepxc1.000016
Relay_Log_Pos: 7350063
Relay_Source_Log_File: binlog.003030
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 7349853
Relay_Log_Space: 13597868
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 20965
Source_SSL_Verify_Server_Cert: Yes
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 41819022
Source_UUID: cd443a8a-b942-11ed-8612-0a580a210614
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Source_Retry_Count: 5
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: 6235c032-b942-11ed-863e-865d3b297558:1870258-1907832
Executed_Gtid_Set: 148dd969-d3c3-11ed-815e-0a580a80026b:1-12,
16f0a566-d3c3-11ed-9642-ee92fdae9413:1-842,
6235c032-b942-11ed-863e-865d3b297558:1-1898094
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: awspxc1_to_vmwarepxc1
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
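For reference, the backlog of transactions that have been retrieved but not yet applied can be worked out from the two GTID sets in the output above. This is a minimal sketch using the built-in GTID_SUBTRACT() function, with the values copied from Retrieved_Gtid_Set and Executed_Gtid_Set for the source UUID 6235c032-b942-11ed-863e-865d3b297558. The column alias is only illustrative.

```sql
-- Transactions received into the relay log but not yet executed on this replica.
-- Both arguments are copied verbatim from the SHOW REPLICA STATUS output above.
SELECT GTID_SUBTRACT(
         '6235c032-b942-11ed-863e-865d3b297558:1870258-1907832',  -- Retrieved_Gtid_Set
         '6235c032-b942-11ed-863e-865d3b297558:1-1898094'         -- Executed_Gtid_Set (source UUID only)
       ) AS retrieved_not_applied;
```

With these values the remaining range is 1898095-1907832, that is 1907832 - 1898094 = 9738 transactions waiting in the relay log. Rerunning this after refreshing the sets would show whether the applier is making any progress at all.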
All nodes in all clusters are using the same configuration.
I checked the replica on the other cluster, and it is not showing any replication lag. I then checked the replica_parallel_type and replica_parallel_workers variables and noted that replica_parallel_workers is set to the default value of 4.
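The variable check can be reproduced with something like the following. The first two variables are the ones mentioned above. replica_preserve_commit_order was not part of what I checked and is included here only as an assumption that it may be relevant to how the parallel workers wait on each other.

```sql
-- Parallel applier settings on the lagging replica.
SHOW GLOBAL VARIABLES
WHERE Variable_name IN (
  'replica_parallel_type',
  'replica_parallel_workers',
  'replica_preserve_commit_order'  -- assumption: not checked in the original post
);
```

Running the same statement on the healthy replica in the other cluster would confirm that the two replicas really do have identical applier settings.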
This replica has been in this state for a while now, and it is behaving as if it were deadlocked. I’m therefore trying to determine how to fix this without losing any data.
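To see which applier worker is actually holding things up, a read-only look at the Performance Schema may help. This is a sketch that assumes performance_schema is enabled on the replica. It uses the channel name from the status output above and changes nothing on the server.

```sql
-- Per-worker view of the parallel applier for the lagging channel:
-- what each worker is applying, since when, and its current thread state.
SELECT w.WORKER_ID,
       w.SERVICE_STATE,
       w.APPLYING_TRANSACTION,                        -- GTID currently being applied, if any
       w.APPLYING_TRANSACTION_START_APPLY_TIMESTAMP,  -- when that apply started
       t.PROCESSLIST_ID,
       t.PROCESSLIST_STATE,
       t.PROCESSLIST_TIME,                            -- seconds in the current state
       w.LAST_ERROR_NUMBER,
       w.LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker AS w
JOIN performance_schema.threads AS t
  ON t.THREAD_ID = w.THREAD_ID
WHERE w.CHANNEL_NAME = 'awspxc1_to_vmwarepxc1'
ORDER BY w.WORKER_ID;
```

A worker that shows the same APPLYING_TRANSACTION with a steadily growing PROCESSLIST_TIME would be the one the other workers are waiting on, and its GTID identifies the transaction to investigate.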