Master-master asynchronous replication issue between two 5.6.24 PXC clusters

To meet the DR requirements for production, we have set up bidirectional master-master asynchronous MySQL replication between two 3-node PXC clusters. binlog_format=ROW and log-slave-updates are set on each node.
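The relevant replication settings in my.cnf on every node look roughly like this (the binlog base name varies per host):

[mysqld]
log_bin           = <host>-master-bin
binlog_format     = ROW
log_slave_updates = 1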

Server version: 5.6.24-72.2-56-log Percona XtraDB Cluster (GPL), Release rel72.2, Revision 1, WSREP version 25.11, wsrep_25.11

The asynchronous slave runs on node 3 in each cluster.

For instance:

cluster a is the slave of cluster b. The slave runs on node a3.
cluster b is the slave of cluster a. The slave runs on node b3.

On cluster a, if a transaction is executed on node a3, where the asynchronous slave is running, it is replicated to a1, a2, and cluster b without error. However, if a transaction is executed on a1 or a2, where no asynchronous slave is running, the transaction is replicated to the rest of the nodes in cluster a and to cluster b, but the asynchronous slave on a3 then stops with an error when it tries to apply the same transaction again.

The same behavior is observed on cluster b: if a transaction is executed on b1 or b2, the slave on b3 stops, failing to apply the duplicate transaction.

Is master-master asynchronous MySQL replication supported between two PXC clusters?

What can we do to prevent the recursive replication?

Here are examples of the slave errors:

2015-07-07 20:22:12 11098 [ERROR] Slave SQL: Error 'Table 'test' already exists' on query. Default database: 'spxna'. Query: 'create table test ( i int unsigned not null auto_increment primary key, j char(32))', Error_code: 1050
2015-07-07 20:22:12 11098 [Warning] Slave: Table 'test' already exists Error_code: 1050
2015-07-07 20:22:12 11098 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'spxus-slcprdmagdb03-master-bin.000002' position 527

2015-07-07 20:39:10 12272 [ERROR] Slave SQL: Could not execute Write_rows event on table spxna.test; Duplicate entry '1' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log spxus-slcprdmagdb03-master-bin.000003, end_log_pos 569, Error_code: 1062
2015-07-07 20:39:10 12272 [Warning] Slave: Duplicate entry '1' for key 'PRIMARY' Error_code: 1062
2015-07-07 20:39:10 12272 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'spxus-slcprdmagdb03-master-bin.000003' position 408

Thanks.

So you are not using MySQL's GTIDs, which would prevent this kind of issue.
I think the problem is how the server_id variables are set across the nodes. I'd suggest setting server_id to one value on all nodes of cluster A and to a different value on all nodes of cluster B. All 3 nodes in the same cluster may share the same id.
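For example (the id values below are arbitrary, just keep them identical within a cluster and different between the two clusters):

# my.cnf on a1, a2, a3
[mysqld]
server_id = 1

# my.cnf on b1, b2, b3
[mysqld]
server_id = 2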

Thanks a lot! Both solutions worked!

For 5.6, I am wondering whether it is recommended to use GTIDs instead of the binlog file and position. If GTID is enabled, do we only need to run CHANGE MASTER TO MASTER_HOST = <new node in the cluster> to switch the async slave to a different node in the cluster in the event of a master node failure?
Is it possible to use a VIP for master_host?
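In other words, with gtid_mode=ON on all nodes, would switching the slave be just something like this (hostname and credentials are placeholders)?

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'a2.example.com',   -- placeholder: another node in the same cluster
  MASTER_USER = 'repl',             -- placeholder replication account
  MASTER_PASSWORD = 'replpass',
  MASTER_AUTO_POSITION = 1;         -- GTID auto-positioning, no binlog file/position needed
START SLAVE;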

Thanks again.