Async replication breaking after another node joins the cluster

Hello,

I’m struggling to set up async replication between two Percona XtraDB Cluster 5.7.31 clusters. Each cluster has 3 nodes.

If I start only one node of the secondary cluster and start replication, it works fine. But as soon as I start another node and it joins after performing an SST, the async replication breaks with a duplicate key error.

At first I thought the cause was that the SST also copies the slave information, so the slave gets started on the new node too. I added the skip-slave-start option to my.cnf, but it keeps failing.

I have also tried starting the cluster in read-only mode, just to be sure nobody else writes, but I still have the same problem.

The last thing I tried was setting IGNORE_SERVER_IDS in the CHANGE MASTER statement, but it still keeps failing.

The CHANGE MASTER statement is this one:

CHANGE MASTER TO
MASTER_HOST = 'MY_SERVER_IP',
MASTER_PORT = 3306,
MASTER_USER = 'replic',
MASTER_PASSWORD = 'what_a_passwd',
MASTER_AUTO_POSITION = 1, # to auto-position using GTID on slave
IGNORE_SERVER_IDS = (1,2,3);

For the record, both clusters have GTID enabled.

I guess I’m missing something, but I cannot see what. Any ideas or suggestions are more than welcome.

Thanks in advance,
Adrián

Hi @acampoh, thanks for posting to the Forums.

I have used @yves.trudeau’s wonderful Replication Manager between two PXC 5.7 clusters.

In fact, there is a section on ensuring SSTs are done, as well as steps to ensure there is only one GTID sequence in the cluster.

Best of luck,

Just to understand: when the 2nd node joins the secondary cluster, does ‘SHOW SLAVE STATUS’ on that node actually report information?

Yes, it reports all the replication info on that node.

OK. That is most likely because the SST process uses xtrabackup, which clones the entire $DATADIR, including the master.info file. It is surprising that skip-slave-start is not having any effect; that would seem to be a bug in base MySQL. Can you try setting server_id=0 on the joining node along with skip-slave-start? 0 means “cannot be slave nor master”.

Hello,

I just found the issue. It wasn’t really related to replication; it had more to do with the startup process of the 2nd and 3rd nodes of the cluster.

My mistake was assuming that, with the first node set to read-only, the other nodes would inherit the read-only setting when they did the SST and joined. They didn’t. Since we have Sentry running there, it was writing to those nodes and causing a duplicate key error every time one of them started.

I fixed it by setting the read_only option in my.cnf.
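For anyone who finds this later, here is a minimal sketch of what the read_only fix might look like in my.cnf. It assumes the setting goes in the [mysqld] section on every node of the secondary cluster, so that a node coming up after an SST is read-only from the start. The skip-slave-start line is the option mentioned earlier in the thread; the rest of the file is not shown.

```ini
# my.cnf on each node of the secondary cluster (sketch, not the full file)
[mysqld]
# Reject writes from ordinary client connections on this node.
# Replication threads are still allowed to apply changes.
read_only = ON
# Do not start the replication threads automatically at server startup.
skip-slave-start
```

Note that read_only does not restrict accounts holding the SUPER privilege, so this only helps if the application user does not have it. After a restart, SHOW GLOBAL VARIABLES LIKE 'read_only'; on each node should confirm the setting.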

Thanks a lot!
