Slave replication failure after full SST on master

Hello,

I have a setup consisting of a 3-node PXC cluster running 5.7.33; let’s call the nodes my01, my02, my03. The cluster itself is single-master, using ProxySQL to direct all writes to a single server. In addition, my03 replicates to a fourth, standalone MySQL server in a different location using MySQL GTID-based replication. Let’s call this fourth server myreplica01.

This setup works perfectly until an SST is done to my03 for some reason. At that point replication stops working and I have to configure it again from scratch (full backup on my03, move the backup to myreplica01, restore, start replication).

Slave status shows something like this:

Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'f2d83a3e-570b-ee18-5d2e-596888ce4869:1016331531' at master log mysql-bin.008521, end_log_pos 197880. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

This behavior makes some sense, as the binlogs on my03 are reset by the SST, but given that the data itself is the same, is there a way to keep master-slave replication working even after an SST?
Is there a faster way to restore replication without setting it up from scratch and without losing any transactions/data?

Thanks a lot

Hi, the way to accomplish this is to enable GTID. That way, if the replication source has an issue, you can easily repoint the replica to a different source. See How to setup a PXC cluster with GTIDs for details.
Beginning with MySQL 8.0.22 you can also configure automatic connection failover, so no manual repointing is needed anymore.
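
For reference only (this does not apply to the 5.7 setup in this thread, and the host names, weights, and channel below are just placeholders), on an 8.0.22+ replica the automatic connection failover is configured roughly like this:

-- Run on the replica. '' is the default channel; hosts and weights are examples.
STOP REPLICA;
SELECT asynchronous_connection_failover_add_source('', 'my03', 3306, '', 80);
SELECT asynchronous_connection_failover_add_source('', 'my02', 3306, '', 60);
-- 8.0.23+ spelling shown; on 8.0.22 the equivalent statement is CHANGE MASTER TO.
CHANGE REPLICATION SOURCE TO
  SOURCE_AUTO_POSITION = 1,
  SOURCE_CONNECTION_AUTO_FAILOVER = 1;
START REPLICA;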

Configure server_uuid for each PXC node instead of letting MySQL pick one at random each time an SST happens. This should keep the GTID set the same, even after a restore.
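
A quick read-only check (works on 5.7) to see whether the node’s identity, and therefore the GTIDs it generates, survived a restore:

SELECT @@server_uuid;           -- should stay the same across SST/restore
SELECT @@GLOBAL.gtid_executed;  -- locally generated GTIDs carry this UUID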

The better question to ask is “why is my03 executing an SST?” That should not be happening under normal operations. If my03 goes down and comes back up, it should attempt IST first. If IST isn’t working, check your Galera gcache size.
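
For illustration, the gcache is sized through wsrep_provider_options in my.cnf; 2G below is only an example, pick a value large enough to cover the writes generated while a node is down:

[mysqld]
# Example value only. If other provider options are already set, append
# gcache.size inside the same quoted string, separated by semicolons.
wsrep_provider_options="gcache.size=2G"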

The value of server_uuid is placed in $datadir/auto.cnf, and that file is removed by SST, so MySQL creates a new server_uuid when it starts after the SST. I don’t know how to set a fixed server_uuid.

This time the SST happened because I was messing with the compressor/decompressor options and forgot to install zstd on this node :frowning: IST is used every time I have had to reboot this node, so state transfer is otherwise working fine.
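
For reference, with the xtrabackup-v2 SST those options live in the [sst] section of my.cnf, and the binary they name must be installed on both donor and joiner (the exact zstd commands below are just an example):

[sst]
# zstd must exist on both donor and joiner, otherwise the SST fails.
compressor='zstd'
decompressor='zstd -d'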

You can set server_uuid in the my.cnf

It seems I can’t, at least on Ubuntu 22.04 with Percona MySQL 5.7.43. I may be doing something wrong, but:

  • If set in /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf as:
server_uuid=some-uuid

MySQL does not start and complains about an unknown variable server_uuid.

  • When set in /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf as:
[auto]
server_uuid=some-uuid

MySQL starts, but SELECT @@server_uuid shows the UUID from $datadir/auto.cnf instead of the one in mysqld.cnf.

  • If I remove $datadir/auto.cnf and restart MySQL, a new auto.cnf file is created with a different UUID.

The documentation does not mention that server_uuid can be read or set anywhere other than auto.cnf. Given that this file is deleted during an SST, I see no way to keep the same server_uuid and thus avoid rebuilding the replication from scratch if an SST happens on the master.

You are correct that server_uuid doesn’t work inside my.cnf, but it does work inside auto.cnf in my testing on 5.7.38: SELECT @@server_uuid matched what was in auto.cnf.
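
For reference, a minimal auto.cnf looks like the snippet below (the UUID is made up; generated files may spell the key server-uuid, with a hyphen, but both spellings are commonly seen):

[auto]
server_uuid=6f9e9f3a-1d2b-11ee-8c00-0242ac120002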

In any case, since this file is replaced by SST, we are back to the original issue.

If myreplica01 is using MASTER_AUTO_POSITION=1, then when replication breaks from my03 -> myreplica01, you should be able to go to myreplica01 and simply CHANGE MASTER to my02 and resume. Because GTIDs are in use everywhere, the same GTID set should exist on my02, and replication should resume.
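
A minimal sketch of that repoint, run on myreplica01, using the thread’s node names and placeholder credentials (5.7 syntax):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'my02',            -- new source; user/password are placeholders
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'replica_password',
  MASTER_AUTO_POSITION = 1;        -- resume from the replica's own gtid_executed set
START SLAVE;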