MySQL Failover Query

Hi Team,

Does it really required to rebuild the old primary always after failover of primary to local replica in case of async replication.

Can you share any document with scenarios explained for such failovers

Thanks

Hello @Hussain_Patel,
If you replicate like this A → B, and then failover to B, no writes will go back to A. Thus, you must always rebuild A because B now has data that is missing from A.

You can setup Source/Source replication, A<–>B, and in this case, when you failover to B, A will remain in sync. Even if A goes offline for a reboot, or even if you completely rebuild and restore A, replication will resume.

The best option would be to implement Percona XtraDB Cluster, or Group Replication which are MySQL clustering solutions. Both of these will automatically rebuild a failed node when it rejoins the cluster.

Hello @matthewb

Can we point A to B after B takes over primary role after failover. Do we need to always rebuild A?
Example if GTID’s on replica B before failover matches A after A comes back up? Do we need to rebuild?

If you have GTID enabled, then you should be able to use auto-position to point A (replica) to B (source), sure. Writes on B will generate different GTID, but the combination of A’s GTID and B’s GTID merge together. I would setup S/S with GTID.

Thanks Matthew. Appreciate it.

Hi @matthewb

Failed node has to be discarded and rebuild is required.
Failover is different from switchover and is not straightforward/simple as pointing Auto-position back to A.

Failed node can have uncommitted and committed transaction in Async replication and uncommitted transactions in semi-sync replication.

So even in semi-sync, it is recommended to discard and rebuild on an event of failure.

Please check this: https://dev.mysql.com/doc/refman/8.0/en/replication-semisync.html

" With semisynchronous replication, if the source crashes and a failover to a replica is carried out, the failed source should not be reused as the replication source, and should be discarded. It could have transactions that were not acknowledged by any replica, which were therefore not committed before the failover.

If your goal is to implement a fault-tolerant replication topology where all the servers receive the same transactions in the same order, and a server that crashes can rejoin the group and be brought up to date automatically, you can use Group Replication to achieve this. For information, seeChapter 20, Group Replication."

Thank you
Chanakya

We don’t care about uncommitted transactions anywhere. If the app sends a trx, and commits, and never receives a reply due to the server crashing, then the app should treat that as failed trx, and should repeat the trx on the new node.

A failed node with committed trx still has a GTID associated with that trx since it did commit. Thus, when you bring failed node back online, the other node will request any missing GTIDs from the failed node, which would bring both nodes in sync.

Semi-sync is deprecated and will be removed from MySQL, so that’s no longer a concern.

And, if you see above, I did recommend PXC/GR as better solutions, but that does not negate the fact that you can reuse a failed node with async in most situations.

I requested to share docs on semi sync deprecating in a different thread, I couldn’t find any. please share you could.

Yes we do not care about uncommitted transactions but when the crash recovery happens, failed possibly will have transactions that are not present in the promoted node right ? That will end up in data inconsistency if added back without a rebuild. I am sure this is why official MySQL documentation also recommends to discard failed node.

In an Async case, committed transactions on failed that were treated as “no response” from app or even if there is response but never reached any of the replicas can also cause promoted node to not have the data when the failed node comes backup. Either of those cases forces a rebuild to add the node back into the replication pool.