We have two servers set up in master-master replication (with all traffic sent to the active one, and the second one being passive, so we never have the situation where both are writing to the database).
Both are running 5.6.41. Replication is with binlog_format=ROW. Up to now, we have rarely had replication problems with percona. On the very rare times when we have had replication break, we have a script that will rebuild the passive master from the active one:
- stop slave on both
- dump the active master with:
mysqldump --single-transaction --flush-logs --master-data=2 --all-databases --routines --triggers --add-drop-table --add-drop-trigger > file.sql
- load the file on the passive master:
mysql < file.sql
- run ‘show master status’ on both
- plug the master status values from the active master into the passive one, and those from the passive host into the active one
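In case it helps, the whole rebuild procedure looks roughly like this (host names, credentials, and the binlog file/position values below are placeholders, not our real ones):

```shell
# Sketch of our rebuild procedure; host names and the binlog
# file/position values are placeholders.

# 1. Stop replication on both masters.
mysql -h active-master  -e "STOP SLAVE;"
mysql -h passive-master -e "STOP SLAVE;"

# 2. Dump the active master.  --master-data=2 records its binlog
#    coordinates as a comment at the top of the dump.
mysqldump -h active-master --single-transaction --flush-logs \
    --master-data=2 --all-databases --routines --triggers \
    --add-drop-table --add-drop-trigger > file.sql

# 3. Load the dump on the passive master.
mysql -h passive-master < file.sql

# 4. Note the current coordinates on each side.
mysql -h active-master  -e "SHOW MASTER STATUS\G"
mysql -h passive-master -e "SHOW MASTER STATUS\G"

# 5. Cross-wire the coordinates: the active master's File/Position go
#    into CHANGE MASTER TO on the passive host, and vice versa.
mysql -h passive-master -e "CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=120; START SLAVE;"
mysql -h active-master  -e "CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000456', MASTER_LOG_POS=120; START SLAVE;"
```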
So it’s a full rebuild, and it’s worked for quite a while (we did this back when we were still running stock MySQL, which had lots of replication problems… we’ve only had to do it a couple of times with percona).
Percona has been very stable for us for several years, but a few months ago, we had to switch from binlog_format=STATEMENT to ROW (a webapp required it, and it seemed that ROW replication was the preferred way to go). After the change, percona continued to be stable.
Unfortunately, about a week ago, we restarted percona (we did OS security patches, and we always reboot hosts as part of that). The active host recovered fine, but something about the passive one failed, and we got these row replication errors. I’ve been doing a lot of research over the past week trying to figure out what’s going on, and I’ve run well past my level of expertise.
Anyway, slave status on the active master comes back fine, but on the passive one, we’re getting errors:
Last_Errno: 1032
Could not execute Delete_rows event on … Can’t find record in … Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND;
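For what it’s worth, I’ve been trying to see which rows the failing Delete_rows event touches by decoding the relay log on the passive master — something like the following, where the relay log name and start position are placeholders taken from SHOW SLAVE STATUS:

```shell
# Decode the failing row event in human-readable form; the relay log
# path and --start-position value are placeholders from SHOW SLAVE STATUS.
mysqlbinlog --verbose --base64-output=decode-rows \
    --start-position=4 \
    /var/lib/mysql/relay-bin.000042 | less
```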
Could someone give me some guidance on what I need to do to reestablish replication (and hopefully prevent this from happening again)?
I’ve considered:
- upgrading to 5.7
- switching to GTID replication
but I haven’t found anything that seemed to indicate that they had a strong chance of correcting the problem, so I haven’t wanted to introduce another variable into the mix.