Unable to fix broken replication

We have two servers set up in master-master replication (with all traffic sent to the active one, and the second one being passive, so we never have the situation where both are writing to the database).

Both are running 5.6.41. Replication is with binlog_format=ROW. Up to now, we have rarely had replication problems with percona. On the very rare times when we have had replication break, we have a script that will rebuild the passive master from the active one:

  1. stop slave on both
  2. dump the active master with:
    mysqldump --single-transaction --flush-logs --master-data=2 --all-databases --routines --triggers --add-drop-table --add-drop-trigger > file.sql
  3. load the file on passive master:
    mysql < file.sql
  4. running ‘show master status’ on both
  5. plug the master status values fro the active master into the passive, and those from the passive host into the active one

So it’s a full rebuild, and it’s worked for quite a while (we did this when we were still running mysql which had lots of replication problems… we’ve only had to do it a couple times with percona).

Percona has been very stable for us for several years, but a few months ago, we had to switch from binlog_format=STATEMENT to ROW (a webapp required it, and it seemed that ROW replication was the preferred way to go). After the change, percona continued to be stable.

Unfortunately, about a week ago, we restarted percona (we did OS security patches, and we always reboot hosts as part of that), and the active host recovered fine, but something about the passive one failed, and we had got those row replication errors. I’ve been doing a lot of research over the past week trying to figure out what’s going on, and I’ve run way past my level of expertise.

Anyway, slave status on the active master comes back fine, but on the passive one, we’re getting errors:

Last_Errno: 1032
Could not execute Delete_rows event on … Can’t find record in … Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND;

Could someone give me some guidance on what I need to do to reestablish replication (and hopefully prevent this from happening again)?

I’ve considered:
upgrading to 5.7
switching to GTID replication
but I haven’t found anything that seemed to indicate that they had a strong chance of correcting the problem, so I haven’t wanted to introduce another variable into the mix.

Hello SullyBeck thanks for posting.

The implication is (you’ll already realize, I know!) that your data is not in sync. There are a number of potential reasons: human error during switchover, unsafe queries, inserting data on wrong server, application error, mysql bug… and so on. The reason is likely to be peculiar to your application and environment.

To rectify the issue then you’ll need to look at pt-table-checksum and pt-table-sync from the Percona Toolkit - see here: https://www.percona.com/doc/percona-toolkit/LATEST/pt-table-checksum.html and https://www.percona.com/doc/percona-toolkit/LATEST/pt-table-sync.html

As always, test these somewhere first so that you can understand what they are doing before using them on your production databases.

The difficulty is that we have webapps running which require ROW replication, and the pt-table-* tools require STATEMENT replication. I investigated switching (even temporarily) to STATEMENT replication, and decided that wasn’t viable (it may become viable if this problem persists). As a result, it was actually more useful to us to just do a full replacement of the second master (the one which is NOT receiving requests, and also the one for which replication is broken). I can’t tell why a complete rebuild (using the procedure I outlined in the first post) would create a clean replica to start with.

As a sidenote… it would really be wonderful if the pt-table-* tools could support ROW based replication, especially since it seems to be the preferred method of replication at this point!

Thanks for that update SullyBeck I’ll see if any of the tech team have anything they can add …

It seems like the STATEMENT limitation in pt-table-checksum is down to the way MySQL internals… there’s a discussion about that here (you have probably seen), it looks like our dev’s hands are tied on that one, sadly https://jira.percona.com/browse/PT-763

Yes, I understand that for the time being, the pt-table-* tools aren’t going to be of assistance, so really, at this point, I’m not so interested in diagnosing the specific data that is missing. To be honest, the problems are always on the passive server, so I’d be perfectly happy at this point to simply rebuild the passive host’s database and reestablish replication. The problem is that that is what I thought I was doing… but I immediately get the row errors I mentioned.

Am I missing some option in the mysqldump command that will ensure that when I load it into the passive host, it will completely clear out the old data? Alternately, do I need to completely delete the instance on the passive host and rebuild it? If so, that would seem to say to me that somewhere in the percona instance is some data that is corrupt that isn’t getting reinitialized using the steps I outlined.

I’d like to understand what is going on AND how to correct it at this point.

Hi SullyBeck,
There is absolutely no requirement whatsoever that you change your MySQL replication from ROW to STATEMENT when using pt-table-checksum/pt-table-sync. These tools will manage this automatically as binlog format can be changed at the session level. The tools will change it as necessary when executed, and only change it for their session. MySQL is perfectly capable of processing a binary log with mixed formatting.

As a side note, these two tools will probably never support ROW because they cannot operate otherwise without massive code rewrite. If you look at the debug output (export PTDEBUG=1) when running pt-table-checksum (be aware it will be TONS and TONS of output), you will see the SQL that is used to generate the actual checksum values. If ROW was used, then the REPLACE statement would replicate and insert to the slave using the values calculated on the master. Thus, you would never see a differences on any slaves.

5.7 is currently recommended in all cases. GTID, my opinion, stay away from. It’s more difficult to diagnose than coordinate-based.

If the passive slave cannot delete a row due to it being missing (as your example above states), I would simply skip it and move on. STOP SLAVE; SET GLOBAL sql_slave_skip_counter=1; START SLAVE; If it cannot update because of missing row (the error would be different), that means missing data too. Only option is to rebuild. Highly suggest you switch to using xtrabackup over mysqldump. Check out our docs for how to create a streaming backup.

Thanks for the reply. I didn’t understand that pt-table-* tools would work in my setup. I’ll look into that a bit more.

I DID try the sql_slave_skip_counter thing… I actually wrote a trivial script to ‘skip_counter=1, restart, wait 1 second, retry if slave status not yet correct’, and I was able to briefly reestablish replication, but after a short time (typically less than 15 seconds), a similar error would show up. I’d typically end up skipping between 5-20 times, replication would seem to work, and then after a bit break again. I ran this quite a few times just to see if I could get rid of all of the errors that way, but after doing this 10+ times, I decided it wasn’t going to work.

I’m going to switch to begin the process of switching to 5.7. Also, I’ll try using xtrabackup and see if those backups fix the problem.

You’ll continue to get broken replication as long as the data is different between the two masters. Make sure that your passive master is super_read_only=1 to prevent any/all accidental writes which could affect replication and data drift.