Newbie Question: MySQL 5.7 Slave Tables Out of Sync

Hello,

This is my first post and I am reaching out to the Percona community to provide me some guidance with a problem I am having.

We are our Master / Slave replication. Two of our Drupal tables have fallen out of synch. The tables are “users” and “users_roles”. Replication halts when an update event is made on a record within the users table that while exists on Master, it is not on Slave. By our count on Slave, the “users” table is missing 300 records while users_roles is missing 10. All other tables within the database appear to be in sync. There are no sync issues with other databases on this Master that are being replicated to this Slave. Below is the error from a custom monitoring script I created to detect this:


Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Error: Could not execute Update_rows event on table adid_prod_d7.users; Can't find record in 'users', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.017097, end_log_pos 165292698
Seconds_Behind_Master: NULL
Last_IO_Error: 
Last_SQL_Error: Could not execute Update_rows event on table adid_prod_d7.users; Can't find record in 'users', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.017097, end_log_pos 165292698

To remediate this problem, I manually stop the slave process (STOP SLAVE), set the global skip counter to 1 and then start the slave. This works until the next time this happens.

This issue first happened back in October and to remediate the problem I followed the directions from this Percona article: https://www.percona.com/blog/2008/06/29/resyncing-table-on-mysql-slave/. This worked as the “users” and “users_roles” tables had an equal number of rows but this week the same problem occurred several times and much to our surprise the same two tables (“users” and “users_roles”) were out of sync again. When I followed the directions to this article, I made sure I took the websites offline so that nothing was being written or read from the Master.

I have downloaded the Percona toolkit and have started to look at the documentation. I will admit that I do not know what to look for when running the pt-mysql-summary command or how this toolkit will affect my production environment. Will I have to notify my customers that my sites will be offlinett to run this toolkit? Besides backing up the databases prior to invoking pt-table-sync, what do I need to do beforehand and what are the best options to use.

Thus far I am thinking the command will be:

pt-table-sync --dry-run --sync-to-master --replicate -h slavedb

I am looking for some guidance here.

Thanks

Hello, pt-table-sync is the right tool here and should help you fix the data drift with no downtime to your production environment. Have a look at https://www.percona.com/blog/2015/08/12/mysql-replication-primer-with-pt-table-checksum-and-pt-table-sync/ for some examples.

As for preventing the problem from happening again, I suggest you start by making sure full durability is enabled on the primary (sync_binlog=1, innodb_flush_log_at_trx_commit=1) and also enable crash-safe replication on the replica (see https://www.percona.com/blog/2013/09/13/enabling-crash-safe-slaves-with-mysql-5-6/)

Hope that helps

Thank you for your reply. It really gave me a lot of clarity. I am concern than when running pt-table-checksum to help determine out of sync issues, is forcing me to run with the --no-check-binlog-format option with the warning that this could break replication. What is at issue here and how likely could such a thing happen? I am using version 3.2.1.

Running with --no-check-binlog-format won’t break replication. I think you might be confusing with the option --no-check-replication-filters which is the more risky one.

For more insight you can read https://www.percona.com/blog/2018/11/22/caveats-pt-table-checksum-using-row-based-replication-and-filters/.

I was incorrect in my question. What I meant was in reference to this warning I got. My question should have been about a concern that pt-table-checksum could break my replication.

Replica db02 has binlog_format ROW which could cause pt-table-checksum to break replication. Please read “Replicas using row-based replication” in the LIMITATIONS section of the tool’s documentation. If you understand the risks, specify --no-check-binlog-format to disable this check.

I think this is just a generic warning message that is printed but doesn’t apply for the --no-check-binlog-format. I am not aware of any scenario where replication can break due to binlog_format being different

Thank you once again. This will help me a lot in reviewing again the documentation.