I have come across a problem in which pt-table-checksum is reporting false negatives, i.e. failing to report differences in even very small tables.
The situation:
The customer has an existing MySQL cluster consisting of two MySQL 5.5.32 master-slave pairs, with master-master replication between the two masters. If we call the pairs A, B and C, D, then A and C are the masters and the replication topology is: B<-A<=>C->D
There is also a new five-node XtraDB 5.6.22-72 cluster, with a single asynchronous replication slave for backups. Node 1 of the cluster, for now, replicates asynchronously from node C in the above topology, and will do so until migration is accomplished. To (hopefully) ensure compatibility and consistency with the cluster, the binary log format on A-B-C-D has been set to ROW, since the cluster's internal replication is and must remain ROW. Due to the sheer volume of traffic the customer is processing, replication between A and C now routinely falls behind by as much as an hour during the day, with obvious impacts on the cleanliness of the data.
To validate that data on the cluster matches that on the production servers before attempting migration to the new cluster, the customer is running pt-table-checksum on node C. pt-table-checksum is of course setting SESSION BINLOG_FORMAT to STATEMENT; equally obviously, this is not propagating past node A, so checksums reported from B cannot be trusted. That's OK. We don't actually care about checksums from B. What we care about is that the cluster matches the data on C, which has been declared the authoritative copy. And that should be fine for pt-table-checksum, because there is only a single replication link between node C and cluster node 1, so checksums between C and cluster node 1 should be accurate.
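To spell out the mechanism I am relying on: as I understand it, the checksum query has to reach the replica as a statement so that the replica recomputes the checksum against its own data; if it arrives as a row event, the replica simply stores the master's result and every chunk appears to match. The following is only a toy Python sketch of that reasoning, not pt-table-checksum's actual code. The names `chunk_crc`, `replicate_checksum` and `has_diff`, and the sample rows, are hypothetical, and `zlib.crc32` is a stand-in for whatever checksum function the tool really uses.

```python
import zlib


def chunk_crc(rows):
    """Toy per-chunk checksum (stand-in, not the tool's real function)."""
    return zlib.crc32(repr(sorted(rows)).encode("utf-8"))


def replicate_checksum(master_rows, replica_rows, binlog_format):
    """Model what the replica ends up storing for one chunk."""
    master_crc = chunk_crc(master_rows)
    if binlog_format == "STATEMENT":
        # The replica re-executes the checksum query on its own data.
        this_crc = chunk_crc(replica_rows)
    elif binlog_format == "ROW":
        # The replica receives the row the master wrote, CRC included,
        # and never looks at its own copy of the table.
        this_crc = master_crc
    else:
        raise ValueError("unsupported binlog_format: %r" % binlog_format)
    return {"master_crc": master_crc, "this_crc": this_crc}


def has_diff(result):
    """A chunk is flagged only when the two stored checksums differ."""
    return result["master_crc"] != result["this_crc"]


# Hypothetical data: the replica's copy differs in one row.
master_rows = [(1, "a"), (2, "b")]
replica_rows = [(1, "a"), (2, "X")]

stmt_result = replicate_checksum(master_rows, replica_rows, "STATEMENT")
row_result = replicate_checksum(master_rows, replica_rows, "ROW")
print("STATEMENT detects diff:", has_diff(stmt_result))
print("ROW detects diff:", has_diff(row_result))
```

In this model the difference is only caught when the checksum query is replicated as a statement, which is why the single link from C to cluster node 1 ought to be enough.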
Unfortunately, the checksums between C and cluster node 1 are not accurate. pt-table-checksum is reporting tables as having zero diffs and matching checksums between C and cluster node 1, when we can look at the two tables side by side and see at a glance that they are different. This is alarming, because if pt-table-checksum is failing to report diffs that we know exist, we cannot trust what it tells us about any of the other data. We cannot manually compare almost a terabyte of DB data, and the production environment cannot be taken offline to check all of the data. (Nor can it be taken offline to update it.)
Can anyone shed any light on why pt-table-checksum, in this configuration, is throwing false negatives?