False negatives from pt-table-checksum

I have come across a problem in which pt-table-checksum is reporting false negatives, i.e. failing to report differences even in very small tables.

The situation:

The customer has an existing MySQL replication setup consisting of two MySQL 5.5.32 master-slave pairs, with master-master replication between the two masters. If we call the pairs A,B and C,D, then A and C are the masters and the replication topology is: B <- A <=> C -> D
There is also a new five-node Percona XtraDB Cluster (5.6.22-72), with a single asynchronous replication slave for backups. Node 1 of the cluster, for now, replicates asynchronously from node C in the topology above, and will continue to do so until the migration is complete. To (hopefully) ensure compatibility and consistency with the cluster, binlog_format on A-B-C-D has been set to ROW, since the cluster’s internal replication is, and must remain, ROW. Due to the sheer volume of traffic the customer is processing, replication between A and C now routinely falls behind by as much as an hour during the day, with obvious impacts on the cleanliness of the data.

To validate that data on the cluster matches that on the production servers prior to attempting migration to the new cluster, the customer is running pt-table-checksum on node C. pt-table-checksum is of course setting SESSION BINLOG_FORMAT to STATEMENT; equally obviously, this is not propagating past node A, so checksums reported from B cannot be trusted. That’s OK. We don’t actually care about checksums from B. What we care about is that the data on C, which has been declared the authoritative copy of the data, and the cluster match. And that should be fine for pt-table-checksum, because there is only a single replication link between node C and cluster node 1, so checksums between C and cluster node 1 should be accurate.

Unfortunately, they are not. pt-table-checksum is reporting tables as having zero diffs and matching checksums between C and cluster node 1, when we can put the two tables side by side and see at a glance that they are different. This is alarming: if pt-table-checksum is lying to us about diffs that we know exist, we cannot trust what it tells us about any of the other data. And we cannot manually compare almost a terabyte of DB data, nor can the production environment be taken offline to check it all (or to update it).

Can anyone shed any light on why pt-table-checksum, in this configuration, is throwing false negatives?

I’ve dug further into this, and this problem is serious. It turns out it has nothing to do with replication; the query that pt-table-checksum is using to calculate chunk checksums is flawed.

The checksum query, as extracted from a binary log, is:
SELECT 'schema name', 'table name', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM schema.table;
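
If you want to fish the same statement out of your own binlog, something along these lines should work. The binlog file name below is just a placeholder, and the grep pattern assumes the default percona.checksums table that pt-table-checksum writes its REPLACE ... SELECT statements into:

mysqlbinlog mysql-bin.000123 | grep -i -A 3 checksums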

Now, let’s look at a sample table. Let’s say it has this set of rows on server 1:

+----+--------+--------------+------------+
| id | lbc_id | rebill_depth | percentage |
+----+--------+--------------+------------+
| 21 |      1 |            1 |        100 |
| 23 |      1 |            2 |        100 |
| 25 |      1 |            3 |        100 |
| 27 |      1 |            4 |        100 |
| 29 |      1 |            5 |        100 |
+----+--------+--------------+------------+

And this set on server 2:

+----+--------+--------------+------------+
| id | lbc_id | rebill_depth | percentage |
+----+--------+--------------+------------+
| 21 |      1 |            1 |        100 |
| 22 |      1 |            2 |        100 |
| 24 |      1 |            3 |        100 |
| 26 |      1 |            4 |        100 |
| 28 |      1 |            5 |        100 |
+----+--------+--------------+------------+

Notice that the data values are the same, but the row IDs are different.
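
For anyone who wants to reproduce this at home, here is a minimal sketch of the two row sets. The table name and column types are my own guesses, not the customer’s actual definition:

-- hypothetical reconstruction of the example table; name and types are assumed
CREATE TABLE rebill_percentages (
  id           INT UNSIGNED NOT NULL PRIMARY KEY,
  lbc_id       INT UNSIGNED NOT NULL,
  rebill_depth INT UNSIGNED NOT NULL,
  percentage   INT UNSIGNED NOT NULL
);

-- "server 1" rows
INSERT INTO rebill_percentages VALUES
  (21, 1, 1, 100), (23, 1, 2, 100), (25, 1, 3, 100), (27, 1, 4, 100), (29, 1, 5, 100);

-- "server 2" rows (load these into a second copy of the table instead)
INSERT INTO rebill_percentages VALUES
  (21, 1, 1, 100), (22, 1, 2, 100), (24, 1, 3, 100), (26, 1, 4, 100), (28, 1, 5, 100);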

Now, let’s take that query starting from the inside. We’ll skip a few steps and go as far as SELECT CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED) FROM schema.table.
Server 1:

+---------------------------------------------+
| CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED) |
+---------------------------------------------+
|                                  4252452532 |
|                                   326707096 |
|                                  4196041389 |
|                                   336913281 |
|                                  4088188550 |
+---------------------------------------------+

Server 2:

+---------------------------------------------+
| CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED) |
+---------------------------------------------+
|                                  4252452532 |
|                                  1685985038 |
|                                  2367533627 |
|                                  1662243607 |
|                                  2225864208 |
+---------------------------------------------+

OK, this is as we expect: the first row checksum matches, and the remaining four don’t. So now let’s go out one set of parentheses and apply the BIT_XOR.

Server 1:

+------------------------------------------------------+
| BIT_XOR(CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED)) |
+------------------------------------------------------+
|                                           4088188550 |
+------------------------------------------------------+

Server 2:

+------------------------------------------------------+
| BIT_XOR(CAST(CRC32(CONCAT_WS('#', id)) AS UNSIGNED)) |
+------------------------------------------------------+
|                                           4088188550 |
+------------------------------------------------------+

Ummmmmmmmmmmm… Houston, WE HAVE A PROBLEM.

I note that CRC32 is being used here, and pt-table-checksum itself warns that CRC32 “is too prone to hash collisions”. But CRC32 is not the problem here. We are getting distinct sets of checksums out of CRC32. The problem is that we are getting distinct sets of CRC32 checksums that BIT_XOR to the same result. And from that point on, nothing outside the BIT_XOR matters: the CONV, LOWER, and COALESCE are all operating on identical values, so the final crc values are identical too.
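
You can verify the collision without building the tables at all. MySQL’s ^ operator is a bitwise XOR, which is exactly what BIT_XOR does across rows, so XORing the per-row CRC32 values above directly gives the same aggregates:

SELECT 4252452532 ^ 326707096 ^ 4196041389 ^ 336913281 ^ 4088188550 AS server1_xor,
       4252452532 ^ 1685985038 ^ 2367533627 ^ 1662243607 ^ 2225864208 AS server2_xor;

Both columns come back as 4088188550, the value shown above.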

As a footnote, if we compare only the first FOUR rows of the table, we get different BIT_XOR results, as we would expect (see the quick check below). But with five rows, the data values turn out to be just right to return the same BIT_XOR for both sets of rows. BIT_XOR is effectively being used here as a hash function in its own right, and it is a terribly, terribly weak one.
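
The same kind of quick check bears this out; limited to the first four rows of each set, the aggregates differ:

SELECT 4252452532 ^ 326707096 ^ 4196041389 ^ 336913281 AS server1_xor_4rows,
       4252452532 ^ 1685985038 ^ 2367533627 ^ 1662243607 AS server2_xor_4rows;

Server 1’s first four CRC32 values happen to cancel out to 0 entirely, while server 2’s XOR to 1996959894, so a chunk covering only those four rows would have been flagged as different.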

This is JUST ONE example table for which this occurs. The customer I am working with has numerous such tables.

It should also be noted that using a stronger hash function such as MD5 or murmur will GREATLY REDUCE the chance of a BIT_XOR collision, but it will not eliminate it. It is vital to understand that the weakness being demonstrated here is not the “real” hash function at all; it is the XOR being used as a hash function. And the smaller the number of rows in the table, the more likely an XOR collision is.
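
For what it’s worth, pt-table-checksum lets you pick the row hash with its --function option, so switching away from CRC32 is a one-flag change; the host and user in this sketch are placeholders, and the XOR aggregation across rows stays in place regardless:

pt-table-checksum --function=MD5 --databases=schema h=nodeC,u=checksum_user --ask-pass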