pt-table-checksum questions about algorithm


excuse me for my bad Perl knowledge, I was not able to read and understand pt-table-checksum code thoroughly so my questions might be quite simple. Here they are.

  1. It issues $ub_sql query to get {upper} boundary for current chunk and {next_lower} boundary for next chunk if available. Does it check replicas have no data between {upper} and {next_lower}?

  2. Issuing $ub_sql and then $checksum_dml partially reads same data twice (LIMIT with offset has to pass that many rows). Considering master server it is run on is often the main production server and is quite IO busy/bound it is either resource intensive or time consuming (or both). Why cannot it issue $checksum_dml only but with $ub_sql LIMIT instead of WHERE boundaries, log max boundary to percona table and check it is the same across all replicas? This will change O(N) complexity for running $ub_sql to O(1) complexity for running one row select from percona table.

I mean query like this (for simplicity I assume index single column):

REPLACE INTO percona.checksums (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc)
SELECT ‘schema’, ‘table’, ‘chunk’, ‘chunk_index’, ‘lower_boundary’,
MAX(index_field) AS upper_boundary,
COUNT(*) AS cnt,
COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS(’ #’, field1, …, fieldN, CONCAT(ISNULL(field1), …, ISNULL(fieldN)))) AS UNSIGNED)), 10, 16)), 0) AS crc
FROM schema.table FORCE INDEX(chunk_index)
WHERE index_field>=:upper_boundary_of_previous_chunk:
ORDER BY index_field
/checksum chunk/

At least one row which satisfies WHERE index_field=:upper_boundary_of_previous_chunk: will be read twice this way. This won’t spoil checksum results but is unnecessary overhead. I’m quite sure this can be avoided with something like MAX(CASE … END) however I don’t want to go that far now.

Thanks for the answers in advance.