Could pt-archiver use a non-unique key as chunk index?

Assume I have a table defined as follows:

create table t(id int primary key, c1 int,c2 int, idx_c2(c2))engine=innodb

I want to archive data satisfying the following condition:

c2>=value

When I use --dry-run to run pt-archiver, it always picks up the primary key as the chunk index unless specifying the i part in the --source option.

SELECT /*!40001 SQL_NO_CACHE */ `id`,`c1`,`c2` FROM `test`.`t` FORCE INDEX(`idx_c2`) WHERE (c2>=6350) ORDER BY `c2` LIMIT 1000

Can this guarantee no duplicate rows to be archived? Because the SELECT generated has no guarantee it will not produce duplicate rows.

In fact, I expect the generated SELECT like the following:

SELECT /*!40001 SQL_NO_CACHE */ `id`,`c1`,`c2` FROM `test`.`t` FORCE INDEX(`idx_c2`) WHERE  ((c2=? and id>? ) or c2>?) ORDER BY `c2`,`id` LIMIT 1000

and the ? is the last value for the last iteration.

1 Like

pt-archiver needs a unique index in order to “traverse” the table and to ensure that only 1 row is deleted after being inserted into the destination. If you tell pt-archiver to use a non-unique index, then there is no guarantee that you won’t have duplicates or delete the wrong data.

You might try creating an index on (PK, c2), which won’t give you direct uniqueness, but indirect since it uses the PK as a component.

1 Like

Sorry it’s a long time when I post this ticket. But as the idx_c2 actually contains the primary key id internally(In fact every secondary index will contain the primary key columns to make the indexed value is unique in the secondary index tree). So I am curious why pt-archiver does not use that part to guarantee the correctness.