pt-online-schema-change filter on data moved to new table


The problem: I have a big table and have to delete a substantial part of the data in it, say 60%.

It would be nice to do this the way pt-online-schema-change performs DDL changes, i.e. create a new table, copy over only the records I want to keep, then swap the tables.

The gains are:

  1. The data is moved in small chunks automatically, so I don’t need to worry about replication lag;
  2. Only 40% of the data is written (the rows copied to the new table), instead of modifying 60% of it in place;
  3. All the unused disk space goes back to the OS: the new table consumes space only for the 40% of records that were actually moved.

One solution could be an option that appends a WHERE condition to every chunk of data copied to the new table. Every chunk would then be limited both by the primary-key range (or by another column if --chunk-index is used) and by this arbitrary WHERE condition.
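A minimal sketch of the proposed behaviour, using SQLite in place of MySQL just for illustration. The table and column names (`t`, `new_t`, `id`, `status`) and the filter are made up for the demo; the point is how the per-chunk primary-key boundary combines with the user-supplied WHERE condition:

```python
import sqlite3

# Demo schema: 100 rows, of which 40 match the "keep" filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE new_t (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, "keep" if i % 5 < 2 else "delete") for i in range(1, 101)],
)

CHUNK = 10                 # rows of PK range per chunk (like --chunk-size)
KEEP = "status = 'keep'"   # the arbitrary user-supplied WHERE condition

last_id = 0
while last_id < 100:
    # Each chunk is bounded by the PK range AND the extra filter,
    # so only rows we want to keep ever reach the new table.
    rows = conn.execute(
        f"SELECT id, status FROM t WHERE id > ? AND id <= ? AND {KEEP}",
        (last_id, last_id + CHUNK),
    ).fetchall()
    conn.executemany("INSERT INTO new_t VALUES (?, ?)", rows)
    conn.commit()          # commit per chunk, keeping transactions small
    last_id += CHUNK

kept = conn.execute("SELECT COUNT(*) FROM new_t").fetchone()[0]
print(kept)  # 40 rows matched the filter and were copied
```

The chunking loop is what pt-online-schema-change already does; the only new piece is ANDing the filter onto each chunk's boundary condition.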

It might seem strange that this use case of pt-online-schema-change does not involve a schema change at all, but why not, since the algorithm is almost the same? =)

Hi atmc,

I often use pt-archiver for situations similar to what you are describing. It takes a bit more work, since it does not swap the tables for you automatically, but here is the idea: create a new empty table like your source table, then use pt-archiver with --no-delete to copy your good rows into the new table (you can even copy in chunks, as you want), and finally swap the tables manually. If your source table is live, you’ll have to plan the swap: either stop incoming writes to the table, or lock the source table long enough to copy over any new good rows since your last pt-archiver run, and then rename the tables to swap them. Basically the same thing pt-online-schema-change does, except you do it manually.
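A sketch of that manual workflow. The host, database, table names, and the WHERE filter are placeholders you would replace with your own; the pt-archiver flags used here (--source, --dest, --where, --no-delete, --limit, --commit-each, --progress) are real options:

```shell
# 1. Create an empty copy of the source table.
mysql -e "CREATE TABLE mydb.mytable_new LIKE mydb.mytable"

# 2. Copy only the rows you want to keep, in small chunks,
#    without deleting anything from the source table.
pt-archiver \
  --source h=localhost,D=mydb,t=mytable \
  --dest   h=localhost,D=mydb,t=mytable_new \
  --where  "status = 'keep'" \
  --no-delete --limit 1000 --commit-each --progress 10000

# 3. Once writes are stopped (or the table is locked) and any
#    stragglers are copied, swap the tables in one atomic statement.
mysql -e "RENAME TABLE mydb.mytable     TO mydb.mytable_old,
                       mydb.mytable_new TO mydb.mytable"
```

RENAME TABLE swaps both names in a single atomic operation, which is the same trick pt-online-schema-change uses for its final cut-over.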

So this is not to take away from your suggestion for pt-online-schema-change; it’s just an alternative solution in the meantime. =)


Bump. Can we make this an official feature request?