Hi @kfarrahe
Unfortunately you will still see significant load even with --limit, since a full table scan is required to examine every row in the table. As the purge progresses you will be removing rows from the Production table, so each successive full table scan examines fewer and fewer rows.
One idea could be to use very large --limit / --txn-size values. This means each DELETE will run much longer and could potentially cause locking contention, but the tradeoff is that each subsequent full table scan examines far fewer rows.
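For illustration, a large-batch invocation might look something like this (host, schema, table, and column names are placeholders; adjust the WHERE to your actual 9-month cutoff column):

```
# Placeholder names (mydb, Production, created_at) -- adjust to your environment.
# Large --limit / --txn-size values mean fewer, bigger DELETE transactions.
pt-archiver \
  --source h=localhost,D=mydb,t=Production \
  --where "created_at < NOW() - INTERVAL 9 MONTH" \
  --purge \
  --limit 50000 --txn-size 50000 \
  --progress 50000 --statistics
```

Keep in mind the flip side mentioned above: bigger transactions hold row locks longer, so watch for contention while it runs.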
Another way to look at this is to see whether an alternative indexed field (the primary key or a secondary index) could be leveraged. For example, if the table's primary key is AUTO_INCREMENT, you may find that records older than your 9-month cutoff all fall in a range like 1-1000, and you could then point pt-archiver at the PK instead. Of course this could be dangerous if your application updates rows, since then the range wouldn't be contiguous, so take this plan with a grain of salt: it might help you identify one big section of records to remove, but I wouldn't rely on it to identify all 200 million records you want to evict from the table.
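As a sketch of that approach (again with placeholder names): each PK probe is a point lookup on the index, not a scan, so you can binary-search by hand for a pk near the cutoff and then drive the purge off the PRIMARY KEY:

```
# Hypothetical names and pk value. Probe individual PKs (index lookups, no scan)
# and binary-search manually for the pk closest to the 9-month cutoff:
mysql -e "SELECT pk, created_at FROM mydb.Production WHERE pk = 100000000"

# Once a safe boundary pk is known, the WHERE uses the PRIMARY KEY, no full scan:
pt-archiver \
  --source h=localhost,D=mydb,t=Production \
  --where "pk <= 100000000" \
  --purge --limit 1000 --txn-size 1000
```

Verify the boundary carefully before purging, for the "updated rows" reason above.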
Here’s a quick example using a table with time fields to use the WHERE clause against.
CREATE TABLE `t1` (
`pk` int unsigned NOT NULL AUTO_INCREMENT,
`ts` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`dt` datetime DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`pk`)
) ENGINE=InnoDB AUTO_INCREMENT=438 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
I filled it with 400 rows:
mysql> select min(pk),max(pk) from t1;
+---------+---------+
| min(pk) | max(pk) |
+---------+---------+
|       1 |     400 |
+---------+---------+
1 row in set (0.00 sec)
I’m intentionally selecting 200 sequential rows from the middle of the range. Notice how EXPLAIN reports that 400 rows will be examined: not 200 (the size of the range), and not 100 (the LIMIT):
mysql> explain select pk from t1 where ts between '2021-01-28 15:28:26' and '2021-01-28 15:31:03' limit 100\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 400
filtered: 11.11
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Same query but using LIMIT 10:
mysql> explain select pk from t1 where ts between '2021-01-28 15:28:26' and '2021-01-28 15:31:03' limit 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 400
filtered: 11.11
Extra: Using where
1 row in set, 1 warning (0.00 sec)
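For contrast, this is what the PK idea from earlier buys you: the same kind of range predicate on the indexed pk column can use the index, so on this demo table EXPLAIN should report type: range with key: PRIMARY and a rows estimate near 200 instead of a 400-row full scan:

```
mysql> explain select pk from t1 where pk between 100 and 299\G
```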
So LIMIT isn’t going to help reduce the load; you’ll have to “live with it” and let the tool run, likely for several weeks (maybe months).