pt-archiver tuning and performance

Hello,

I would like to archive entries that are more than 9 months old from a 250 GB production table with about 200,000,000 records.
My “where” clause is based on a field that is not indexed, and for various reasons I cannot add an index at the moment.
I have set up a second server and installed pt-archiver on it.
My production database server has 16 GB of memory and 4 vCPUs.
This server must not stop; it plays the role of master and I have a replica.

Can you give me any ideas for tuning parameters?

Without an index, you will full-table-scan constantly and performance will be very bad. There is no parameter that you can change to fix this. It will still work, but it will be slow and may increase load on your master. Have you looked at pt-online-schema-change to add the index?
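
If you do go that route, a rough sketch might look like the following (untested; the host, schema, table, user, and column names are placeholders for yours, and the load thresholds are just the tool's defaults spelled out). Keep in mind that pt-online-schema-change copies the whole table, so it needs roughly the table's size in free disk space:

pt-online-schema-change \
  --alter "ADD INDEX idx_created_at (created_at)" \
  --max-load Threads_running=25 \
  --critical-load Threads_running=50 \
  --dry-run \
  h=prod-master,D=mydb,t=big_table,u=archiver,p=secret

# If the dry run looks sane, swap --dry-run for --execute.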

Thanks for your answer.

I don’t have much experience with pt-online-schema-change.
I also can’t add an index because of disk space, and the database is in a production environment.
I cannot modify anything.
If I run with --limit 1000 and --txn-size 1000 all night long, can I maybe avoid the load?

Hi @kfarrahe
Unfortunately you will still experience significant load even with --limit, since a full table scan will be required to examine all rows in the table. As your purging progresses you will be removing rows from the production table, so the full table scan has fewer and fewer rows to look at each time.
One idea could be to use very large --limit / --txn-size values. This means each DELETE will run for much longer and could potentially lead to locking, but the tradeoff is that each subsequent full table scan will look at many fewer rows.
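
For what it's worth, a run along those lines might look something like this (untested sketch; the host, schema, table, column name, file path, and the exact numbers are placeholders, not recommendations). --bulk-delete removes each chunk with a single DELETE, which pairs well with large chunks:

pt-archiver \
  --source h=prod-master,D=mydb,t=big_table,u=archiver,p=secret \
  --file "/archive/big_table.%Y-%m-%d" \
  --where "created_at < NOW() - INTERVAL 9 MONTH" \
  --limit 50000 --txn-size 50000 \
  --bulk-delete \
  --progress 100000 --statistics \
  --dry-run

Drop --dry-run once the printed queries look right, and replace --file with --dest (or --purge) depending on where you want the archived rows to go.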

Another way to look at this is to see whether an alternative indexed field (the primary key or a secondary index) could be leveraged. For example, if this table uses an auto_increment PK, you may find that records older than your 9-month cutoff are all in a range like 1-1000. Then you can have pt-archiver walk the PK instead. Of course this could be risky if your app allows updates, in which case the range wouldn’t be contiguous, so take this plan with a grain of salt: it might help you identify one big section of records to remove, but I wouldn’t use it to identify all 200 million records you want to evict from the table.
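
If you do find such a contiguous block, the sketch below shows the idea (hypothetical PK values; verify the boundary with a SELECT before deleting anything):

# Hypothetical: suppose you verified that every row with pk <= 120000000 is older than 9 months
pt-archiver \
  --source h=prod-master,D=mydb,t=big_table,i=PRIMARY,u=archiver,p=secret \
  --purge \
  --where "pk <= 120000000" \
  --limit 1000 --txn-size 1000 \
  --dry-run
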
Here’s a quick example using a table with time fields to use the WHERE clause against.

CREATE TABLE `t1` (
  `pk` int unsigned NOT NULL AUTO_INCREMENT,
  `ts` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `dt` datetime DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`pk`)
) ENGINE=InnoDB AUTO_INCREMENT=438 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

I filled it with 400 rows:

mysql> select min(pk),max(pk) from t1;
+---------+---------+
| min(pk) | max(pk) |
+---------+---------+
|       1 |     400 |
+---------+---------+
1 row in set (0.00 sec)

I’m intentionally selecting 200 rows from the middle of the range, which are sequential. Notice how EXPLAIN reports that we will be examining 400 rows (not 200, which is the size of the range, and not 100, which is the LIMIT):

mysql> explain select pk from t1 where ts between '2021-01-28 15:28:26' and '2021-01-28 15:31:03' limit 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 400
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Same query but using LIMIT 10:

mysql> explain select pk from t1 where ts between '2021-01-28 15:28:26' and '2021-01-28 15:31:03' limit 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 400
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

So LIMIT isn’t going to help reduce the load; you’ll have to just “live with it” and let the tool run for likely several weeks (maybe months).
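
If you do end up letting it run that long, pt-archiver at least has some knobs to keep the impact in check while it grinds away; a hedged sketch (the replica host, sleep time, and lag threshold are placeholders):

pt-archiver \
  --source h=prod-master,D=mydb,t=big_table,u=archiver,p=secret \
  --purge \
  --where "created_at < NOW() - INTERVAL 9 MONTH" \
  --limit 1000 --txn-size 1000 \
  --sleep 1 \
  --check-slave-lag h=replica-host --max-lag 10 \
  --progress 100000 --statistics --why-quit

--sleep pauses between fetches, and --check-slave-lag / --max-lag make the tool wait whenever your replica falls behind, which matters since you mentioned you have one.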

Hi @kfarrahe,
building on Michael’s suggestion:
If your table does have an indexed primary key, you might try extending your “where” condition with an additional condition on that PK, so that both the PK and your filter are used, and then handle the table in several smaller chunks.
For instance, if your indexed PK is “id” and your current “where” is “bytes=123456”, try using calls with
“id BETWEEN 1 AND 100000 AND bytes=123456”
“id BETWEEN 100001 AND 200000 AND bytes=123456”
etc.
This might enable the optimizer to use the PK to limit the query to the first 100,000 records instead of all 200,000,000, and if those 100,000 rows can be selected via the PK index, that should make things faster.
You will have to repeat that with different PK ranges until you have covered all 200,000,000 records, but if it makes each single run fast enough, it might help.
I tried it with a simple query on a large table. Didn’t test if it works the same with pt-archiver, but it might be worth a try.
And please start experimenting with SELECT queries first before you try it with pt-archiver. :)
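
If you then want to try the same thing with pt-archiver, a rough way to drive it over PK ranges from a shell loop could look like this (the column names, chunk size, and upper bound are just the placeholders from the example above; test each WHERE with a plain SELECT and --dry-run first, as mentioned):

#!/bin/bash
# Hypothetical chunked purge: walk the PK in 100,000-row ranges so the
# non-indexed condition is only evaluated inside each range.
CHUNK=100000
MAX_ID=200000000

for (( start=1; start<=MAX_ID; start+=CHUNK )); do
  end=$(( start + CHUNK - 1 ))
  pt-archiver \
    --source h=prod-master,D=mydb,t=big_table,i=PRIMARY,u=archiver,p=secret \
    --purge \
    --where "id BETWEEN ${start} AND ${end} AND bytes=123456" \
    --limit 1000 --txn-size 1000 \
    --statistics
done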
