I posted this in the SO DBA forum and it was recommended I try here:
Basically, I have a 1.9 billion row DB that I need to do an initial large archive of and then I’ll keep a rolling 3-month window of data using pt-archiver. I’ve written a script to run pt-archiver 1 day at a time using the code referenced in that link.
I’ve noticed using --dry-run that it limits results to id < 999999 and with a table this large I far exceed that limit. How do I modify this limit? I see no option in the pt-archiver documentation that explains how I can remove it. When I try running my script I get data for some dates and then it just falls off when I know I have data in the DB. If I run the --dry-run queries without that ID limitation they return data.
Thanks in advance for your assistance.
Someone from Percona who has worked on this project might know more details, but below is my $.02 as someone who has used the tool quite a bit at least.
pt-archiver chunks the rows, so I believe what you are seeing with the id < 999999 where clause is the first chunk printed by dry-run, which is then split up further by your 1000 limit. To test that theory, just run your pt-archiver command with --no-delete and then kill it after you are comfortable with what it is doing, as you can watch the SELECT queries it is generating until you see the id < 999999 where clause change to something larger. The --no-delete option will “archive” the rows, but not delete them from the source so you can then re-run pt-archiver later once you are ready to start deleting the records.
That’s what I was saying - I ran it and it failed to pick up data after a certain date. The data is chronological and it is odd that it just plain stopped getting data after that date. The only explanation I could think of was the ID being > 999999.
How does it know how high to chunk those ranges? I assume from the next PK?
I’ll do a little more debugging tomorrow. Thanks, Scott.
Not sure on the chunking specifics, so hopefully an engineer from Percona can chime in on that.
So when you say it stops getting data, does that mean if you run a SELECT yourself where date_id=$dateid, it pulls back more data that pt-archiver copied doing the same thing?
Do you have any logging for this as well? Could be worth running pt-archiver in a screen session with --progress enabled so you can see what it’s doing and where it stops.
Basically I’d remove the script all together (you may have already done that), and just test a single pt-archiver statement and verify the results that way.
The other variable is the ascend-first option, as I’ve never used that. There could be a bug there that might limit it, so that is entirely possible.
I figured it out. I imported 1M rows from my real DB into a test DB. I had the script pointing at the test DB on accident so pt-archiver was just looking for any ID in the system in the test DB up to a maximum. When I point it at the real DB I get a much larger number.
Thanks for the help, Scott.
Ah, glad you figured it out! The simple answers tend to be the hardest to find, but great when you finally find them and it’s an easy fix anyway.
I’d be interested to hear how your archive initiative turns out overall once you get the table cut down in size. I deal with 2TB+ tables myself (and around 15k MySQL instances), and it’s rare to find other real world examples of people dealing with truly large data sizes / volumes in MySQL. =)