Efficient Data Archiving Practices?

Hi,

I’m writing for advice on the most efficient way to archive data into smaller, horizontally partitioned tables.

I have a logging-type table (‘logs’, for lack of a better name) which is currently taking in 200K records a day. The table is growing fast enough that I would like to horizontally partition it into smaller, date-centric tables (e.g. logs_YYYYMM). There would be a primary ‘logs’ table that takes the production inserts and serves as the staging area for migrating log data into the currently applicable archive table (not to be confused with the ARCHIVE storage engine).

The time interval is still up in the air, but my initial plan is to run an hourly archiving process to move data from ‘logs’ into the appropriate archive table. All tables will use the MyISAM storage engine. I was planning on not having an auto_increment field in the ‘logs’ table, since those records are just going to be cleaned out as they get archived; the archive tables, however, would have an auto_increment primary key. Is it a bad decision not to have an id field on the main ‘logs’ table?
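For illustration, the layout I have in mind would look something like this (the column names here are just placeholders, not my real schema):

-- main table takes the production inserts; no auto_increment id
CREATE TABLE logs (
  created_at DATETIME NOT NULL,
  resource_id INT UNSIGNED NOT NULL
) ENGINE=MyISAM;

-- monthly archive table gets the surrogate key
CREATE TABLE logs_200706 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME NOT NULL,
  resource_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (id)
) ENGINE=MyISAM;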

My primary question is: what is the most efficient way to perform this type of data migration, with the least impact on production logging into the ‘logs’ table, in a MySQL 5.0 environment?

Any advice would be greatly appreciated.

Thanks,

Clint

Clint,

First I should ask you a question: how exactly are you going to use this data? Depending on that, different structures may be appropriate.

On some of my servers I log profiling information this way, one log table per day. I use no indexes, as the reporting GROUP BY queries can run just fine doing full table scans.
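Roughly like this, for example (the table and column names are made up for the illustration):

CREATE TABLE profile_log_20070615 (
  logged_at DATETIME NOT NULL,
  query_type VARCHAR(32) NOT NULL,
  wall_time FLOAT NOT NULL
) ENGINE=MyISAM;

-- the reporting query simply scans the whole day's table, no index needed
SELECT query_type, COUNT(*) AS cnt, AVG(wall_time) AS avg_time
FROM profile_log_20070615
GROUP BY query_type;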

If I need a report over a longer period of time, say a week, I create a temporary MERGE table and run the query against it.
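Something along these lines, assuming the daily tables all share the same definition (names again illustrative):

-- the MERGE table must have the same column definitions as the daily tables
CREATE TABLE profile_log_week (
  logged_at DATETIME NOT NULL,
  query_type VARCHAR(32) NOT NULL,
  wall_time FLOAT NOT NULL
) ENGINE=MERGE UNION=(profile_log_20070611, profile_log_20070612, profile_log_20070613) INSERT_METHOD=NO;

SELECT query_type, COUNT(*) AS cnt, AVG(wall_time) AS avg_time
FROM profile_log_week
GROUP BY query_type;

-- dropping the MERGE table removes only the definition, not the underlying tables
DROP TABLE profile_log_week;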

Speaking about auto_increment: you can renumber rows as they are inserted, but it all depends on what you want to archive.

Hi Peter,

Thank you for your quick response.

The data being stored is user playback data from resources available on our site. Each record holds a creation date, IP address, user identifier (if available), resource identifier, referral URL, and a couple of tinyint-type boolean fields for further internal statistical breakdown.

The raw data in the archived ‘logs’ tables will never be directly accessed by users visiting the site, other than the inserts into the default/current ‘logs’ table. (This may change at some point, as I would like to move to a logging mechanism that scrubs the access logs to insert this data, rather than doing live inserts into the DB from the triggering event.) Instead, there will be processes in place to build slimmer, more efficient data marts for things such as demographic playback statistics per resource. The data marts are what the users will directly access for such statistics.
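For example, a per-resource playback mart might be refreshed with something roughly like this (hypothetical names, not our actual schema):

CREATE TABLE mart_daily_plays (
  play_date DATE NOT NULL,
  resource_id INT UNSIGNED NOT NULL,
  play_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (play_date, resource_id)
) ENGINE=MyISAM;

-- aggregate one month's archive table into the mart
INSERT INTO mart_daily_plays (play_date, resource_id, play_count)
SELECT DATE(created_at), resource_id, COUNT(*)
FROM logs_200706
GROUP BY DATE(created_at), resource_id;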

Any new business rule will need to analyze all the archived log data to put it into an appropriate data mart. However, once a new process is in place, the only data that should need to be looked at is the new data coming into the system (the current/primary ‘logs’ table or the current ‘logs_YYYYMM’ table).

So again, what I really want to know is: what are the best data migration practices in a production environment that will have the least impact on production (non-system) activity?

Thanks again,

Clint

Users will never directly have access to this log (playback) data.

Hi Peter,

How can I talk you into sharing some wisdom on this subject? :wink:

clint,

If you’re not happy with the forum response speed, you’re free to request consulting via consulting at mysqlperformanceblog com.

I only respond on the forums as time permits.

Speaking about your question: if you mainly migrate data to the warehouse for querying and do it rather frequently, you can just have one log table per day, in MyISAM or even ARCHIVE format.
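For instance (illustrative columns; note that in 5.0 the ARCHIVE engine supports only INSERT and SELECT and has no indexes):

CREATE TABLE logs_20070615 (
  created_at DATETIME NOT NULL,
  resource_id INT UNSIGNED NOT NULL
) ENGINE=ARCHIVE;
-- or ENGINE=MyISAM if you want indexes on the archived data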

Yes, my plan was to break the tables up into daily or monthly partitions. My question was about the most efficient way to move the data over. The task is frequently loading tens of thousands of records into one table and deleting them from the source table, all while the source table is being directly populated by user actions. I was curious what you consider to be the most efficient practice for loading the data (a mass SELECT … INSERT, mysqldump and import, incremental select-then-insert, some super cool thing I haven’t heard of, etc.).

No, no need to delete. DELETE is slow, while creating or truncating tables is fast.

After you have loaded the data from the active table, which can be done with INSERT … SELECT, you can simply drop it or truncate it if you like.

In your case I would probably use different tables for the incoming data stream and for integrating data into your database.
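A rough sketch of one way to put that together, assuming a monthly archive table such as logs_200706 and placeholder column names:

-- create an empty table with the same structure as the active one
CREATE TABLE logs_incoming LIKE logs;

-- atomically swap it in; production inserts now go to the fresh table
RENAME TABLE logs TO logs_batch, logs_incoming TO logs;

-- integrate the batch into the archive, then throw the staging table away
INSERT INTO logs_200706 (created_at, resource_id)
SELECT created_at, resource_id FROM logs_batch;
DROP TABLE logs_batch;

Because the multi-table RENAME is atomic, the production inserts never see a missing ‘logs’ table.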

You can have an always-running asynchronous import process, which would be quite interesting in the sense of being self-adapting: under light load it will cycle quickly and reduce the data loading delay, while under high load it will adapt by moving data in larger, bulkier batches.