I’m writing for advice on the most efficient practices for archiving data into smaller, horizontally partitioned tables.
I have a logging-type table (‘logs’ for lack of a better name) which is currently taking on about 200K records daily. The table is growing fast enough that I would like to horizontally partition it into smaller, date-centric tables (i.e. logs_YYYYMM). There would be a primary ‘logs’ table that takes the production inserts and serves as the staging area for migrating log data into the currently applicable archive table (not to be confused with the ARCHIVE storage engine).
The time interval is still up in the air, but my initial plan is to run an hourly archiving process to move data from ‘logs’ into the proper archive table. All tables will use the MyISAM storage engine. I was planning not to have an auto_increment field in the ‘logs’ table, as its records are just going to be cleaned out as they get archived; the archive tables, however, would have an auto-incrementing id primary key. Is it a bad decision not to have an id field on the main ‘logs’ table?
My primary question is: what is the most efficient way to perform this type of data migration, with the least impact on the production logging into the ‘logs’ table, in a MySQL 5.0 environment?
First I should ask you a question: how exactly are you going to use this data? Depending on the answer, different structures may be appropriate.
On some of my servers I log profiling information this way, with one log table per day. I use no indexes, as the reporting GROUP BY queries run just fine doing full table scans.
If I need a report over a longer period of time, say a week, I create a temporary MERGE table and run the query against it.
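For example (a sketch; the table and column names are illustrative, and a MERGE table requires the underlying daily MyISAM tables to have identical definitions):

    -- Point a temporary MERGE table at a week's worth of daily tables.
    CREATE TEMPORARY TABLE logs_week (
      created     DATETIME NOT NULL,
      resource_id INT UNSIGNED NOT NULL
    ) ENGINE=MERGE UNION=(logs_20060109, logs_20060110, logs_20060111,
                          logs_20060112, logs_20060113, logs_20060114,
                          logs_20060115);

    -- One GROUP BY query now scans the whole week.
    SELECT resource_id, COUNT(*) AS hits
    FROM logs_week
    GROUP BY resource_id;

    DROP TEMPORARY TABLE logs_week;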
Speaking about auto_increment: rows can be renumbered as they are inserted into the archive tables, but it all depends on what you want to achieve.
The data being stored is user playback data for resources available on our site: a creation date, IP address, user identifier (if available), resource identifier, referral URL, and a couple of TINYINT-type boolean fields for further internal statistical breakdown.
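Roughly, the pair of table definitions I have in mind looks like this (column names are simplified and illustrative; the archive table adds the auto-incrementing id mentioned above):

    CREATE TABLE logs (
      created     DATETIME NOT NULL,
      ip          INT UNSIGNED NOT NULL,
      user_id     INT UNSIGNED NULL,
      resource_id INT UNSIGNED NOT NULL,
      referral    VARCHAR(255) NULL,
      flag_a      TINYINT(1) NOT NULL DEFAULT 0,
      flag_b      TINYINT(1) NOT NULL DEFAULT 0
    ) ENGINE=MyISAM;

    -- Archive tables are identical except for the added primary key,
    -- so rows can be copied straight over.
    CREATE TABLE logs_200601 (
      id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
      created     DATETIME NOT NULL,
      ip          INT UNSIGNED NOT NULL,
      user_id     INT UNSIGNED NULL,
      resource_id INT UNSIGNED NOT NULL,
      referral    VARCHAR(255) NULL,
      flag_a      TINYINT(1) NOT NULL DEFAULT 0,
      flag_b      TINYINT(1) NOT NULL DEFAULT 0,
      PRIMARY KEY (id)
    ) ENGINE=MyISAM;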
The raw data in the archived ‘logs’ tables will never be accessed directly by users visiting the site, other than the inserts into the default/current ‘logs’ table. (Even that may change at some point, as I would like to move to a logging mechanism that scrubs the access logs to insert this data, rather than live inserts into the DB from the triggering event.) Instead, there will be processes in place to build slimmer, more efficient data marts for things such as demographic playback statistics per resource. The data marts are what users will access directly for such statistics.
Any new business rule will need to analyze all of the archived log data to put it into an appropriate data mart. However, once a new process is in place, the only data it should need to look at is the new data coming into the system (the current/primary ‘logs’ table or the current ‘logs_YYYYMM’ table).
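For example, a per-resource playback mart might be populated with something like this (the mart name and columns are hypothetical):

    CREATE TABLE mart_resource_plays (
      play_date   DATE NOT NULL,
      resource_id INT UNSIGNED NOT NULL,
      plays       INT UNSIGNED NOT NULL,
      PRIMARY KEY (play_date, resource_id)
    ) ENGINE=MyISAM;

    -- Roll the raw archive rows up into the mart, one month at a time.
    INSERT INTO mart_resource_plays (play_date, resource_id, plays)
    SELECT DATE(created), resource_id, COUNT(*)
    FROM logs_200601
    GROUP BY DATE(created), resource_id;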
So again, what I really want to know is: what are the best data migration practices inside a production environment, with the least impact on production (non-system-level) activity?
Thanks again,
Clint
Users will never directly have access to this log (playback) data.
If you’re not happy with the forum response speed, you’re free to request consulting via consulting at mysqlperformanceblog com.
I only respond on the forums as time permits.
Speaking about your question: if you mainly migrate data to the warehouse for querying, and do it rather frequently, you can simply have one log table per day in MyISAM or even ARCHIVE format.
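For instance (a sketch; the ARCHIVE engine compresses rows and supports only INSERT and SELECT, with no indexes, which suits append-only log data):

    CREATE TABLE logs_20060115 (
      created     DATETIME NOT NULL,
      resource_id INT UNSIGNED NOT NULL
    ) ENGINE=ARCHIVE;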
Yes, that was my plan: to break the tables up into daily or monthly partitions. My question was about the most efficient way to move the data over. The task is frequently loading tens of thousands of records into one table and deleting them from the source table, all while the source table is being populated directly by user actions. I was curious as to what you consider the most efficient practice for loading the data (a mass SELECT … INSERT, mysqldump and import, incremental select-then-insert, some super cool thing I haven’t heard of, etc.).
No, there is no need to delete. DELETE is slow, while creating or truncating tables is fast.
After you have loaded the data from the active table, which can be done with INSERT … SELECT, you can simply drop it or truncate it if you like.
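A minimal sketch of that copy-then-truncate step, using the logs_YYYYMM naming and the illustrative columns from the schema above (the column lists are explicit because the archive table adds the auto_increment id):

    INSERT INTO logs_200601 (created, ip, user_id, resource_id, referral, flag_a, flag_b)
    SELECT created, ip, user_id, resource_id, referral, flag_a, flag_b
    FROM logs;

    -- Caveat: rows inserted into logs between the SELECT and the TRUNCATE
    -- would be lost, which is one reason to swap in a fresh table instead
    -- (see the next reply).
    TRUNCATE TABLE logs;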
In your case I would probably use different tables for the data stream insertion and for integrating the data into your database (see the sketch below).
You can have an always-running asynchronous import process, which would be quite interesting in the sense of being self-adapting: under light load it will cycle quickly and reduce the data loading delay; under high load it will adapt and process data in bulkier batches.
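One way to implement the two-table idea is an atomic swap; a minimal sketch (the ‘logs_incoming’ and ‘logs_batch’ names are illustrative; RENAME TABLE performs both renames in a single atomic operation, so production inserts never see a missing ‘logs’ table):

    -- Each cycle of the import process:
    CREATE TABLE logs_incoming LIKE logs;   -- empty table, same structure

    RENAME TABLE logs TO logs_batch,        -- atomic swap: new inserts now
                 logs_incoming TO logs;     -- land in the fresh table

    -- Integrate the captured batch at leisure, then throw the table away
    -- (DROP is fast, unlike row-by-row DELETE).
    INSERT INTO logs_200601 (created, ip, user_id, resource_id, referral, flag_a, flag_b)
    SELECT created, ip, user_id, resource_id, referral, flag_a, flag_b
    FROM logs_batch;

    DROP TABLE logs_batch;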