How to import large datasets

Hi,
I’m very new to MySQL performance tuning and have some questions about a project I’m working on.

I guess before I say anything, the project is an e-commerce store.

I have two tables right now I’m concerned about:

  1. Product Information - 1 record per product; contains the product name, description, etc.
    There will be roughly 3 million records in here (MyISAM at the moment)

  2. Inventory Table - quantity information for current products at different warehouses (InnoDB)
    There will be roughly 1.5 million records in here

I receive a ~1 GB flat file (~3.5 million records) from my distributor every 3 days containing all the product information, and another ~200 MB flat file every day containing the entire inventory (~1.5 million records) at the distributor’s warehouses.

These flat files aren’t merely updates, though; they contain all the data over again, just with modifications or new records. So I’m struggling to find the best way to import this data quickly, without the import affecting front-end shoppers.

I first tried to tackle this by writing scripts that performed LOAD DATA INFILE ... REPLACE. It took about 22 minutes to import 3.3 million records into my products table, but the table was locked the entire time. I tried adding the CONCURRENT option, but it was still locked against read queries (while researching I found a bug report saying CONCURRENT was broken in 5.0.19-5.0.30?). I guess I’ll just make sure the server I deploy this on doesn’t have that bug, or should I be doing this whole thing a different way?
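For reference, what I’m running is roughly this (the file path and delimiters here are placeholders, not my actual feed):

  LOAD DATA CONCURRENT INFILE '/data/feeds/products.txt'
  REPLACE INTO TABLE products
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';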

As for the inventory table, I need to do something similar, but as far as I understand, because this table is InnoDB the CONCURRENT flag won’t help and the entire table will be read-locked while I run the LOAD DATA command. What’s the best way to replace/add records in the inventory table (InnoDB) without locking the entire table while I’m importing? Do I just do it in a batch SQL file? Should I do it in small bursts? Are there any tricks to doing these mass inserts or updates?
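Just so it’s clear what I mean by small bursts, I was picturing something like this, with the column names being guesses at my own schema:

  INSERT INTO inventory (sku, warehouse_id, qty)
  VALUES ('ABC0000000123', 1, 40),
         ('ABC0000000124', 1, 12)   -- a few thousand rows per statement
  ON DUPLICATE KEY UPDATE qty = VALUES(qty);

with a short pause between statements so front-end queries can get in. Is that the right idea?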

Thank you guys so much for your help. Hopefully the above is not too vague.

Cheers,
Ray

Hi,

You’re saying it is the full data set all over again?
Why don’t you load it into table_new and then rename that table?

That should be the most efficient way.
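A rough sketch, assuming your table is called products and the file matches its layout:

  CREATE TABLE products_new LIKE products;
  LOAD DATA INFILE '/path/to/products.txt' INTO TABLE products_new;

Nobody reads from products_new while you load it, so the lock does not matter; shoppers keep using the old table. When the load is done, you rename products_new in place of products.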

OK, I like that idea. Thank you so much for your help. A couple of other questions: how do I drop the old table and rename the shadow table to the old one’s name in one command?

Also, I was just reading about myisampack. Would it make sense to run this on the shadow table before I rename it, to increase performance, since this table will never be edited? If so, do I need to run OPTIMIZE TABLE before or after packing it, or does that not make a difference?

Thanks
Ray

Oh, ignore my first question. I didn’t realize that RENAME TABLE was already atomic.
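So if I follow, the swap is just (using my table names):

  RENAME TABLE products TO products_old, products_new TO products;
  DROP TABLE products_old;

and there’s never a moment when products doesn’t exist.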
I’m still a little foggy on myisampack, though. I’d like to know whether it makes sense to run it on my table before renaming it, to increase read speed, and whether I need to run OPTIMIZE TABLE at all.
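If it does make sense, the sequence I have in mind is roughly this (untested, and the paths assume a datadir of /var/lib/mysql with a database named shop):

  myisampack /var/lib/mysql/shop/products_new.MYI
  myisamchk -rq --sort-index /var/lib/mysql/shop/products_new.MYI

followed by FLUSH TABLES so the server reopens the packed table.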

Thanks!

Hey Peter, thanks again. I finally wrote some scripts to do the importing and it is running very fast!
I noticed in one of your articles you mentioned the LOAD INDEX INTO CACHE command. I’d like to use it on my shadow table right before renaming, but unfortunately the indexes have different block sizes. I’m a little unclear on what determines the block size. Is it a combination of column type and size? If so, would it be worth increasing the size, or changing the type of one of my index columns to match the other, in order to take advantage of this command? The two indexes on the table are a varchar(13) (block size 4096) and a tinyint(1) (block size 1024).
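For reference, all I want to run right before the rename is:

  LOAD INDEX INTO CACHE products_new;

but, as far as I can tell, it refuses to preload the table because the two indexes use different block sizes.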

Thanks again for all your help,
Ray

Right,

Different key types have different block sizes.

If LOAD INDEX INTO CACHE does not work, you can use a set of queries that scan the indexes instead; this is not going to be much slower for sorted indexes.

You can do it, for example, with a query of the form:

  SELECT SUM(col) FROM tbl WHERE col != const;

Such a query will typically do a full index scan.
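For example, assuming your varchar(13) column is called sku and its index is named sku_idx, something like this would walk the whole index on the shadow table:

  SELECT COUNT(*) FROM products_new FORCE INDEX (sku_idx) WHERE sku != '';

Because the query only needs values that are in the index, MySQL reads the index rather than the rows, which pulls the index blocks into the key cache.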