HUGE Tables and Indexes

I have been given the task of loading 6.5 billion records (yes, that is not a typo: six and a half billion records) into a MySQL database running on Ubuntu 8.04.

I have successfully done so using:
load data local infile 'FILENAME' into table TABLENAME FIELDS TERMINATED BY ',';

This took around 10 hours.
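With a single 10-hour load, any failure means starting over. One option is to split the input file into chunks first, so each LOAD DATA can be retried independently. A minimal sketch using the standard split tool — the sample file, chunk size, and names here are illustrative stand-ins, not details from the original load:

```shell
#!/bin/sh
# Tiny stand-in for the real CSV; in practice the source is the full
# 410 GB file and -l would be something like 10000000 lines per chunk.
seq 1 10 | sed 's/$/,x,y/' > sample.csv
split -l 4 sample.csv chunk_    # -> chunk_aa chunk_ab chunk_ac
ls chunk_*

# Each chunk can then be loaded (and retried) on its own, e.g.:
#   mysql -e "LOAD DATA LOCAL INFILE 'chunk_aa' INTO TABLE TABLENAME
#             FIELDS TERMINATED BY ','"
```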

I used the following procedure to create the INDEXES:

Create the data without indexes
Move (rename) the table.MYD file to a backup
Create an empty MYD file using TRUNCATE TABLE
Add indexes
Move the backed up MYD file back to table.MYD
Run myisamchk to rebuild the indexes:
myisamchk -q -r -O key_buffer=4096M -O sort_buffer=2048M -O read_buffer=640M -O write_buffer=640M

This procedure took around 15 hours.
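The six-step procedure above can be sketched as a shell script. This is a dry-run version that only echoes each command rather than executing it; DATADIR, DB, TABLENAME, and the indexed column are placeholders I've assumed, not details from the original post:

```shell
#!/bin/sh
# Dry-run sketch of the load-then-index procedure: remove the echoes
# (and quote escaping) to actually run it. All paths and names below
# are placeholders.
DATADIR=/var/lib/mysql
DB=mydb
T=TABLENAME

echo "mysql $DB -e \"LOAD DATA LOCAL INFILE 'FILENAME' INTO TABLE $T FIELDS TERMINATED BY ','\""
echo "mv $DATADIR/$DB/$T.MYD $DATADIR/$DB/$T.MYD.bak"        # back up the data file
echo "mysql $DB -e 'TRUNCATE TABLE $T'"                      # leave an empty MYD behind
echo "mysql $DB -e 'ALTER TABLE $T ADD INDEX idx_col (col)'" # build the (empty) index
echo "mv $DATADIR/$DB/$T.MYD.bak $DATADIR/$DB/$T.MYD"        # restore the real data
echo "myisamchk -q -r -O key_buffer=4096M -O sort_buffer=2048M -O read_buffer=640M -O write_buffer=640M $DATADIR/$DB/$T.MYI"
```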

Once complete, the file sizes were:
MYD → 410 GB
MYI → 200 GB

Server spec:
Dell 2950, dual quad-core @ 2.5 GHz
32 GB RAM
Data on 7 × 7200 RPM (striped) iSCSI-connected disks.

At the end of the project I will have 10 times the amount of data to load.

I would be interested to hear comments/ideas on ways to improve the load and index time, or any other suggestions for working with this quantity of data.



It may be possible, depending on the nature of the data, to shard it across multiple tables (and servers). Doing so should decrease indexing time, and perhaps insertion time slightly.
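One common way to shard is to hash each record's key and route by modulo, so every key deterministically maps to one table (or server). A minimal sketch — the key value, shard count, and the idea of naming shard tables TABLENAME_N are my assumptions for illustration:

```shell
#!/bin/sh
# Hash-based routing: cksum's CRC is deterministic, so a given key
# always lands on the same shard. N=4 shards is an arbitrary example.
N=4
shard_for() {
    sum=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( sum % N ))
}

a=$(shard_for 12345)
b=$(shard_for 12345)    # same key -> same shard, every time
echo "key 12345 -> shard $a (e.g. table TABLENAME_$a)"
```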

Is there a reason you’re using MyISAM? Or would you be able to look into other storage engines?

Yes, I could consider using another storage engine.
What would you suggest?
Can you elaborate a little on the sharding?

So, you need to import both data and indexes. You could create a temporary table without any indexes and the main table with indexes. Then you import the rows into the temporary table, which may take a while. After that, you create a script that fetches some rows from the temporary table and puts them into the main table. You could run that script manually, but cron should do the trick. For example:

    * * * * * /path/to/the/php /path/to/the/script

will run the importing script every minute. This should be faster than playing with files and fixing indexes.
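The commenter's import script itself isn't shown. One way to sketch the idea is a script that, on each cron run, moves the next bounded batch of rows from the temporary table into the indexed main table. This version only generates the SQL a single run would execute; temp_table, main_table, the id column, and the batch size are placeholder names I've assumed:

```shell
#!/bin/sh
# Emit the SQL one cron invocation would run: copy the next BATCH rows
# (in primary-key order) into the main table, then remove them from the
# temporary table so the next run picks up where this one left off.
BATCH=100000
SQL="INSERT INTO main_table SELECT * FROM temp_table ORDER BY id LIMIT $BATCH;
DELETE FROM temp_table ORDER BY id LIMIT $BATCH;"
echo "$SQL"

# Piped to the mysql client, this would process BATCH rows per minute
# under the cron line above.
```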


  1. Why would that be faster?
  2. Can you explain the import script?
  3. Why would I need to cron it, would this not be a one time action?

Will this data be modified after loading? If not, consider the Archive engine.