For the following schema
table t1 (
c1 int not null,
c2 int not null,
primary key (c1,c2)
would it be better (i.e. faster) to import data from file using --replace or --ignore?
let’s assume that the vast majority (99%) of rows to be imported are new rows and the table has billions of rows to start with (before the import).
i am open to any storage engine if it makes a big difference.
Well --replace would require a write in addition to the check for unique value. But since you said that only 1% would be duplicates that time would probably not be noticeable.
Basically if you choose --replace or --ignore will probably not matter for the performance.
As for storage engine, choosing InnoDB with that table structure would mean that you don’t get any duplicate storage of data (for the index) since the two columns are the primary key and innodb tables are clustered on the primary key.
But on the other hand InnoDB has a bunch of hidden “columns” in the table and since you only have two int’s in the table the overhead for these hidden ones will be pretty high.
A MyISAM on the other hand would store the two columns twice, once in the table and once in the primary key together with a pointer to the position in the table. Drawback with MyISAM is that a repair in case of a corrupt table could take a very long time.
In any event the only real way to speed this up is to have so much RAM that the entire table is more or less available in RAM to avoid random reads from the disks.