Clustered index decision in a large table

Hi,

There is a table (InnoDB) with 200 million rows, like this:

CREATE TABLE t (
  id int(11) NOT NULL auto_increment,
  user_id int(11) NOT NULL,
  date datetime NOT NULL,
  message text,
  PRIMARY KEY (id),
  KEY users_date_idx (user_id, date),
  CONSTRAINT users_fk FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE
) ENGINE=InnoDB;

This table is very INSERT- and SELECT-intensive, with few UPDATE and DELETE statements.

I am having trouble with AUTO-INC lock contention on INSERTs, but I think migrating to 5.1.23 will help with that (right?).
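For context: the 5.1 knob that changes this behavior is the innodb_autoinc_lock_mode server variable. A minimal my.cnf sketch (the surrounding settings are omitted, and the value shown is the 5.1 default):

```ini
[mysqld]
# 0 = "traditional": table-level AUTO-INC lock held until the statement ends
# 1 = "consecutive": 5.1 default; simple INSERTs take only a brief mutex
# 2 = "interleaved": fastest, but unsafe with statement-based replication
innodb_autoinc_lock_mode = 1
```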

Also, I would like to improve SELECT speed without sacrificing INSERT performance too much.

The SELECT is basically “SELECT message FROM t WHERE user_id = n ORDER BY date DESC”;

I am wondering about changing the primary key to (user_id, date, random_number) and removing the id column and the secondary index. The INSERTs are random with respect to user_id, so I am afraid this will slow them down. However, it could improve SELECT performance.
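For concreteness, a sketch of the layout being considered (`tiebreak` is a hypothetical name for the random_number column, and `t_new` a placeholder table name):

```sql
-- Cluster rows by (user_id, date) so each user's messages are stored
-- together on disk; tiebreak only disambiguates same-second inserts.
CREATE TABLE t_new (
  user_id  int NOT NULL,
  date     datetime NOT NULL,
  tiebreak smallint NOT NULL DEFAULT 0,
  message  text,
  PRIMARY KEY (user_id, date, tiebreak),
  CONSTRAINT users_fk2 FOREIGN KEY (user_id) REFERENCES users (id)
    ON DELETE CASCADE
) ENGINE=InnoDB;
```

With this primary key the SELECT becomes a single range scan over the clustered index, with no per-row secondary-index lookup.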

Could someone post some advice on this change, keeping in mind that both INSERTs and SELECTs are important? Thanks!!!

Yes, concurrent InnoDB inserts should be faster with 5.1.23.

How common is it for one user_id to have more than one insert per second?
If that basically never happens, you can go right ahead: drop the id column, create the primary key on (user_id, date), and skip the random column (which is not such a good idea anyway).
(If you have rare cases with a conflicting (user_id, date), you could just insert with a value one second later, since usually a second here or there is not so important.)
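If you go the nudge-the-timestamp route, the retry logic would live in the application; the SQL side is just (values illustrative):

```sql
-- First attempt against a (user_id, date) primary key:
INSERT INTO t (user_id, date, message)
VALUES (42, '2008-03-01 12:00:00', '...');

-- On a duplicate-key error (ER_DUP_ENTRY, 1062), retry with the
-- timestamp pushed forward one second:
INSERT INTO t (user_id, date, message)
VALUES (42, '2008-03-01 12:00:00' + INTERVAL 1 SECOND, '...');
```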

As for inserts going slower due to the random user_id, I think it is unlikely, since there is one less index to update.
And the fact that you get 4 bytes × 200 million rows (no id column) = 800 MB less data in the database should also gain you some speed.

There are about 1.3 million “users” (it’s not actually a user, but it’s a good analogy). There is one insert (of multiple rows) per hour for each user, but about 50–100 threads are doing inserts concurrently, and the table grows fast.

Good tip about the seconds in the date.

I am worried about the downtime this migration will need. Because it’s a PK change, MySQL will have to re-cluster the table, right? Would it be better to export/import the table in this case?

Thanks!

Yes, I think export/import is probably the better solution, and when you do, export the data in (user_id, date) order.
That way the import never has to write anywhere but the end of the table, and InnoDB has an optimization for exactly that; since you already have that index in place on the current table, exporting in that order should be feasible.
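One way to get that ordering, assuming SELECT ... INTO OUTFILE / LOAD DATA INFILE for the transfer (the file path and the t_new table name are placeholders):

```sql
-- Dump in the new primary-key order so the reload is append-only.
SELECT user_id, date, message
  FROM t
 ORDER BY user_id, date
  INTO OUTFILE '/tmp/t_sorted.txt';

-- Reload: rows arrive pre-sorted, so InnoDB only ever appends at the
-- right-hand edge of the clustered index instead of splitting pages.
LOAD DATA INFILE '/tmp/t_sorted.txt'
  INTO TABLE t_new (user_id, date, message);
```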

But I would suggest you start by exporting/importing a smaller batch, maybe 100,000 rows or so, just so you can estimate how long the entire table will take.

To minimize downtime, since you only insert once an hour, you could put a read lock on the table before you start exporting.
Then you import the data into a new table while the old one is still available for reading.
When the entire import is complete, you rename the original table to something old, rename the new table to the original name, and then release the read lock.
After this you can drop the old table.
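Put together as SQL, the swap looks roughly like this (table names are placeholders; note that depending on the server version, RENAME TABLE may refuse to run from the session holding the lock, so the lock and the rename may need to come from separate connections):

```sql
LOCK TABLES t READ;                    -- blocks writers; readers unaffected
-- ... run the export/import into t_new here ...
RENAME TABLE t TO t_old, t_new TO t;   -- atomic, near-instant swap
UNLOCK TABLES;
DROP TABLE t_old;                      -- reclaim once you're confident
```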

This works since the renames are very fast.

The only disadvantage of this approach is that the InnoDB tablespace will grow to twice the needed size, since it temporarily stores the duplicate table.
But maybe that is not a problem?