Optimizing table... Copy to tmp table slow down

Hi,

I’m working on a project where I need to find out what number of items one user has in common with another user.

I’ve simplified my tables while I’m working on optimizing these queries. Here they are:

CREATE TABLE users ( id int(10) unsigned NOT NULL auto_increment, PRIMARY KEY (id)) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;CREATE TABLE items ( id int(10) unsigned NOT NULL auto_increment, PRIMARY KEY (id)) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;CREATE TABLE items_to_users( id int(10) unsigned NOT NULL auto_increment, item_id int(10) unsigned NOT NULL, user_id int(10) unsigned NOT NULL, PRIMARY KEY (id), UNIQUE KEY item_id_2(item_id,user_id)) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

There may be millions of users, millions of items, and, of course, millions of rows in items_to_users. For now, I generated a temporary dataset of 10000 users, 10000 items, and 2 million items_to_users.

Here are the queries I’m using to find users (the top 100) with similar items:

CREATE TEMPORARY TABLE temp_items as SELECT items_to_users.id,item_id,user_id FROM items_to_users WHERE user_id = SOME_USER_ID;SELECT m1.user_id, m2.user_id, count(*) AS in_common FROM temp_items AS m1 INNER JOIN items_to_users AS m2 ON m1.item_id = m2.item_id AND m1.user_id < m2.user_id GROUP BY m1.user_id, m2.user_id HAVING in_common > 0 ORDER BY in_common DESC limit 100;

An explain of the second query shows that it makes a temporary table (I guess due to the group by coming from different tables?) and that step (Copy to tmp) takes a really long time compared to the rest of the query. Is there any way I can cut this time down? Or optimize it?

Thanks for your help! )
Tom

This is a difficult problem. It’s basically an N-dimensional proximity search. I can’t think of a really efficient way to do this in a database – any database – but you might try some kind of fuzzy matching to get you closer, such as creating a signature of a person’s preferences. For example, are the items categorizable? Maybe you can assign each user a string of bits that reflects whether they have an item in a given category, and then do an initial match on this string. Within the results, do an exact match (if needed) to get the final set of users. If I think of anything else I’ll post more.