Any suggestions to a query with multiple joins?

I’m trying to get my query to run quicker if possible. The query I have is:

SELECT COUNT( AS collectionCount FROM collection_categories AS p2 , collection_categories AS n2, collections c, collection_items d, commerce_items e WHERE n2.lft BETWEEN p2.lft AND p2.rgt AND n2.approved=1 AND p2.lft > 0 AND (n2.categoryid = c.categoryid) AND (p2.categoryid = 40) AND (d.collectionid=c.collectionid) AND (e.itemid=d.itemid) AND AND e.state < 2 AND UNIX_TIMESTAMP(dateExpires)-UNIX_TIMESTAMP(now()) > 0;

The explain table shows:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p2 ref PRIMARY PRIMARY 4 const 1 Using where
1 SIMPLE e ref itemid,state,commerce_idx commerce_idx 2 const 19375 Using where
1 SIMPLE d eq_ref PRIMARY,collectionid PRIMARY 4 collecti_collectica.e.itemid 1
1 SIMPLE c eq_ref PRIMARY,categoryid,cat_coll,last_upd PRIMARY 4 collecti_collectica.d.collectionid 1
1 SIMPLE n2 ref PRIMARY,approved PRIMARY 4 collecti_collectica.c.categoryid 1 Using where

I’m trying to get the query to run < 1 s where it’s now taking 1.5 s for only 50k records (which is not scalable).

Any suggestions would be really appreciative.

One change I’d make is to reword that last portion of the WHERE clause. The way you’ve got it right now, it’s forcing MySQL to calculate the value of UNIX_TIMESTAMP(dateExpires)-UNIX_TIMESTAMP(now()) for every record. It would be much more straightforward to simply compare the datetime field itself, like this:

… AND dateExpires > NOW()

I don’t know all the ins and outs of MySQL’s optimizer, so it’s possible that it’s smart enough to realize what you’re trying to do and rearrange it for you, but I wouldn’t bet on it. I bet you’ll see a speed increase if you word it as I suggested. Give that a try and see if it helps.

[B]tnguyen wrote on Tue, 26 June 2007 15:22[/B]

SELECT COUNT( AS collectionCount FROM /* 1 / collection_categories AS p2, / 2 / collection_categories AS n2, / 3 / collections c, / 4 / collection_items d, / 5 / commerce_items e WHERE / 1 / n2.lft BETWEEN p2.lft AND p2.rgt AND / 2 / n2.approved=1 AND / 3 / p2.lft > 0 AND / 4 / (n2.categoryid = c.categoryid) AND / 5 / (p2.categoryid = 40) AND / 6 / (d.collectionid=c.collectionid) AND / 7 / (e.itemid=d.itemid) AND / 8 / AND / 9 / e.state < 2 AND / 10 */ UNIX_TIMESTAMP(dateExpires)-UNIX_TIMESTAMP(now()) > 0;

Always write your queries in a readable manner while you are developing. Just because it returns the right result doesn't mean it's right. You'll have a hard time noticing your code (especially when you get distracted by "Eureka, it works!") if you cannot read it.

A few things I notice:

  1. (As was mentioned) where#10 should be dateExpires>now()
  2. As entered, from#2 is the same as from#1 and because of that…
  3. Unless I am missing something, where#1 will always be true. see: Between Function
  4. Also because of thing#2, where#4 and where#5 are equivalent to c.categoryid=40
  5. I think you may have gratuitous parenthesis in there. It doesn’t seem that you need any. This hurt the optimizer, according to Paul DuBois, IIRC.
  6. Looks like where#6 and where#7 are opportunities to use inner joins instead of cross joins. You should investigate this.
  7. You may find that breaking your joins up into a series of CREATE TEMPORARY TABLE AS SELECT… statements will give you much better performance. It seems that you are creating a Cartesian product.
  8. Keep in mind that taking a 2 table cross-join and adding a 3rd does NOT give you a 50% performance hit, or even a 100% performance hit. Effectively, you experience a [number of rows in the 3rd table] * 100% performance hit.
  9. I would suggest watching iostat while you do this query too. You do not want this query to cause disk writes, only disk reads. The real killer in these kinds of things is the disk I/O.

That’s all the time I have. I’m sure there is more, but post your new query here and we can continue.

Thanks so much Richard. Your suggestions were extremely helpful. I actually modified the query a little to be this:

count() AS count
1 / commerce_items a
2 / INNER JOIN collection_items b ON (a.itemid=b.itemid)
3 / INNER JOIN collections d ON (b.collectionid=d.collectionid)
4 / INNER JOIN collection_categories e ON (d.categoryid=e.categoryid)
5 / INNER JOIN users f ON (f.userid=d.userid)
6 / AND
7 / a.state < 2 AND
8 / dateExpires-now() > 0
9 / AND e.categoryid IN (
10 */ collection_categories e
e.categoryid=3 OR e.categoryid=250 OR e.categoryid=116 OR e.categoryid=249 OR e.categoryid=123
OR e.categoryid=122 OR e.categoryid=121 OR e.categoryid=120 OR e.categoryid=115 OR e.categoryid=134
OR e.categoryid=133 OR e.categoryid=132 OR e.categoryid=131 OR e.categoryid=130 OR e.categoryid=129
OR e.categoryid=128 OR e.categoryid=127 OR e.categoryid=126 OR e.categoryid=125 OR e.categoryid=124
OR e.categoryid=114 OR e.categoryid=119 OR e.categoryid=118 OR e.categoryid=117 OR e.categoryid=101
OR e.categoryid=106 OR e.categoryid=105 OR e.categoryid=104 OR e.categoryid=103 OR e.categoryid=102
OR e.categoryid=37 OR e.categoryid=109 OR e.categoryid=108 OR e.categoryid=107 OR e.categoryid=36
OR e.categoryid=35 OR e.categoryid=34 OR e.categoryid=90 OR e.categoryid=89 OR e.categoryid=88
OR e.categoryid=87 OR e.categoryid=86 OR e.categoryid=85 OR e.categoryid=84 OR e.categoryid=59
OR e.categoryid=33 OR e.categoryid=58 OR e.categoryid=57 OR e.categoryid=56 OR e.categoryid=55
OR e.categoryid=32 OR e.categoryid=100 OR e.categoryid=99 OR e.categoryid=98 OR e.categoryid=97
OR e.categoryid=96 OR e.categoryid=95 OR e.categoryid=94 OR e.categoryid=93 OR e.categoryid=92
OR e.categoryid=62 OR e.categoryid=61 OR e.categoryid=60 OR e.categoryid=31 OR e.categoryid=135
OR e.categoryid=91 OR e.categoryid=83 OR e.categoryid=82 OR e.categoryid=81 OR e.categoryid=80
OR e.categoryid=79 OR e.categoryid=78 OR e.categoryid=77 OR e.categoryid=76 OR e.categoryid=75
OR e.categoryid=74 OR e.categoryid=73 OR e.categoryid=72 OR e.categoryid=71 OR e.categoryid=70
OR e.categoryid=69 OR e.categoryid=68 OR e.categoryid=67 OR e.categoryid=66 OR e.categoryid=65
OR e.categoryid=64 OR e.categoryid=63 OR e.categoryid=30 OR e.categoryid=54 OR e.categoryid=53
OR e.categoryid=52 OR e.categoryid=51 OR e.categoryid=50 OR e.categoryid=49 OR e.categoryid=48
OR e.categoryid=47 OR e.categoryid=29 OR e.categoryid=253 OR e.categoryid=252 OR e.categoryid=251
OR e.categoryid=248 OR e.categoryid=110 OR e.categoryid=46 OR e.categoryid=45 OR e.categoryid=44
OR e.categoryid=43 OR e.categoryid=42 OR e.categoryid=41 OR e.categoryid=40 OR e.categoryid=39
OR e.categoryid=38 ) ;

It still runs about 2 s. Any suggestions is appreciated.


A few more things:
[] Try running: ANALYZE TABLE commerce_items, collection_items, collections, collection_categories, users; If that helps matters, you will want to make sure you re-run it regularly, based on the frequency of updates to the tables in question.
] Can wee see the result of: SHOW INDEX FROM commerce_items; SHOW INDEX FROM collection_items; SHOW INDEX FROM collections; SHOW INDEX FROM collection_categories; SHOW INDEX FROM users;
[] Can we see the result of: EXPLAIN [your new query]
] Why did you introduce the users table? (It doesn’t appear to be used anywhere)
[*] What table is dateExpires coming from? I would guess it’s table d, but you should use table.column notation for the sake of clarity. Here is the Biggie, your sub-select. I want to help you gain the skill off logically deciding what outcomes are possible, so you can then decide the most logical why to achieve them.

Pertaining only to the sub-select:
[] My first instinct would be to say that your where clause should be simplified to: where e.categoryid IN (3,250,116,249,123,122,121,120,115,134,133,132,131,130,129,1 28,127,126,125,124,114,119,118,117,101,106,105,104,103,102,3 7,109,108,107,36,35,34,90,89,88,87,86,85,84,59,33,58,57,56,5 5,32,100,99,98,97,96,95,94,93,92,62,61,60,31,135,91,83,82,81 ,80,79,78,77,76,75,74,73,72,71,70,69,68,67,66,65,64,63,30,54 ,53,52,51,50,49,48,47,29,253,252,251,248,110,46,45,44,43,42, 41,40,39,38)
] Let us take this further. If out of the big long nasty list, we can be assured that every value will be present, we could drop the entire sub-select and your#9 could simply be the IN(…) condition that I specified above. Right? Are you with me so far? However, (let us call this point 2b) if only a few of the values in the nasty list are present (like maybe “250, 100, 50”) then your#9 will only be compared to those 3 instead of all 111. (yes I counted them.) That is to say, your#9 would be logically equivalent to: AND e.categoryid IN (250,100,50) Meaning that dropping the sub-select is not safe if you are unsure as to where or not all the values are present. Let us take this further, or (should I say) look at a broader view. Since the column being compared in your#9 is the same column being tested in the where clause of your sub-select, it doesn’t matter whether my point 2b returns the a sub-set of the nasty list or the whole list. That is to say, the sub-select adds nothing of value here. Logically, if point 2b is true then it makes no difference if your#9 is “AND e.categoryid IN (250,100,50)” or “AND e.categoryid IN (big long nasty list)” it will still only evaluate true for 250, 100, and 50.

I hope you can make sense of that.

In summary, lets try:
[] change your#8 to use table.column notation.
] I still don’t like using “dateExpires-now() > 0” instead of “dateExpires>now()”, for 2 reasons. First, it is doing arithmetic and a compare, instead of doing just a compare. Next, I’m not sure that the arithmetic is valid (Set variable @n to the datetime value of exactly one day in the future via “set @n=addtime(now(),‘1 0:0:0’);”, then compare these: “select @n-now(), unix_timestamp(@n)-unix_timestamp(now()), @n>now();”, then set @n to now via “set @n=now();” and try that comparison again.) unless the values are encased in unix_timestamp() functions, which is expensive. (compare these: “select benchmark(1000000, @n>now());” and “select benchmark(1000000, unix_timestamp(@n)-unix_timestamp(now()));”) Benchmark is your friend!
[] remove your#5, unless your#8 is actually resolving to"f.dateExpires"
] consider adding an index to dateExpires. Unless that table gets frequent inserts (which are slowed slightly by the need to update indexes), you should have an index since it is being compared in a where clause.
[*] change your#9 to: AND e.categoryid IN (3,250,116,249,123,122,121,120,115,134,133,132,131,130,129,1 28,127,126,125,124,114,119,118,117,101,106,105,104,103,102,3 7,109,108,107,36,35,34,90,89,88,87,86,85,84,59,33,58,57,56,5 5,32,100,99,98,97,96,95,94,93,92,62,61,60,31,135,91,83,82,81 ,80,79,78,77,76,75,74,73,72,71,70,69,68,67,66,65,64,63,30,54 ,53,52,51,50,49,48,47,29,253,252,251,248,110,46,45,44,43,42, 41,40,39,38) Do the happy dance. I suspect it will be warranted.

Thanks so much for your help. I tried performing the serveral steps you suggested and it works great now. Only thing I’m thinking is when records get > 1M records, to do an aggregate will become expensive. The problem is I can’t “pre-cache” them because the count may be different for various combination of searches. Any suggestions?

How often do you need to run this aggregation? Is it ran as a (cron) batch, called via a web page, or other?

If it is called regularly be hits to a high traffic web app, you may be in trouble. If this is the case, you will need to create something like a one column, one row table to hold that “count” value. Then, modify the apps that are changing data in the related tables so that the count value is kept up to date. That will be a lot of fun.

It is ran as a web service. Pre-caching the data via a count value in a column probably won’t work because each search result will bring back a different “aggregate”. The catch 22 is doing the count -per-search for millions of records will be too slow.

[B]tnguyen wrote on Thu, 12 July 2007 10:46[/B]
each search result will bring back a different "aggregate"

So, from what I know so far I am assuming that each “search” will vary the values used for the categoryid in the collection_categories table. So, my suggestion is to add a column for “count per categoryid”. Then when you search for a bunch of categories, do it like this:

SELECT SUM(count) FROM collection_categories WHERE e.categoryid=3 OR e.categoryid=250 OR e.categoryid=116 OR e.categoryid=249 OR e.categoryid=123;# Notice that since count is a keyword, I must enclose it in backticks# to use it as a column name. You could avoid that by naming it# something like “cnt”.

You can then keep the count column up to date by ethier of these schemes:
[*] Any time a change is made to one of the related tables, update the count column. Use a cron to update the count column every hour (or however often you choose), and make it clear to the users that the data is up to date as of N-o’clock.

I favor the second option. The first option will involve triggers and stored procedures. I’m not sure you are ready for that.

Good Luck, be sure to post back.

Thanks Richard,
Actually, there are multiple columns that may vary in a search including 1) item description 2) category 3) price 4) expiration date…similar to ebay…creating a count for category is intuitive, but the remaining attributes can be tricky?

Have you run an explain on this query?

It’s been my experience that using non-deterministic functions like NOW(), RAND() are slow and do not scale at all.

Also, logical ORs are problematic. I know that I’m supposed to be able to use them now in 5.0.x but when I run an explain the UNION version is always faster. We’re sticking to our UNIONs for now. If you need an undeterministic number of logical ORs, then perhaps a summary table is really the way to go. Jay Pipes has some interesting & almost always helpful suggestions – esp about UNIONs and moving sub-queries to JOINs. Explain is your friend.

I agree that it makes for easier development to write out your queries neatly but explain is a must do second. why not post your explain here?