I’m trying to optimize a query for my forums. The query is related to “my posts” functionality where a user can see a list of all the threads they’ve posted in, ordered by most recent activity. Here is a stripped down version of the query:
mysql> EXPLAIN SELECT SQL_NO_CACHE SQL_CALC_FOUND_ROWS DISTINCT Main.B_Number → FROM BB_Posts AS Main, BB_Posts AS Reply → WHERE Main.B_Number = Reply.B_Main → AND Reply.B_PosterId = 654 → AND Reply.B_Board IN (‘1’, ‘2’, ‘3’) → ORDER BY Main.B_Last_Post DESC;±—±------------±------±-------±--------------±--------±--------±------------------------±-----±---------------------------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±------±-------±--------------±--------±--------±------------------------±-----±---------------------------------------------+| 1 | SIMPLE | Reply | ref | ID_ndx | ID_ndx | 4 | const | 7872 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | Main | eq_ref | PRIMARY | PRIMARY | 4 | WWWThreads.Reply.B_Main | 1 | | ±—±------------±------±-------±--------------±--------±--------±------------------------±-----±---------------------------------------------+
Here’s a dump of a small subset of the table: example.sql
(P.S. I know the structure is kind of ugly eg. all the enums and some of the indexes, it’s due to attempts at backwards-compatible optimization.)
I’ll pay $100 via Paypal if someone can suggest an alternate query or an index which gets rid of the temporary + filesort without affecting the results.
Thanks in advance for any help, this is driving me crazy. confused:
SELECT Main.B_NumberFROM BB_Posts AS MainWHERE Main.B_Number IN ( SELECT Reply.Number FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN (‘1’, ‘2’, ‘3’))ORDER BY Main.B_Last_Post DESC;
Thank you so much for your prompt response! I really appreciate your taking the time to look at this for me. ) I actually did try a similar approach using subqueries… although it eliminates the temporary table, it must perform a filesort on the entire posts table which unfortunately is significantly slower. For reference, the original query usually takes between 2 - 4 seconds. The new query hangs on “preparing” for 30+ seconds and I have to kill it before it completes.
±—±-------------------±------±----------------±--------------±--------±--------±-----±--------±----------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±-------------------±------±----------------±--------------±--------±--------±-----±--------±----------------------------+| 1 | PRIMARY | Main | ALL | NULL | NULL | NULL | NULL | 5079927 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | Reply | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where | ±—±-------------------±------±----------------±--------------±--------±--------±-----±--------±----------------------------+
I messed around with different index setups, but even using FORCE INDEX the main table always examines all 5 million rows. I may be missing something, since I don’t have a lot of experience with subqueries (we used mySQL 3 until recently). However, at least as written, this approach seems too slow to be practical.
Well, my query was wrong anyway, it should have been:
SELECT Main.B_NumberFROM BB_Posts AS MainWHERE Main.B_Number IN ( SELECT DISTINCT Reply.B_Main FROM BB_Posts AS Reply WHERE Reply.B_PosterId = 654 AND Reply.B_Board IN (‘1’, ‘2’, ‘3’))ORDER BY Main.B_Last_Post DESC;
I don’t think that will help with your filesort, though.
alter table BB_Posts add index new_index3 (B_Last_Post, B_PosterId, B_Board);
Now slightly modify your query:
SELECT BB_Posts.B_Number FROM BB_Posts FORCE INDEX (new_index3) WHERE BB_Posts.B_Last_Post = 1 and BB_Posts.B_PosterId = 654 and BB_Posts.B_Board in (1,2,3) ORDER BY BB_Posts.B_Last_Post desc;
Note the ‘FORCE INDEX’
Now explain:
±—±------------±---------±------±--------------±-----------±--------±-----±-----±------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |±—±------------±---------±------±--------------±-----------±--------±-----±-----±------------+| 1 | SIMPLE | BB_Posts | range | new_index3 | new_index3 | 9 | NULL | 3 | Using where |±—±------------±---------±------±--------------±-----------±--------±-----±-----±------------+
Thanks so much guys for your continued assistance. Jon unfortunately your example isn’t exactly what I’m trying to do, but it’s my fault for not explaining it better. I’ve been working with this stupid BB for so long that I forgot people aren’t born with an inherent knowledge of its data structure.
The complication arises from the fact that the BB supports threaded discussion, and each branch will have different last post info. Please consider these two threads:
If I’m user #123 and I want to retrieve the main post # for threads I’ve posted in ordered by most recent activity, the correct order is:
B_Main = 2 (Last post = 1200000999)
B_Main = 1 (Last post = 1200000888)
However if I do something like “SELECT B_Main FROM Posts WHERE B_PosterId = 123 ORDER BY B_LastPost DESC”, I’ll end up with this:
B_Main = 1 (Last post = 1200000888)
B_Main = 2 (Last post = 1200000002)
(Note that in this example the main post ID order corresponds with the last post stamp order, but that can’t be assumed.)
Anyway that’s why my query joins the posts table on itself, so I can get the true last post stamp from the thread’s parent, not the last post stamp for whatever particular branch of the thread the user posted in. The alternative is to add a field like B_ThreadLastPost or something, which would contain the same value for each post in a thread: the stamp of the very last post made. However, unless I have no choice I hate to throw out a working query which is almost good enough, duplicate the same data unnecessarily, and deal with populating this field + modifying the forum to use it.
I hope maybe this is a little more clear?
Anyway I get the sinking feeling this may not be possible with the existing structure and I’ll have to add an additional field. But I hope I’m wrong and I’m still interested in any fresh insight into the problem. I’ll still pay $25 to you guys who are making a good faith effort to help, even if the solutions aren’t exactly what I need.
Hierarchical data is a bitch to query efficiently.
The problem is that you can only use an index to select the subset of records that you want to sort OR to perform the sort, not both. If there are a large number of records returned, you’re stuck with having to sort them all sans index.
I can’t accept payment for the little assistance I might have provided. I’ve been helped several times on these forums myself, so I guess I’m just trying to give back.