I run a social network. Most of our features follow a similar pattern: a user → their friends → something the friends did ordered by timestamp and limited.
An example query:
SELECT pictures.* FROM pictures
INNER JOIN friends ON friends.friend_id = pictures.user_id
WHERE friends.user_id = ?
ORDER BY pictures.created_at DESC
LIMIT 20
Obviously, this sort of query is always going to filesort (and possibly create a temp table), no matter how it’s indexed (provided there’s more than one expected value for friends.user_id).
I’m no expert in MySQL performance optimization. But, I have read the book and followed mysqlperformanceblog and all that for years. Yet, I find myself stumped on optimizing a query like this.
Since your condition is on the friends table which is the second table in the JOIN. You should for clarity move the condition to the ON clause.
So the query should look something like this:
SELECT pictures.*FROM picturesINNER JOIN friends ON friends.friend_id = pictures.user_id AND friends.user_id = ?ORDER BY pictures.created_at DESCLIMIT 20
This way you can much easier see the columns involved and there is no doubt about how the optimizer can work.
And since it is an INNER JOIN the optimizer can choose the join order between the two tables.
Now since you have the condition “friends.user_id=?” and the sort order in two different tables, the only question is if it’s faster:
To start with the records from the friends table, join the pictures table and then filesort the data later.
OR
to start by retrieve the records from the pictures table in ORDER JOIN’ing them with the friends table and then apply the condition after the join.
For 1 we need a compound index on friends(user_id, friend_id) to be fastest, with that index the condition is solved on the first column in the index and then the second column contains the data needed for the JOIN, saving us an extra read from the table.
While for 2 we want to read the rows in order so in that case we need a index on the pictures(created_at) but since we are also selecting all of the columns from the pictures table we can just as well read the entire row from the table.
So the index can just be that single column and then we read all the rest of the columns from the table.
And for this scenario 2 we should optimally want the reverse index on friends( friend_id, user_id) since now the join comes before the condition evaluation.
So as you can see, depending on your data one or the other join order can be most effective and they require different indexes.
Also I can mention that if your condition limits the amount of rows a lot, a filesort is not a problem to solve pretty quickly with todays fast CPU’s.
But on the other hand if the condition matches on a lot of rows and they all have to be sorted before you LIMIT away 99% of them, then starting according to the ORDER BY can be more advantageous since after a ORDER BY has returned LIMIT amounts of rows the query stops completely and hence avoiding the other 99% of the rows.