How to speed up a simple query?

Hello,

We have the following 2 tables (and indexes) in our database:

Table 1

CREATE TABLE linkentityarticle(id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, article_id INT UNSIGNED NOT NULL, entity_id INT UNSIGNED NOT NULL, score TINYINT UNSIGNED NOT NULL) Engine = InnoDB;

Indexes

  1. entity_id
  2. entity_id + article_id
  3. article_id + entity_id
  4. score

Total # rows in table: 10 million
Total space used by table: 1.5 GB

Table 2

CREATE TABLE entity(id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, node VARCHAR(255)) Engine = InnoDB;

Indexes

  1. node

Total # rows in table: 600,000
Total space used by table: 88 MB

When I run the following query against these tables, it runs for over 30 seconds:

SELECT DISTINCT article_id

FROM linkentityarticle d
INNER JOIN
(
SELECT id FROM entity
WHERE node IN(‘sports’)

) AS T1
ON T1.id = d.entity_id
ORDER BY d.article_id DESC
LIMIT 50;

If I remove ORDER BY d.article_id DESC, it takes 0.3 seconds (i.e. 100 fold improvement).

Below is the EXPLAIN PLAN WITH ORDER BY

id: 1
select_type: PRIMARY
table:
type: system
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: d
type: ref
possible_keys: linkentityarticle_entity_id,index_linkentityarticle_entity_i d_article_id
key: linkentityarticle_entity_id
key_len: 4
ref: const
rows: 6708
Extra: Using where
*************************** 3. row ***************************
id: 2
select_type: DERIVED
table: drift_entity
type: const
possible_keys: unique_entity_node
key: unique_entity_node
key_len: 257
ref:
rows: 1
Extra: Using index

Below is the EXPLAIN WITHOUT ORDER BY:

id: 1
select_type: PRIMARY
table:
type: system
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
Extra: Using temporary
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: d
type: ref
possible_keys: linkentityarticle_entity_id,index_linkentityarticle_entity_i d_article_id
key: linkentityarticle_entity_id
key_len: 4
ref: const
rows: 6708
Extra:
*************************** 3. row ***************************
id: 2
select_type: DERIVED
table: drift_entity
type: const
possible_keys: unique_entity_node
key: unique_entity_node
key_len: 257
ref:
rows: 1
Extra: Using index

We are using MySQL version 5.0.44 and our InnoDB Buffer Pool is set to 2GB.

Is there anything that can be done to speed up this query when ORDER BY is used (as we need to use it to get the latest data).

Thank you in advance!

If you omit the sort and limit, how many rows are in the result set? It should be quite a few if sorting takes that long. You should also be able to re-write that query to remove the inner select in favor of an inner join, though I’m not sure if that will help any, but you should get a different query plan out of EXPLAIN.

Before doing anything else, seriously consider switching that BIGINT to INT. It will shrink the size of the table and all the indexes considerably, and make everything you do with that table faster.

LIMIT makes your second query fast. Your first index is redundant since it is contained in your second index. I would change the lay-out of your first table to make use of a clustered index:

Table 1

CREATE TABLE linkentityarticle(
article_id INT UNSIGNED NOT NULL,
entity_id INT UNSIGNED NOT NULL,
score TINYINT UNSIGNED NOT NULL)
Engine = InnoDB;

Indexes

  1. UNIQUE article_id+entity_id
  2. entity_id + article_id
  3. score

Table 2

CREATE TABLE entity(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
node VARCHAR(255)
) Engine = InnoDB;

Indexes

  1. node

SELECT DISTINCT article_id
FROM linkentityarticle d
WHERE id IN(SELECT id FROM entity e WHERE node IN(‘sports’) AND e.id = d.entity_id)
ORDER BY d.article_id DESC
LIMIT 50;

To check whether this helps, try this query first:

SELECT DISTINCT article_id
FROM linkentityarticle d FORCE INDEX(name of article_id + entity_id index)
WHERE id IN(SELECT id FROM entity e WHERE node IN(‘sports’) AND e.id = d.entity_id)
ORDER BY d.article_id DESC
LIMIT 50;

vgatto,

  • Ommitting LIMIT I get 94383 rows as a result
  • Rewriting the query to use the INNER JOIN I do get a slightly different explain plan, but the performance of the query did not improve
  • I don’t prefer to change BIGINT to INT, as we might run out of integer space. We had it as INT in another system, and we ran out of space in ~ 2 years, and had to change to BIGINT. So although we could keep it as INT for some time, I prefer to come up with some permanent solution to this query optimization (i.e. not something which would only last until we run out of INT space)

gmouse,

  • You are right. First index was redundant. I dropped it, but it did not help the speed of the query

  • I cannot change the layout of the table, as there are many stored procedures which use the id field we have in this table

It seems to me that filesort is slowing down the query. Is there anything that can be done to make this sorting faster?

Thanks a lot!

What about

SELECT DISTINCT article_id
FROM linkentityarticle d FORCE INDEX(name of article_id + entity_id index)
WHERE id IN(SELECT id FROM entity e WHERE node IN(‘sports’) AND e.id = d.entity_id)
ORDER BY d.article_id DESC
LIMIT 50;

+explain

gmouse,

I tried the suggestion, and performance got worse. It now ran for 1 min and 12 seconds. Rows field shows that it searched through the whole table now [Interestingly, it did not use filesort now]. Below is the explain:

*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: d
type: index
possible_keys: NULL
key: unique_linkentityarticle_article_id_entity_id
key_len: 8
ref: NULL
rows: 9913948
Extra: Using where; Using index
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: e
type: const
possible_keys: PRIMARY,unique_entity_node
key: unique_entity_node
key_len: 257
ref: const
rows: 1
Extra: Using index

I see. Actually it stops after 50 rows are found but appearantly this can take a very long time for seldomly used node names. The only way to improve running time is by having an index on (node,article_id), but then you must have a table that contains both these fields.

gmouse,

Thank you for the suggestions.

I am currently creating such a table - just to test things out - if results are good I’ll consider implementing it in production.