How to cache part of a query

I am new to MySql and I face some… unusual issues, so any help would be appreciated.

The tables are:

  • articles (article_id, title, body, lastupdate, …)
  • clusters (cluster_id, article_id)
  • clustercats (cluster_id, category, subcategory, …)

…with obvious content.

The clusters and articles tables have a few million records and the clustercats about one million.
There are indexes for fulltext(title,body), category and all IDs.
The indexes just fit in memory for the time being but, obviously, sooner or later they won’t.

Here is a (simplified) query:

SELECT c.cluster_id FROM clusters c
INNER JOIN articles a ON a.article_id=c.article_id
INNER JOIN clustercats t ON c.cluster_id = t.cluster_id
WHERE a.lastupdate BETWEEN ‘2008-12-28 00:00:00’ AND ‘2008-12-28 00:30:00’
AND t.category = ‘TEST1’
AND MATCH(title, body) AGAINST(’ +TEST2’ IN BOOLEAN MODE)
GROUP BY c.cluster_id
ORDER BY IFNULL(rank,10000), MAX(a.lastupdate) DESC
LIMIT 0,200

This query runs about 500 (and increasing) times every half an hour, with different parameters for TEST1 and TEST2.

BUT the range of lastupdates is always BETWEEN the last run time AND current time, so it should only be used the first time the query runs and pick it from cache for the other 499 times.

In other words, a query like “SELECT c.cluster_id from clusters c, articles a where a.lastupdate BETWEEN …”, would return less than 5000 entries, that is the cluster_IDs that have been updated or inserted during the last 30 mins.

The question is: How can I retrieve from cache the cluster_IDs that belong in the lastupdate range?
Or, is there any other way to gain some speed?

I tried to use a subquery like “WHERE c.cluster_id IN (SELECT based on lastupdate range)”, but the subquery does not get cached independently - instead the performance goes worse.

Also tried to create a stored procedure to return a cluster_id range, but it seems that I can’t place it in a nested query:
SELECT c.cluster_id FROM clusters c
INNER JOIN articles a ON a.article_id=c.article_id
INNER JOIN clustercats t ON c.cluster_id = t.cluster_id
WHERE c.cluster_id IN (TimedClusters(‘2008-12-28 00:00:00’,‘2008-12-28 00:30:00’))
AND t.category = ‘TEST1’
AND MATCH(title, body) AGAINST(’ +TEST2’ IN BOOLEAN MODE)
GROUP BY c.cluster_id
ORDER BY IFNULL(rank,10000), MAX(a.lastupdate) DESC
LIMIT 0,200

Last, I tried to have the whole thing in a procedure, but there is no speed gain - it seems that the range query does not get cached, even when I used a temporary table to store the “last 30 minutes clusters”.

I still have the option to run two queries from the application layer (first SELECT the cluster_IDs in range and then run the queries on that range) but it doesn’t reasonable to have 5000 cluster_IDs in a query.

TIA!

It seems like using a Memory table might help. If I am understanding correctly, you could populate the Memory table every half hour using something like this:

SELECT c.cluster_id, a.lastupdate FROM clusters c
INNER JOIN articles a ON a.article_id=c.article_id
WHERE a.lastupdate BETWEEN ‘2008-12-28 00:00:00’ AND ‘2008-12-28 00:30:00’

Then you would only have those results in the Memory table, and could run your search against that table instead. Something like this:

SELECT m.cluster_id FROM memory m
INNER JOIN clustercats t ON m.cluster_id = t.cluster_id
WHERE t.category = ‘TEST1’
AND MATCH(title, body) AGAINST(’ +TEST2’ IN BOOLEAN MODE)
GROUP BY m.cluster_id
ORDER BY IFNULL(rank,10000), MAX(m.lastupdate) DESC
LIMIT 0,200

One other thing to try is an EXPLAIN on your queries. That will some insight on how MySQL is using (or not using) the indexes on your table.

Hopefully this helps!

[B]dsuehrin wrote on Mon, 05 January 2009 15:00[/B]
It seems like using a Memory table might help. If I am understanding correctly, you could populate the Memory table every half hour using something like this: [...]

I was not aware of Memory tables - it seems to be what I was looking for.
Time for RTFM )

Thank you!