Alright, here’s my problem:
I have some (very) large tables (tens of millions of rows) that contain statistical data. I am trying to create aggregate tables to allow us to provide quick responses on a web application.
The structure of the data table is basically:
userId
entryId
statA
statB
statC
Where the PK is (userId, entryId).
The structure of the aggregates table is basically:
userId
entries
sumStatA
maxStatA
maxStatAEntryId
minStatA
minStatAEntryId
… and so on for the other statistics (there are about 40 total stats, so 200 columns in the aggregates table, give or take).
Filling in the values for sumStatA, maxStatA, and minStatA is trivial and I can do it very quickly.
I am perplexed about the maxStatAEntryId, though. I’ve tried a number of solutions and finally came up with a subquery that worked:
SELECT MAX(q.entryId)
FROM stats q
WHERE q.userId = p.userId
AND q.statA =
(
SELECT MAX(statA)
FROM stats r
WHERE r.userId = q.userId
LIMIT 0, 1
)
LIMIT 0, 1
… where p, q, and r are instances of the stats table (p is the one that I am selecting from in my main query.
That will give me the most recent entry in the stats table, which is what I want.
The problem is that when I have this subquery in my query some 100 times (because it’s in there twice for every stat), it takes forever to run.
I’m puzzled because I can perform a
SELECT * FROM stats WHERE userId = [whatever]
very quickly (not even one second; it’s InnoDB and I’m using the clustered indexing to my advantage).
I tried this same query on a table with 30 rows in it and it still takes over 10 minutes.
Does anyone have any suggestions on a better way to do this? I appreciate any help. I’m rarely stumped, but I’ve been working on this problem for quite some time (months… on and off).
Worst case scenario I’ll just pull the data into PHP and process it outside of the database server, but that isn’t really an elegant way to do this, and it prevents me from implementing several triggers that I need to implement (that are not directly related to this).
I can provide other information… the whole query or execution plans, but both are rather large (hundreds of lines).