Obtaining PK of Aggregate (MAX/MIN) Function Value

Alright, here’s my problem:

I have some (very) large tables (tens of millions of rows) that contain statistical data. I am trying to create aggregate tables to allow us to provide quick responses on a web application.

The structure of the data table is basically:

userId
entryId
statA
statB
statC

Where the PK is (userId, entryId).
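In rough DDL terms, it's something like this (the column types here are just placeholders, and the real table has around 40 stat columns rather than three):

CREATE TABLE stats (
    userId  INT UNSIGNED NOT NULL,  -- type is a placeholder
    entryId INT UNSIGNED NOT NULL,  -- type is a placeholder
    statA   INT NOT NULL,
    statB   INT NOT NULL,
    statC   INT NOT NULL,
    PRIMARY KEY (userId, entryId)
) ENGINE=InnoDB;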

The structure of the aggregates table is basically:

userId
entries
sumStatA
maxStatA
maxStatAEntryId
minStatA
minStatAEntryId

… and so on for the other statistics (there are about 40 total stats, so 200 columns in the aggregates table, give or take).
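Again in rough DDL terms, with only the statA columns shown and the table name and types purely illustrative:

CREATE TABLE statsAggregates (
    userId          INT UNSIGNED NOT NULL,
    entries         INT UNSIGNED NOT NULL,
    sumStatA        BIGINT NOT NULL,
    maxStatA        INT NOT NULL,
    maxStatAEntryId INT UNSIGNED NOT NULL,
    minStatA        INT NOT NULL,
    minStatAEntryId INT UNSIGNED NOT NULL,
    -- ... the same five columns repeated for each of the other stats ...
    PRIMARY KEY (userId)
) ENGINE=InnoDB;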

Filling in the values for sumStatA, maxStatA, and minStatA is trivial and I can do it very quickly.

I am perplexed about the maxStatAEntryId, though. I’ve tried a number of solutions and finally came up with a subquery that worked:

SELECT MAX(q.entryId)
FROM stats q
WHERE q.userId = p.userId
  AND q.statA =
  (
    SELECT MAX(statA)
    FROM stats r
    WHERE r.userId = q.userId
    LIMIT 0, 1
  )
LIMIT 0, 1

… where p, q, and r are aliases of the stats table (p is the one that I am selecting from in my main query).

That gives me the most recent entry with the maximum statA, which is what I want.

The problem is that when I have this subquery in my query some 100 times (because it’s in there twice for every stat), it takes forever to run.
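To give a sense of its shape, a stripped-down sketch of the aggregate-building query is below (the INSERT … SELECT framing and the statsAggregates name are just for illustration; the real query repeats the same pattern for every stat):

INSERT INTO statsAggregates (userId, entries, sumStatA, maxStatA, maxStatAEntryId)
SELECT
    p.userId,
    COUNT(*),
    SUM(p.statA),
    MAX(p.statA),
    (
        -- the troublesome correlated subquery, repeated for each stat
        SELECT MAX(q.entryId)
        FROM stats q
        WHERE q.userId = p.userId
          AND q.statA = (SELECT MAX(r.statA) FROM stats r WHERE r.userId = q.userId)
    )
FROM stats p
GROUP BY p.userId;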

I’m puzzled because I can perform a

SELECT * FROM stats WHERE userId = [whatever]

very quickly (not even one second; it’s InnoDB and I’m using the clustered indexing to my advantage).

I tried this same query on a table with 30 rows in it and it still takes over 10 minutes.

Does anyone have any suggestions on a better way to do this? I appreciate any help. I’m rarely stumped, but I’ve been working on this problem for quite some time (months… on and off).

Worst-case scenario, I'll just pull the data into PHP and process it outside of the database server, but that isn't really an elegant way to do this, and it prevents me from implementing several triggers that I need (which are not directly related to this).

I can provide other information… the whole query or execution plans, but both are rather large (hundreds of lines).

If I read your query correctly, you want the highest entryId for the highest statA for a given userId.

I would suggest this:

( SELECT a.entryId FROM stats a WHERE a.userId = p.userId ORDER BY a.statA DESC, a.entryId DESC LIMIT 1 )

But make sure that you have an index like:

ALTER TABLE stats ADD INDEX stats_ix_userid_stata_entryid (userId, statA, entryId);

But beware of the potentially long creation time for this index, though I guess you have noticed that before, right? :wink:

The reason this works is that the index begins with the column in the WHERE clause and the two columns that follow are the ones in the ORDER BY, so the index can be used for both the WHERE and the ORDER BY. And since the column you are selecting is also part of the index, the DBMS doesn't even have to touch the table itself, which saves you an extra seek.
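For statA that would give you a pair of lookups along these lines (the min side is my guess, mirroring the max):

-- most recent entry holding the maximum statA for this user
( SELECT a.entryId
  FROM stats a
  WHERE a.userId = p.userId
  ORDER BY a.statA DESC, a.entryId DESC
  LIMIT 1 )

-- minimum side; entryId ASC gives the earliest such entry, use DESC if you
-- want the most recent one (a mixed-direction ORDER BY may fall back to a
-- filesort, but it only has to sort that one user's rows)
( SELECT a.entryId
  FROM stats a
  WHERE a.userId = p.userId
  ORDER BY a.statA ASC, a.entryId ASC
  LIMIT 1 )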

You read my query correctly, and your query is at least twice as efficient. I think I had a query almost exactly like that a while ago, but due to bugs in other portions of the query I had changed it… once again, it shows that having another set of eyes look at a problem can help find the obviously simple solution.

As for the index, we're going commando for this query. There are over 60 fields for which we are performing this operation, and 60 indexes on an InnoDB table with a nine-byte PK would kill our insertion rate and take up too much disk space. It's actually why we chose InnoDB in the first place: to take advantage of the clustered index and ensure we can get rapid access to SELECT * FROM stats WHERE userId = x for this sort of query (and others…).

I’ll see what kind of an improvement I can get with your suggestion and I’ll be sure to post back here with the final solution we go with.

Thanks a lot.

James

OK, without seeing the whole query and your data, I buy parts of that.

But if this query contains even a few subqueries like the one above, you should consider an index to satisfy each subquery.
That's because these subqueries are evaluated and searched once for each row of the main table.
So if, for example, the main table of the query has 100,000 matching rows and each subquery matches 1,000 rows, then not having an index for the query above means the server has to sort 100,000,000 rows because of that subquery.
That is a lot of sorting compared to 100,000 index lookups.

But it's your trade-off between insert and select performance, and I don't know the query patterns your application has.

Good luck!