Optimizer Woes

Here’s the situation that has befuddled the optimizer and therefore produced slowly running queries:

Table Orgs
id INT UNSIGNED NOT NULL AUTOINCREMENT,
country CHAR(3)
state CHAR (2)
province VARCHAR(50)

INDEX indx_country (country),
INDEX indx_state (state),
INDEX indx_province (province)
PRIMARY (id)

Table users
id INT UNSIGNED NOT NULL AUTOINCREMENT
. . .
PRIMARY (id)

Table org_managers
user_id INT UNSIGNED NOT NULL,
org_id INT UNSIGNED NOT NULL,
PRIMARY(user_id, org_id)
INDEX fk_orgs (org_id) //Necessary to define a FOREIGN KEY
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
FOREIGN KEY (org_id) REFERENCES orgs(id) ON DELETE CASCADE

Table org_participations
id INT UNSIGNED NOT NULL,
user_id INT UNSIGNED NOT NULL,
org_id INT UNSIGNED NOT NULL,
hours FLOAT,
. . .
PRIMARY KEY (id),
UNIQUE (org_id, user_id),
INDEX fk_org_user (user_id),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,

Pretty basic stuff

Now, for a query that is essentially:
SELECT org_participations.user_id, COUNT(org_particpations.user_id), SUM(hours)
FROM org_participations INNER JOIN orgs ON org_participations.org_id = orgs.id INNER JOIN org_managers ON org_managers.org_id = orgs.id
WHERE orgs.country = ? AND org_managers.user_id = ? GROUP BY org_participations.user_id

Here’s the distribution of orgs.country in the orgs table
NULL 32K
USA 23K
MEX 4
CAN 2

I’ll run EXPLAIN with two different managers:
X and Y. Here’s the distribution of org_managers.user_id:
X: 198
Y: 720

Now when I do EXPLAIN on the query providing a country = ‘USA’ and user X, it’s beautiful


table: org_managers
key: user_id
key_len: 4
ref: const
rows: 198


table: orgs
key: PRIMARY
key_len: 4
ref: org_managers.org_id
rows: 1


table: org_participations
key: org_id
key_len: 4
ref: org_managers.org_id

Needless to say, this is lightening fast

Now I run EXPLAIN with user Y (the only significant difference is that Y manages 720 organizations, X 180)


table: orgs
key: index_country
key_len: 10
ref: const
rows: 482


table: org_managers
key: org_id
key_len: 4
ref: orgs.id
rows: 1


table: org_participations
key: org_id
key_len: 4
ref: org_managers.org_id

This runs 15x slower than the previous query.
Well, the problem is obviously due to the badly estimated rows based on the country index, the actual value is 23K, not 482.
I did ANALYZE TABLES and then SHOW INDEX FROM orgs
and it’s good:
key_name: index_country
column: country
cardinality: 2

and low cardinality generally means “don’t use this index”

So really 2 questions arise from this (which are opposite sides of the coin).

  1. How did the optimizer get the rows estimate so wrong in the second case? and where does 482 come from?
  2. How did the optimizer get the rows estimate exactly right in the first case?, surely it doesn’t actually execute a SELECT COUNT(*) FROM org_managers WHERE user_id = X, but otherwise how could it know that row count exactly (I realize that EXPLAIN isn’t a perfect window into the optimizer’s soul, but obviously it did something like 198 < 482, so use org_managers, and 750 > 482 so use country index on orgs!!!)

I know all the obvious fixes (IGNORE INDEX), etc. but my application is extremely dynamic, essentially users can choose the query criteria,
different accounts have very different distributions, etc. and I really don’t want to get in the situation of effectively writing a pseudo query optimizer in application code (that redefines nightmare)

My first guess is this obviously has to do with the very “lumpy” distribution of country, and/or the high # of NULLS in that distribution.

I’m assuming that you’ve seen this kind of thing before (as I have in other queries), and I thought you might be able to provide insight into what’s going on, and specifically, what I can look out for in terms of this kind of optimizer failure. Again, given the essentially infinite permutations of queries that the application might actually execute, it’s not feasible to run EXPLAIN on all of them. And second, I was wondering if there’s some API for lack of a better term (OK, short of the source code )), in which I could help the optimizer, not at the SELECT level, but maybe at the INDEX level, something like “beware lumpy distribution”

Look forward as always to your answers

Hi,

First I should mention MySQL Optimizer can only base its decisions on the data it has, and it does not have too much available for this case - only cardinality. For skewed distributions this can provide very suboptimal plan.

Note cardinality is only used if constant is dynamic - for “A=5” and similar clauses MySQL will use index lookup to estimate number of rows.

Now regarding nulls - there is an option which helps you to adjust how statistics is computed regarding NULLS:

check myisam_stats_method variable:

http://dev.mysql.com/doc/refman/5.0/en/myisam-index-statisti cs.html

[B]Peter wrote on Thu, 24 August 2006 02:46[/B]
Hi,

First I should mention MySQL Optimizer can only base its decisions on the data it has, and it does not have too much available for this case - only cardinality. For skewed distributions this can provide very suboptimal plan.

Note cardinality is only used if constant is dynamic - for “A=5” and similar clauses

I'm not sure what [I]constant is dynamic[/I] means, do you mean it only uses cardinality where join type is ref or range on tables other than the driving table?
[B]Quote:[/B]

MySQL will use index lookup to estimate number of rows.

Well, if it does an effective COUNT(*) where some_key = ? (I assume that's what [I]uses index lookup [/I]means in this case), that would explain why it computes the first query cost exactly right ([I]COUNT(*) where org_managers.user_id = X == 198[/I]), but it doesn't explain why [I]COUNT(*) where orgs.country = 'USA' [/I] is calculated as 482 when it's really 23 000!!
[B]Quote:[/B]

Now regarding nulls - there is an option which helps you to adjust how statistics is computed regarding NULLS:

check myisam_stats_method variable:

http://dev.mysql.com/doc/refman/5.0/en/myisam-index-statisti cs.html

Unfortunately, these are InnoDB tables (they have foreign keys defined)