Here’s the situation that has befuddled the optimizer and therefore produced slowly running queries:
Table Orgs
id INT UNSIGNED NOT NULL AUTOINCREMENT,
country CHAR(3)
state CHAR (2)
province VARCHAR(50)
…
INDEX indx_country (country),
INDEX indx_state (state),
INDEX indx_province (province)
PRIMARY (id)
Table users
id INT UNSIGNED NOT NULL AUTOINCREMENT
. . .
PRIMARY (id)
Table org_managers
user_id INT UNSIGNED NOT NULL,
org_id INT UNSIGNED NOT NULL,
PRIMARY(user_id, org_id)
INDEX fk_orgs (org_id) //Necessary to define a FOREIGN KEY
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
FOREIGN KEY (org_id) REFERENCES orgs(id) ON DELETE CASCADE
Table org_participations
id INT UNSIGNED NOT NULL,
user_id INT UNSIGNED NOT NULL,
org_id INT UNSIGNED NOT NULL,
hours FLOAT,
. . .
PRIMARY KEY (id),
UNIQUE (org_id, user_id),
INDEX fk_org_user (user_id),
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
Pretty basic stuff
Now, for a query that is essentially:
SELECT org_participations.user_id, COUNT(org_particpations.user_id), SUM(hours)
FROM org_participations INNER JOIN orgs ON org_participations.org_id = orgs.id INNER JOIN org_managers ON org_managers.org_id = orgs.id
WHERE orgs.country = ? AND org_managers.user_id = ? GROUP BY org_participations.user_id
Here’s the distribution of orgs.country in the orgs table
NULL 32K
USA 23K
MEX 4
CAN 2
I’ll run EXPLAIN with two different managers:
X and Y. Here’s the distribution of org_managers.user_id:
X: 198
Y: 720
Now when I do EXPLAIN on the query providing a country = ‘USA’ and user X, it’s beautiful
table: org_managers
key: user_id
key_len: 4
ref: const
rows: 198
table: orgs
key: PRIMARY
key_len: 4
ref: org_managers.org_id
rows: 1
table: org_participations
key: org_id
key_len: 4
ref: org_managers.org_id
Needless to say, this is lightening fast
Now I run EXPLAIN with user Y (the only significant difference is that Y manages 720 organizations, X 180)
table: orgs
key: index_country
key_len: 10
ref: const
rows: 482
table: org_managers
key: org_id
key_len: 4
ref: orgs.id
rows: 1
table: org_participations
key: org_id
key_len: 4
ref: org_managers.org_id
This runs 15x slower than the previous query.
Well, the problem is obviously due to the badly estimated rows based on the country index, the actual value is 23K, not 482.
I did ANALYZE TABLES and then SHOW INDEX FROM orgs
and it’s good:
key_name: index_country
column: country
cardinality: 2
and low cardinality generally means “don’t use this index”
So really 2 questions arise from this (which are opposite sides of the coin).
- How did the optimizer get the rows estimate so wrong in the second case? and where does 482 come from?
- How did the optimizer get the rows estimate exactly right in the first case?, surely it doesn’t actually execute a SELECT COUNT(*) FROM org_managers WHERE user_id = X, but otherwise how could it know that row count exactly (I realize that EXPLAIN isn’t a perfect window into the optimizer’s soul, but obviously it did something like 198 < 482, so use org_managers, and 750 > 482 so use country index on orgs!!!)
I know all the obvious fixes (IGNORE INDEX), etc. but my application is extremely dynamic, essentially users can choose the query criteria,
different accounts have very different distributions, etc. and I really don’t want to get in the situation of effectively writing a pseudo query optimizer in application code (that redefines nightmare)
My first guess is this obviously has to do with the very “lumpy” distribution of country, and/or the high # of NULLS in that distribution.
I’m assuming that you’ve seen this kind of thing before (as I have in other queries), and I thought you might be able to provide insight into what’s going on, and specifically, what I can look out for in terms of this kind of optimizer failure. Again, given the essentially infinite permutations of queries that the application might actually execute, it’s not feasible to run EXPLAIN on all of them. And second, I was wondering if there’s some API for lack of a better term (OK, short of the source code )), in which I could help the optimizer, not at the SELECT level, but maybe at the INDEX level, something like “beware lumpy distribution”
Look forward as always to your answers