join / subquery / derived table performance

hi all,

i noticed some strange behaviour in the use of mysql. currently using 5.1.37-enterprise-gpl-advanced.

It started with the following query:

select d.Date, sum(r.FctA) a, sum(r.FctB) b
from rpo_fct_x x
join rpo_dim_date d on x.SID_DateId = d.SID_DateId
where d.Date > adddate(sysdate(), interval -2 day)
and d.Date < sysdate()
group by d.Date

explain:

±—±------------±------±-------±---------------------- --±--------±--------±--------------------------------±-- -------±--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
±—±------------±------±-------±---------------------- --±--------±--------±--------------------------------±-- -------±--------------------------------+
| 1 | SIMPLE | r | ALL | date_country,SID_DateId | NULL | NULL | NULL | 13860580 | Using temporary; Using filesort |
| 1 | SIMPLE | d | eq_ref | PRIMARY | PRIMARY | 4 | dwh_rep_olap.r.SID_DateId | 1 | Using where |
±—±------------±------±-------±---------------------- --±--------±--------±--------------------------------±-- -------±--------------------------------+
2 rows in set (0.00 sec)

mysql> show keys from rpo_fct_registrations;
±----------------------±-----------±--------------------- --------------------------±-------------±----------------- -------±----------±------------±---------±-------±----- ±-----------±--------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
±------±-----------±------------------------------------- ----------±-------------±------------------------±------- —±------------±---------±-------±-----±-----------±- -------+
| r | 1 | date_country | 1 | SID_DateId | A | 2047 | NULL | NULL | | BTREE | |
| r | 1 | date_country | 2 | SID_CountryId | A | 28876 | NULL | NULL | | BTREE | |
| r | 1 | SID_DateId | 1 | SID_DateId | A | 2047 | NULL | NULL | | BTREE | |
±----------------------±-----------±--------------------- --------------------------±-------------±----------------- -------±----------±------------±---------±-------±----- ±-----------±--------+
7 rows in set (0.00 sec)

I tried then to play around with the various parts of the query:

mysql> select count() from r where SID_DateId in (select SID_DateId from d where Date>adddate(sysdate() , interval -2 day) and Date < sysdate());
±---------+
| count(
) |
±---------+
| 13858 |
±---------+
1 row in set (1 min 24.29 sec)

far too long for this query…

±—±-------------------±---------±----------------±— -----------±-----------±--------±-----±---------±------ -------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
±—±-------------------±---------±----------------±— -----------±-----------±--------±-----±---------±------ -------------------+
| 1 | PRIMARY | r | index | NULL | SID_DateId | 4 | NULL | 13860580 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | d | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where |
±—±-------------------±---------±----------------±— -----------±-----------±--------±-----±---------±------ -------------------+

then i put the two values into the in clause which would be the result of the in:

mysql> select count() from r where SID_DateId in (2091,2090);
±---------+
| count(
) |
±---------+
| 13858 |
±---------+
1 row in set (0.00 sec)

±—±------------±-------±------±---------------------- --±-------------±--------±-----±------±---------------- ---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
±—±------------±-------±------±---------------------- --±-------------±--------±-----±------±---------------- ---------+
| 1 | SIMPLE | r | range | date_country,SID_DateId | date_country | 4 | NULL | 24306 | Using where; Using index |
±—±------------±-------±------±---------------------- --±-------------±--------±-----±------±---------------- ---------+

Why does the subquery takes that much time against the two values in the in clause?

i’ve rewritten then the query to use a derived table:

select count(*) from r
join (select SID_DateId from d where Date>adddate(sysdate(), interval -2 day ) and Date < sysdate()) d on d.SID_DateId = r.SID_DateId

±—±------------±-------------±-----±----------------- -------±-----------±--------±-------------±------±----- -------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
±—±------------±-------------±-----±----------------- -------±-----------±--------±-------------±------±----- -------+
| 1 | PRIMARY | | ALL | NULL | NULL | NULL | NULL | 2 | |
| 1 | PRIMARY | r | ref | date_country,SID_DateId | SID_DateId | 4 | d.SID_DateId | 6768 | Using index |
| 2 | DERIVED | d | ALL | NULL | NULL | NULL | NULL | 13150 | Using where |
±—±------------±-------------±-----±----------------- -------±-----------±--------±-------------±------±----- -------+

±---------+
| count(*) |
±---------+
| 13858 |
±---------+
1 row in set (0.04 sec)

Again an acceptable performance.

Does anyone has a tip what went wrong on the initial query and how to speed it up? Problem is I can’t work with much query changes, as the queries are build automatically by an reporting tool where the influence possibilities on the queries are very less… we’re using mysql in a data warehouse environement. Storage engine is MyISAM

Thanks
Torsten

I don’t see any evidence that you have an index on the Date column in your date dimensions table. If you had this index, and you had more than a few days in that table, your original query would have started by applying the restriction to the dimension table, and then joining into your fact table.

When using a subquery, you usually want to make sure explain regards this as a derived table, and not a dependent subquery. A dependent subquery evaluates the inner query for every row returned by the outer query. Since your outer query returns 13M rows, MySQL is executing your inner query 13M times. That takes some time. Instead of using IN, try re-writing that query as a JOIN and make sure if you use the same tables or field names that they are aliased properly so that there can be no confusion about which query (inner or outer) a column or table belongs to. So you shouldn’t use just SID_DateID, instead you should be using d.SID_DateID or r.SID_DateID where appropriate.

there is an index on Date:

mysql> show keys from rpo_dim_date;
±-------------±-----------±---------±-------------±---- --------±----------±------------±---------±-------±---- -±-----------±--------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
±-------------±-----------±---------±-------------±---- --------±----------±------------±---------±-------±---- -±-----------±--------+
| rpo_dim_date | 0 | PRIMARY | 1 | SID_DateId | A | 13150 | NULL | NULL | | BTREE | |
| rpo_dim_date | 0 | Date | 1 | Date | A | 13150 | NULL | NULL | | BTREE | |
| rpo_dim_date | 1 | DateId | 1 | DateId | A | 13150 | NULL | NULL | | BTREE | |
| rpo_dim_date | 1 | YYYYMMDD | 1 | YYYYMMDD | A | 13150 | NULL | NULL | | BTREE | |
±-------------±-----------±---------±-------------±---- --------±----------±------------±---------±-------±---- -±-----------±--------+

but it won’t help - and full join doesn’t work either… and in the initial query do you think there is something missing? for me logical the query looks fine…!?