Very poor performance with select count(*)

Hello, this is my first post )

I’m developing a log analysis application and I’m having a performance problem that I was hoping that someome could help (thanks in advance)

I have table QUERY that has almost 3 million rows. I also have a table Location containing rows for most cities of the world. The QUERY table has a LID (location id) that matches a specific location).

Although the complete query is not this simple, I’ve discovered where the performance problem arises.

The query is as simple as this:

SELECT QUERY, COUNT(*) AS FREQUENCY
FROM QUERY
where lid=2703
GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

When I run this simple query with other lid values it is quite fast, but I discovered that a very large percentage of the queries has lid=2703 (over 1.8 million in almost 3 million). When I run the query, the results take over 200sec to be processed.

This is the table:

CREATE TABLE QUERY (
QID INT UNSIGNED NOT NULL, (query id)
SID INT UNSIGNED NOT NULL, (session id)
QUERY VARCHAR(255) NOT NULL,
LID INT UNSIGNED, (location id)
PRIMARY KEY(QID)
)ENGINE = MYISAM;

SHOW INDEXES FROM QUERY

table |Non_unique |key_name |seq_in_index |Column_name |Collation |Cardinality |Sub_part |Packed |Null |Index_type

‘query’, 0, ‘PRIMARY’, 1, ‘QID’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘query’, 1, ‘QUERYSID_NDX’, 1, ‘SID’, ‘A’, 596768, , ‘’, ‘’, ‘BTREE’, ‘’
‘query’, 1, ‘QUERYLID_NDX’, 1, ‘LID’, ‘A’, 132615, , ‘’, ‘YES’, ‘BTREE’, ‘’
‘query’, 1, ‘QUERYQ_NDX’, 1, ‘QUERY’, ‘A’, 341010, , ‘’, ‘’, ‘BTREE’, ‘’
‘query’, 1, ‘QUERY_NDX_FULL’, 1, ‘QUERY’, ‘’, 140416, , ‘’, ‘’, ‘FULLTEXT’, ‘’

(At first I had only the full index for column query, then I added the “normal” B-tree index)

-----EXPLAIN

1, ‘SIMPLE’, ‘QUERY’, ‘index’, ‘QUERYLID_NDX’, ‘QUERYQ_NDX’, ‘767’, ‘’, 89, ‘Using where; Using temporary; Using filesort’

How can I display the top 10 queries for some location OR many locations, with good performance?

Could you please help?
I know the word “urgent” is very used by this is the case…
Thx

have you tried using a multiple-column index on lid,query?

try:

alter table frequency add index(lid,query);

then:

select query,count(query) as frequency
from query
where lid=2703 group by query order by frequency desc limit 10

Thank you very much for both your quick response and proposed solution!

It actually improved the performance!

I’ve used the code you provided:

alter table query add index(lid,query);

NOTE: for understanding the EXPLAIN that follows, the name of the new multiple index was created with the dedault name LID.

SELECT QUERY, COUNT(*) AS FREQUENCY
FROM LOCATION
JOIN QUERY ON QUERY.LID=LOCATION.LID
WHERE city=‘Mountain View’
GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

takes 85.5sec

When I run the this query I realize, using Explain, that the optimizer doesn’t choose the new LID index.

–Explain
1, ‘SIMPLE’, ‘LOCATION’, ‘ref’, ‘PRIMARY,LOCATION_CITY_NDX’, ‘LOCATION_CITY_NDX’, ‘153’, ‘const’, 1, ‘Using where; Using temporary; Using filesort’

1, ‘SIMPLE’, ‘QUERY’, ‘ref’, ‘QUERYLID_NDX,LID’, ‘QUERYLID_NDX’, ‘5’, ‘logsbnfull.LOCATION.LID’, 18, ‘Using where’

NOTE: The value referred in the previous post (2703) is from city Mountain View.

So I hinted the optimizer to choose the LID index

SELECT QUERY, COUNT(*) AS FREQUENCY
FROM LOCATION
JOIN QUERY USE INDEX (LID) ON QUERY.LID=LOCATION.LID
WHERE city=‘Mountain View’
GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

takes 50.28sec

– EXPLAIN
1, ‘SIMPLE’, ‘LOCATION’, ‘ref’, ‘PRIMARY,LOCATION_CITY_NDX’, ‘LOCATION_CITY_NDX’, ‘153’, ‘const’, 1, ‘Using where; Using temporary; Using filesort’

1, ‘SIMPLE’, ‘QUERY’, ‘ref’, ‘LID’, ‘LID’, ‘5’, ‘logsbnfull.LOCATION.LID’, 18, ‘Using where; Using index’

Can I do anything more to improve performance (since queries like these are only a small part of a results page and summing many times like these will be slow).

Thanks again for your great reply! )

Nelson

try doing a count(query) instead of a count(*)

if your table doesn’t change often, have you considered using a query_cache?

try:
show variables like ‘query_cache%’

if you’re feeling ambitious, maybe a redesign or additional table could be in order; if you only care about aggregation why not store the count as a column that gets incremented when you execute a query? This would give you instant aggregation results, and if this table was merely added to your current schema, it shouldn’t affect performance.

CREATE TABLE QUERY_count (
QUERY VARCHAR(255) NOT NULL primary key unique,
query_count int unsigned not null default ‘0’,
LID INT UNSIGNED,
index(query_count),
index(LID)
)ENGINE = MYISAM;

insert ignore into query_count (query) vlaues(‘select’);
update QUERY_count set query_count = query_count + 1 where query=‘select’;

YOU REALLY LIKE CAPS LOCK )

try doing a count(query) instead of a count(*)
bollocks, since query is NOT NULL, it is exactly the same query.

Try this:

SELECT QUERY, COUNT(*) AS FREQUENCY
FROM QUERY
WHERE QUERY.lid=(SELECT lid FROM LOCATION WHERE city=‘Mountain View’)
GROUP BY QUERY ORDER BY FREQUENCY DESC LIMIT 10

And include HAVING COUNT() > 1 if there are many COUNT()'s equal to 1 that are guaranteed to never be in the output.

Do you really need VARCHAR(255)? Shorter length will speed up your query.

gmouse > why would a shorter varchar speed up the query? I always thought that the length was just a semantic limit on length?

[B]c113345@tyldd.com wrote on Mon, 24 August 2009 17:49[/B]
gmouse > why would a shorter varchar speed up the query? I always thought that the length was just a semantic limit on length?
It will not speed up my query if it is properly indexed, but it will speed up yours. Google for Generosity Can Be Unwise

Thank you for your help! I’m still having performance issues…

I have:
-a table with user queries (QUERY)
-a table with locations (LOCATIONS)
-a table that relates locations and time with queries (LOC_TIME_QUERIES)

Among many other things, I need to create a frequency chart based on the three levels (query, time, location). For every point of the chart I’m using the following query to get the frequency:

SELECT COUNT(QUERY) AS FREQUENCY
FROM LOC_TIME_QUERY
JOIN LOCATION ON LOCATION.LID=LOC_TIME_QUERY.LID
JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY.QID
WHERE MATCH (QUERY) AGAINST (‘music’ IN BOOLEAN MODE)
AND TIME>= ‘20081201000000’ AND TIME < ’ 20081203000000’

It is slow. So I decided to check a lighter version considering only the query and the time:

SELECT COUNT(*) AS FREQUENCY
FROM LOC_TIME_QUERY
JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY.QID
WHERE MATCH (QUERY) AGAINST (‘portugal’ IN BOOLEAN MODE)
and time >= ‘20081201000000’ and time < ’ 20081203000000’

it takes several seconds (10sec or so) to retrieve the results.
Considering that this is to be repeated for all the point of the chart…it takes quite a while.

NOTE: When separated, they are fast. That is, if I only ask for the query results that match it is fast…
if I only ask for the time results it is fast…
but not when joined!!!
I know that the query part is performed first… but I see that afterwards the time is not using the index

—EXPLAIN

ID|SELECT TYPE|TABLE|TYPE|POSSIBLE_KEYS|KEY|KEY LEN|REF|ROWS|EXTRA
1, ‘SIMPLE’, ‘QUERY’, ‘fulltext’, ‘PRIMARY,QUERY_NDX_FULL’, ‘QUERY_NDX_FULL’, ‘0’, ‘’, 1, ‘Using where’

1, ‘SIMPLE’, ‘LOC_TIME_QUERY’, ‘ref’, ‘PRIMARY,LTQ_TIME_NDX,LTQ_QID_NDX,LTQ_TIMELID_NDX,LTQ_QIDTIM E_NDX’, ‘LTQ_QID_NDX’, ‘4’, ‘logsbnfull.QUERY.QID’, 1, ‘Using where’


CREATE TABLE LOC_TIME_QUERY (
QID INT UNSIGNED NOT NULL,
TIME DATETIME,
LID INT UNSIGNED,
PRIMARY KEY (QID, TIME)
)ENGINE = MYISAM;

CREATE INDEX LTQ_TIME_NDX USING BTREE ON LOC_TIME_QUERY(TIME);
CREATE INDEX LTQ_LID_NDX USING BTREE ON LOC_TIME_QUERY(LID);
CREATE INDEX LTQ_QID_NDX USING BTREE ON LOC_TIME_QUERY(QID);

I’ve tried other combinations of indexes…

Can I have your hints about improving the performance when considering the relation of query, time and location?

Thx

Nelson

Try a multi-column index on (QID,TIME,LID) in the table LOC_TIME_QUERIES.

And make sure LOCATIONS.LID is indexed (it is probably a primary key, which is fine).

I’ve tried it but the optimizer chooses the index for the QID alone. Even in the case of the second query I presented (the one with only query and time) it doesn’t choose the primary key, which is (qid,time).

I’ve tried to hint the optimizer to use other keys but doesn’t get any better.

While running, I can’t find a way to take advantage of the time indexation…

Thanks

Nelson

Full explain output?

And try IGNORE INDEX / FORCE INDEX.

As I stated earlier, to simplify, right now I will only consider query and time.

SELECT COUNT(*) AS FREQUENCY
FROM LOC_TIME_QUERY2
JOIN QUERY ON QUERY.QID=LOC_TIME_QUERY2.QID
WHERE MATCH (QUERY) AGAINST (‘portugal’ IN BOOLEAN MODE)
and time >= ‘20081202000000’ and time < ’ 20081202210000’

----> 1891
4,8sec

ID|SELECT_TYPE|TABLE|TYPE|POSSIBLE_KEYS|KEY|KEY_LEN|REF|ROWS |EXTRTA
1, ‘SIMPLE’, ‘QUERY’, ‘fulltext’, ‘PRIMARY,QUERY_NDX_FULL’, ‘QUERY_NDX_FULL’, ‘0’, ‘’, 1, ‘Using where’
1, ‘SIMPLE’, ‘LOC_TIME_QUERY2’, ‘ref’, ‘PRIMARY,LTQ2_TIME_NDX,LTQ2_QID_NDX,QIDTIMELID_NDX’, ‘LTQ2_QID_NDX’, ‘4’, ‘logsbnfull.QUERY.QID’, 1, ‘Using where’

The primary key is (qid,time). But I also indexed the following:

SHOW INDEXES FROM LOC_TIME_QUERY2

TABLE|NON_UNIQUE|KEY_NAME|SEQ_IN_INDEX|COLUMN_NAME|COLLATION |CARINALITY|SUB_PART|PACKED
‘loc_time_query2’, 0, ‘PRIMARY’, 1, ‘QID’, ‘A’, , , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 0, ‘PRIMARY’, 2, ‘TIME’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘LTQ2_TIME_NDX’, 1, ‘TIME’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘LTQ2_LID_NDX’, 1, ‘LID’, ‘A’, 132615, , ‘’, ‘YES’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘LTQ2_QID_NDX’, 1, ‘QID’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘QIDTIMELID_NDX’, 1, ‘QID’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘QIDTIMELID_NDX’, 2, ‘TIME’, ‘A’, 2387075, , ‘’, ‘’, ‘BTREE’, ‘’
‘loc_time_query2’, 1, ‘QIDTIMELID_NDX’, 3, ‘LID’, ‘A’, 2387075, , ‘’, ‘YES’, ‘BTREE’, ‘’

If I use


FROM LOC_TIME_QUERY2 use index (primary)

or if


FROM LOC_TIME_QUERY2 force index (primary)

I doesn’t get better…or significantly better. 4.5sec

Explain (with force index):

1, ‘SIMPLE’, ‘QUERY’, ‘fulltext’, ‘PRIMARY,QUERY_NDX_FULL’, ‘QUERY_NDX_FULL’, ‘0’, ‘’, 1, ‘Using where’
1, ‘SIMPLE’, ‘LOC_TIME_QUERY2’, ‘ref’, ‘PRIMARY’, ‘PRIMARY’, ‘4’, ‘logsbnfull.QUERY.QID’, 23870, ‘Using where; Using index’


When I try:

FROM LOC_TIME_QUERY2 ignore index (LTQ2_QID_NDX)

the index the optimizer chooses the (qid,time,lid) index…which looks appropriate, but it gets worse: (5.1s)

EXPLAIN (with ignore index):

1, ‘SIMPLE’, ‘QUERY’, ‘fulltext’, ‘PRIMARY,QUERY_NDX_FULL’, ‘QUERY_NDX_FULL’, ‘0’, ‘’, 1, ‘Using where’
1, ‘SIMPLE’, ‘LOC_TIME_QUERY2’, ‘ref’, ‘PRIMARY,LTQ2_TIME_NDX,QIDTIMELID_NDX’, ‘QIDTIMELID_NDX’, ‘4’, ‘logsbnfull.QUERY.QID’, 1, ‘Using where; Using index’

Your atention to this problem of mine is very appreciated. Thanks.
Nelson

The (qid,time,lid) will probably be better for the full query. Your PK suffices for the light query. Check your full query after all optimization is done; if (qid,time,lid) turns out to be not so much faster (or even slower) than the PK-thing, just drop (qid,time,lid).

I doubt the join is your problem. How fast is this query?

SELECT COUNT(*) AS FREQUENCY
FROM LOC_TIME_QUERY2
WHERE MATCH (QUERY) AGAINST (‘portugal’ IN BOOLEAN MODE)

And this one

SELECT QID
FROM LOC_TIME_QUERY2
WHERE MATCH (QUERY) AGAINST (‘portugal’ IN BOOLEAN MODE)