Sudden huge query speed degradation for a couple of minutes

Until 20:16:02 CET the slow-query log is almost empty. Then suddenly a query which normally takes up to 2 seconds takes 8 seconds. Over the next 9 minutes the slow-query log fills with queries taking much, much longer than they normally do. Again, the same query which takes up to 2 seconds takes 20 seconds (!) at 20:16:22.

Excerpts from the slow-query log (times in seconds):

Time: 100129 20:22:12
Query_time: 205  Lock_time: 102
Query_time: 307  Lock_time: 199
Query_time: 181  Lock_time: 0

It is not always at the same time, so the problem appears randomly; it's not, for example, some cron job. Does anyone have an idea what to check, or what this could be?

Gather more diagnostics. Get IO, CPU and process stats, plus samples of SHOW PROCESSLIST, SHOW INNODB STATUS and SHOW GLOBAL STATUS at 10-second intervals.
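A minimal sketch of such a sampler, assuming a Linux box with vmstat/iostat installed and a mysql client that can connect without prompting (for example via ~/.my.cnf); the file names and the exact interval are arbitrary choices:

[CODE]
# diag_sampler.py - rough sketch of a 10-second diagnostics sampler.
# Assumes Linux, vmstat/iostat (sysstat) and a mysql client that can
# connect without prompting (e.g. credentials in ~/.my.cnf).
import subprocess, time, datetime

MYSQL_SAMPLES = [
    "SHOW FULL PROCESSLIST",
    "SHOW INNODB STATUS",
    "SHOW GLOBAL STATUS",
]

def capture(cmd):
    """Run a shell command and return its output (message on failure)."""
    try:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=30).stdout
    except Exception as exc:
        return "capture failed: %s\n" % exc

while True:
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("diag-%s.log" % datetime.date.today(), "a") as out:
        out.write("==== %s ====\n" % stamp)
        out.write(capture("vmstat 1 2"))                        # CPU / iowait
        out.write(capture("iostat -dx 1 2"))                    # per-device utilisation
        out.write(capture("ps auxww --sort=-pcpu | head -20"))  # busiest processes
        for stmt in MYSQL_SAMPLES:
            out.write("-- %s --\n" % stmt)
            out.write(capture('mysql -e "%s\\G"' % stmt))
    time.sleep(10)
[/CODE]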

OK, today I gathered more diagnostics.
The web server restarts were at:
2010-02-06 11:09:08
2010-02-06 12:49:09
2010-02-06 13:36:09

which indicates the times of the problem. Then I looked at the diagnostics gathered at those times.

At 11:09 (the first restart) utilization of /dev/sdb, which holds the 3 most-used MySQL tables, was at 100% for some time. Then again at 12:49 and 13:36, at the next two restarts, the same problem: 100% utilization of the whole HDD by 3 tables which normally use up to 20%.

Next are the vmstat logs (from 11:09):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r   b    swpd    free   buff   cache   si   so    bi    bo   in    cs us sy id wa st
 1  75  436076  181108   4220  646940    0    0    15    74    0     0 19  3 76  2  0
 0  68  436076  181212   4268  648652    0    0  1820    92 1514  1576  2 12  0 87  0
 0  67  436076  180016   4272  650024    0    0  1328     8 1375   913  0  1  0 99  0
 0  64  436076  177796   4276  652220    0    0  2200     4 1533  1209  1  0  0 99  0
 0  63  436076  176272   4276  653800    0    0  1528     0 1464  1108  1  1  0 99  0
 1  59  436076  175176   4308  655748    0    0  1740   472 1627  1525  4  1  0 96  0
 0  60  436076  173800   4308  657076    0    0  1432     0 1611  1445  2  1  0 97  0
33  60  436076  385480   4316  658388    0    0  1384     0 1628  1508  3  6  0 91  0
 0  64  385636  995628   4392  595104    0    0  1812     8 4329  8001 60 18  0 22  0
70 105  385632  817292   4440  611452    4    0  2284     4 4511  5055 78 12  0 10  0

As you can see, CPU I/O wait (wa) is almost at 100%. The next minute it is also that high. From 11:11 to 11:12 wa is around 20, then at 11:13 it is back to normal, ranging from 3 to 11.

So any of these chokes (HDD/CPU/MySQL) could be caused by one of the others. The HDD could be blocking MySQL and the CPU, but the CPU could also be the reason the HDD and MySQL are blocked.

How do I find out what is really causing this, and how do I fix it?

“how to fix it” is not a valid question until you find out what the problem is. Look again at vmstat and think more about the relationship between iowait in the CPUs and disk IO. The CPUs are not causing the disks to “block.” The disks aren’t blocking. Look – they are reading tons of data. The CPUs are blocked waiting for that to complete.

Something is asking the disks to read data. It could be mysqld or something else. Is there a backup running? Is there a long-running query? You need to write a script that gets ‘ps’ and SHOW PROCESSLIST to answer these questions.
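For example, something along these lines could grab both at the moment the disk saturates; the device name, the 90% threshold and the output file are placeholder choices:

[CODE]
# catch_spike.py - dump ps and SHOW PROCESSLIST whenever /dev/sdb is saturated.
# Assumes Linux, sysstat's iostat and a mysql client that connects without a
# prompt; device name and threshold are arbitrary placeholders.
import subprocess, time, datetime

DEVICE = "sdb"
THRESHOLD = 90.0   # %util above which a snapshot is taken

def run(cmd):
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

while True:
    # The second report of `iostat -dx 1 2` covers the last second,
    # not the averages since boot.
    lines = run("iostat -dx 1 2").splitlines()
    util = 0.0
    for line in lines[len(lines) // 2:]:      # rough way to skip the first report
        fields = line.split()
        if fields and fields[0] == DEVICE:
            util = float(fields[-1])          # %util is the last column
    if util >= THRESHOLD:
        stamp = datetime.datetime.now().isoformat()
        with open("spike.log", "a") as out:
            out.write("==== %s %s %%util=%.1f ====\n" % (stamp, DEVICE, util))
            out.write(run("ps auxww --sort=-pcpu | head -30"))
            out.write(run('mysql -e "SHOW FULL PROCESSLIST\\G"'))
    time.sleep(5)
[/CODE]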

Switch to dedicated search software, such as Sphinx.

[B]Quote:[/B]
Switch to dedicated search software, such as Sphinx.
Who said I'm not on it already? ;] But the search results must still be pulled from the DB, like here.

We were forced to cut page viewing from, for example, 3000 pages down to the first 25 pages because of degrading performance with every subsequent page.

I will look into this Sphinx pagination thing, but Sphinx has issues of its own (sometimes returning false matches, etc.), so I'm afraid its pagination will not be 100% exact, as it should be.

Another thing: yesterday (07.02.2010) the problem was reduced, maybe because traffic wasn't at its highest that day, or because of the changes that were made (packet size from 16 back to 1 and thread cache size from 190 to 150, with restarts of both mysqld instances).

The web server didn't break all day, and the chokes reached 95% and at most 97% of disk capacity.
17:59 - for only 4 seconds disk usage was 60%, 90%, 71%, 76%
20:26 - for at least 4 seconds: 75%, 95%, 97%, 79%, don't know more…

Only 1 query from those times is in the slow query log )
17:59 - similar to the query described above, with LIMIT 784, 16;
20:26 - similar to the query described above, with LIMIT 784, 16;

Here we go: our problem is probably disk seek times caused by big LIMITs (a large LIMIT offset makes MySQL read and throw away all the preceding rows before it can return the requested page).
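If that is the case, one workaround would be keyset ("seek") pagination: remember the last id of the previous page instead of using a big offset. A rough sketch, with placeholder table, column and connection details rather than our real schema:

[CODE]
# Keyset ("seek") pagination sketch: remember the last id shown on the
# previous page and continue from there, so MySQL can jump straight to the
# right place in the index instead of reading and discarding the rows
# before the page. Table/column names and credentials are placeholders.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="forum")
cur = conn.cursor()

def fetch_page(last_seen_id, page_size=16):
    """Return the next `page_size` rows that come after `last_seen_id`."""
    cur.execute(
        "SELECT id, body FROM posts"
        " WHERE id > %s"
        " ORDER BY id"
        " LIMIT %s",
        (last_seen_id, page_size),
    )
    return cur.fetchall()

# Roughly equivalent to `... LIMIT 784, 16`, but without touching the 784
# rows before the page; 12345 stands for the last id of the previous page.
rows = fetch_page(last_seen_id=12345)
[/CODE]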

) Quite happy now; maybe some advice on Sphinx pagination and its reliability?
Did you ever encounter Sphinx returning false matches? For example, exactly like described in the thread "omg why sphinx making so big search mistakes" on the Sphinx forum.
We too search with a phrase like a big number, for example 2256797652,
and results are found in posts which don't contain this number at all.

I know I should probably use SetLimits(xxx/xx), somehow involving these 864 ids (from the query above), but I'm not sure yet.
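Something like this is roughly what I have in mind: let Sphinx skip the offset itself and only pull that one page of ids from MySQL. This is only a sketch with the Python sphinxapi client; the index name, port and exact call details are assumptions to verify against the client version in use:

[CODE]
# Deep pagination done in Sphinx instead of MySQL: searchd skips the offset
# itself and returns only the ids of the requested page, which are then
# fetched from the database.
# Index name, server/port and query are assumptions for illustration.
import sphinxapi

cl = sphinxapi.SphinxClient()
cl.SetServer("localhost", 9312)

page, page_size = 50, 16
offset = (page - 1) * page_size
# A third argument (max_matches) may be needed when paging past the
# default window of matches.
cl.SetLimits(offset, page_size)

res = cl.Query("2256797652", "posts_index")
if res and res["matches"]:
    ids = [m["id"] for m in res["matches"]]
    # Then fetch only these rows from MySQL, e.g.
    # SELECT ... FROM posts WHERE id IN (<ids>) ORDER BY FIELD(id, <ids>)
[/CODE]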

UPDATE: confirmed it 100% by tests ) A big LIMIT causes disk usage to jump from 0% to 99%, and under high traffic this causes massive disk blocking, which results in queries queueing for minutes.
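A test along these lines shows it; table and column names are placeholders, and SQL_NO_CACHE keeps the query cache out of the comparison:

[CODE]
# Time the same query with a small and a large LIMIT offset while watching
# iostat -dx in another terminal. Placeholder table/column names.
import time
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="forum")
cur = conn.cursor()

def timed(sql):
    start = time.time()
    cur.execute(sql)
    cur.fetchall()
    return time.time() - start

print("small offset:", timed(
    "SELECT SQL_NO_CACHE id, body FROM posts ORDER BY id LIMIT 0, 16"))
print("large offset:", timed(
    "SELECT SQL_NO_CACHE id, body FROM posts ORDER BY id LIMIT 48000, 16"))
[/CODE]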