Until 20:16:02 CET the slow-query log is almost empty. Then suddenly a query that normally takes up to 2 seconds takes 8 seconds. Over the next 9 minutes slow-query.log fills up with queries taking far longer than they normally do. Again, the same query that usually takes up to 2 seconds takes 20 seconds (!) at 20:16:22.
Time: 100129 20:22:12
Query_time: 205 Lock_time: 102
Query_time: 307 Lock_time: 199
Query_time: 181 Lock_time: 0
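(In case it helps anyone reading: aggregating the log with mysqldumpslow shows which statements dominate a window like this; the log path below is just an assumed example.)

# summarize slow log entries, sorted by total query time (log path is assumed)
mysqldumpslow -s t /var/log/mysql/mysql-slow.log | head -n 20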
It is not always at the same time, so the problem appears randomly; it is not, for example, some cron job. Does anyone have an idea what to check, or what this could be?
OK, today I gathered more diagnostics.
Web server restarts were at:
2010-02-06 11:09:08
2010-02-06 12:49:09
2010-02-06 13:36:09
which indicates the times of the problem. Then I looked at the diagnostics gathered at those times.
At 11:09 (first restart), utilization of /dev/sdb, which holds the 3 most-used MySQL tables, was at 100% for some time. Then again at 12:49 and 13:36, around the next two restarts, the same problem: 100% utilization of the whole HDD by those 3 tables, which normally use up to 20%.
As you can see, CPU I/O wait time is almost 100%. The next minute it is just as high. From 11:11 to 11:12 WA is around 20, then at 11:13 it is back to normal, with WA between 3 and 11.
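For next time, a rough sketch of what could be left running to correlate the CPU wait with the disk itself (the 1-second interval and the sdb device are just assumptions from my setup):

# extended per-device stats every second; watch the %util column on the sdb row
iostat -x 1
# CPU breakdown every second; the 'wa' column is the iowait shown above
vmstat 1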
So any of these chokes (HDD/CPU/MySQL) could be caused by one of the others. The HDD could block MySQL and the CPU, but the CPU could also be the reason the HDD and MySQL are blocked.
How do I find out what is really causing this, and how do I fix it?
“how to fix it” is not a valid question until you find out what the problem is. Look again at vmstat and think more about the relationship between iowait in the CPUs and disk IO. The CPUs are not causing the disks to “block.” The disks aren’t blocking. Look – they are reading tons of data. The CPUs are blocked waiting for that to complete.
Something is asking the disks to read data. It could be mysqld or something else. Is there a backup running? Is there a long-running query? You need to write a script that gets ‘ps’ and SHOW PROCESSLIST to answer these questions.
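A minimal sketch of such a script, assuming it runs from cron every minute and that the output directory and MySQL credentials are adapted to the actual setup:

#!/bin/sh
# Snapshot the process list and MySQL activity so we can see what was running
# when the disk saturates (paths and credentials are assumptions).
TS=$(date +%Y%m%d-%H%M%S)
OUT=/var/log/diag/$TS.txt
mkdir -p /var/log/diag
{
  echo "== ps =="
  ps auxww
  echo "== SHOW FULL PROCESSLIST =="
  mysql -uroot -e 'SHOW FULL PROCESSLIST'
  echo "== vmstat =="
  vmstat 1 5
} > "$OUT" 2>&1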
We were forced to cut page viewing down from, for example, 3000 pages to the first 25 pages, because performance degrades with every further page.
I will look into this Sphinx pagination thing, but Sphinx has issues of its own (returning false matches sometimes, etc.), so I'm afraid its pagination will not be 100% exact as it should be.
Another thing: yesterday (07.02.2010) the problem was minimized, maybe because traffic was not at its biggest that day, or because of the changes that were made (packet size from 16 back to 1 and thread cache size from 190 to 150, with restarts of both mysqld's).
The web server didn't break all day, and the chokes reached 95% and at most 97% of disk capacity.
17:59 - for only 4 seconds, disk utilization was 60%, 90%, 71%, 76%
20:26 - for at least 4 seconds, 75%, 95%, 97%, 79%; I don't know more…
Only 1 query from those times appears in the slow query log :)
17:59 - a query similar to the one described above, ending in LIMIT 784, 16;
20:26 - a query similar to the one described above, ending in LIMIT 784, 16;
Here we go: our problem is probably disk seek times caused by big LIMITs.
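For what it's worth, the usual way around big offsets is to stop MySQL from reading and throwing away hundreds of rows per page: either remember the last id of the previous page ('seek' pagination) or page through the primary key alone and join back. A rough sketch, with the posts table and id column made up since the real schema isn't shown here:

# seek-style pagination: continue from the last id of the previous page
# instead of using a big OFFSET (table/column names are made up for illustration)
mysql -e "SELECT id, title FROM posts WHERE id < 123456 ORDER BY id DESC LIMIT 16"

# deferred-join variant: let the big offset walk only the narrow PK index,
# then fetch the full rows for just those 16 ids
mysql -e "SELECT p.* FROM posts p JOIN (SELECT id FROM posts ORDER BY id DESC LIMIT 784, 16) AS page USING (id)"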
:) Quite happy now; maybe some advice on Sphinx pagination and its reliability?
Did you ever encounter Sphinx returning false matches? For example, exactly as described here:
"omg why sphinx making so big search mistakes" (thread on the Sphinx forum)
We too are searching with a phrase that is a big number, for example: 2256797652,
and results are found in posts which don't contain this number at all.
I know I should probably use SetLimits(xxx/xx), somehow involving these 864 IDs (from the query above), but I'm not sure yet.
UPDATE: confirmed it 100% by tests. :) A big LIMIT causes disk utilization to jump from 0% to 99%, and under high traffic this causes massive disk blocking, which results in queries queuing for minutes.