I posted this on the official forums a week ago and didn’t get a response, so hopefully I’ll have more luck here. If my question doesn’t provide enough information, please let me know and I’ll give whatever I can.
MySQL Version: 5.0.45-7.el5 (current latest for RHEL5)
Every so often (it can be anything from a few hours to a few weeks), my MySQL process on a dedicated DB machine jumps to 100% CPU usage and stays there until restarted.
At first I thought it was my application hammering the DB server with recursive queries, but I shut all the webservers down, left the DB running for an hour, and the CPU usage did not let up. This leads me to believe it’s a single bad query (or a small number of them) causing problems within MySQL.
My problem is that I don’t know where to start to diagnose this.
I’ve enabled log-slow-queries, of course, but nothing shows up at all. If the application isn’t running or has low usage, the problem doesn’t occur, so I’m reasonably sure it’s some query from the application… but it’s a big, big app with many queries during normal usage. I can’t really log every query at the app level and expect to sift through them to find the problem (especially since it looks like this is a rarely called query, given that it doesn’t happen too often).
Anyway, any ideas how I can go about solving this?
If I were you, I’d try to rule out a change in query load first. When your app is running normally, you should be able to get an idea of what normal queries per second looks like. If you’re not monitoring this already, you can do it manually with SHOW GLOBAL STATUS LIKE 'Questions'. I usually try to shove as much info from SHOW STATUS as I can into tools like Cacti or nagiosgraph, so that when something weird happens I can go back and see what changed.
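In case it’s useful, here’s a rough sketch of doing that by hand from the mysql client. The Questions counter is cumulative since the server started, so you sample it twice and divide the difference by the interval:

    -- First sample of the cumulative query counter
    SHOW GLOBAL STATUS LIKE 'Questions';
    -- ...wait a known interval, say 60 seconds, then sample again...
    SHOW GLOBAL STATUS LIKE 'Questions';
    -- (second value - first value) / 60 = queries per second over that minute

    -- For a rough average since startup, compare against Uptime instead:
    SHOW GLOBAL STATUS LIKE 'Uptime';

Cacti or nagiosgraph will do that delta-over-time calculation for you automatically once the counter is being collected.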
Once you know your expected query load, measure it again when the problem next occurs. If queries per second goes up, then you know you’ve got some application abusing the database; but if it goes down, it’s likely that a set of problematic queries is monopolizing the CPU. In the latter case, SHOW FULL PROCESSLIST should help you out, since queries like that shouldn’t disappear from the list quickly.
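For example (a minimal sketch of what to look for; the column names are from the stock 5.0 output):

    SHOW FULL PROCESSLIST;
    -- The Time column is how many seconds the current statement has been
    -- running, and Info holds the full query text (that's the point of FULL).
    -- A runaway query will usually sit there with a steadily climbing Time
    -- while everything else comes and goes.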
If that doesn’t pan out, you might have to start exploring the possibility that the problem is much lower-level, like a bug in MySQL or in your Linux kernel.
Another thing to try, if it’s too involved to shut down every app that accesses the database, is to firewall them out with iptables, provided that doesn’t put your business at risk.
These kinds of problems are usually pretty tough to solve, so best of luck.
I wasn’t aware of SHOW FULL PROCESSLIST (you learn something new every day…), so that helped me find out which query caused the problem, and in the short term it let me kill that thread without having to restart the whole mysql process.
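For anyone who finds this later, roughly what I did (the thread id here is made up). KILL on its own drops the whole connection, while KILL QUERY only aborts the running statement:

    SHOW FULL PROCESSLIST;
    -- Suppose the runaway query shows up with Id 12345 (hypothetical id):
    KILL QUERY 12345;   -- abort just the statement
    -- or
    KILL 12345;         -- drop the whole connection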
If you’ve identified the query, and it’s not slow because of lock contention, EXPLAIN should go a long way toward helping you figure out what’s going on. If you want to post the query and the explain plan, I’d be happy to speculate some more.
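Something like this, with your actual query substituted in (the tables and columns here are just made up for illustration):

    EXPLAIN SELECT o.id, c.name
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     WHERE o.created_at > '2008-01-01';
    -- Things to watch for in the output: type = ALL (full table scan),
    -- very large 'rows' estimates, and 'Using temporary; Using filesort'
    -- in the Extra column.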