Hello
I've tried probably everything, so you're my last hope. Basically, every 1 hour and 40 minutes there is a load spike on the graph from the monitoring software running on my server.
For a moment I thought it could be bogus load (which was mentioned in two or three articles on the internet) caused by the MRTG I was using. I moved everything to RRDtool and made sure each reading would be executed with a random delay (the same for the graph generators), but the load spikes didn't change in frequency or amplitude. I also have graphs for the CPU, memory, network and web server activity, but I can't see any correlation between them and the load graph. Thanks to a post about monitoring MariaDB that was published on this blog several years ago, I set up a simple select/update/insert/delete monitor. It helped me find a web scraper that was fetching data from my website during the night, but it didn't reveal the real reason for the load going up.
I have the monitoring system set up so it sends me an email with additional info (like the process list, queries executed by MySQL, etc.), and in almost every case MySQL (which is MariaDB now) was using 100% of the CPU. Unfortunately, the slow query log (including queries not using indexes) couldn't provide any additional information about this case, because from the database's point of view everything was fine and dandy.
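For reference, the alert part is just a small script run from cron, along these lines (this is a simplified sketch, not my literal script; the threshold, the recipient address and the exact commands shown are placeholders):

#!/bin/bash
# Simplified sketch: when the 1-minute load average crosses a threshold,
# mail a snapshot of the process list and the current MariaDB activity.
THRESHOLD=4                      # placeholder threshold
MAILTO="admin@example.com"       # placeholder address

load1=$(cut -d' ' -f1 /proc/loadavg)

# floating-point comparison of the current load against the threshold
if awk -v l="$load1" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    {
        echo "Load average (1m): $load1"
        echo; echo "=== top CPU consumers ==="
        ps aux --sort=-%cpu | head -n 20
        echo; echo "=== MariaDB processlist ==="
        mysql -e "SHOW FULL PROCESSLIST\G"
    } | mail -s "Load spike on $(hostname)" "$MAILTO"
fi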
What else should I check to see why my server thinks it is being overloaded?
You need to configure the slow log in MariaDB to log all queries and make sure the slow log is actually turned on. Set long_query_time=0 and slow_query_log=ON, then check again.
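Something like this at runtime (assuming your client can connect with enough privileges); add the same two settings to your my.cnf if you want them to survive a restart:

# enable the slow log for all queries at runtime, no restart needed
mysql -e "SET GLOBAL slow_query_log = ON;"
mysql -e "SET GLOBAL long_query_time = 0;"

# confirm the settings and see where the log file is written
mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'long_query_time';"

Keep in mind that long_query_time set globally only applies to new connections; long-lived connections keep their old session value.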
Done. Nothing spectacular. I cut the part of the slow log between the timestamps around one spike and fed it into pt-query-digest (the rough form of the command is below the numbers). It says:
Exec time max = 539ms
Lock time max = 223ms
Rows sent max = 11.6k
Rows examine max = 29.33k
Rows affected max = 43
Query size max = 92.41k
Filesort 65% yes, 34% no
Full scan 18% yes, 81% no
Priority queue 24% yes, 75% no
QC hit 45% yes, 54% no
Tmp table 68% yes, 31% no
Tmp table on disk 5% yes, 94% no
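For the record, pt-query-digest can do the time slicing itself with its --since/--until options; something along these lines should give the same result (the timestamps and the log path below are placeholders, not the exact ones I used):

pt-query-digest \
    --since '2024-01-10 14:20:00' \
    --until '2024-01-10 14:30:00' \
    /var/log/mysql/mariadb-slow.log > spike-1420.digest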
I remembered that MySQL can have scheduled tasks (events) and I checked, but there was nothing in the events table.
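For completeness, that check boils down to something like this:

# list any scheduled events and see whether the event scheduler is on at all
mysql -e "SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, INTERVAL_VALUE, INTERVAL_FIELD
          FROM information_schema.EVENTS;"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'event_scheduler';"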
So, one more idea I have is to grab the queries from each load spike, compare them, and find the one that shows up every time. It would probably be an update/insert/delete, since a select isn't the one blocking the tables, right?
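If I go that route, one simple way is to compare the query checksums from the per-spike digest reports, roughly like this (the file names are the placeholder ones from above):

# pull the 0x... query checksums out of two per-spike digest reports
# and keep only the ones that show up in both
grep -oE '0x[0-9A-F]+' spike-1420.digest | sort -u > spike-1420.ids
grep -oE '0x[0-9A-F]+' spike-1600.digest | sort -u > spike-1600.ids
comm -12 spike-1420.ids spike-1600.ids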
Hi @januzi
Thanks for reaching out.
I understand that you see a sudden CPU spike every 100 minutes, and that you have checked the slow query logs for this but didn't find much. Please let me know if my understanding is not correct.
Before going further, I have a few basic checks. Do you see a sudden spike in traffic as well? What is the number of running threads during normal hours, and what is the value at the time of a spike? What is the configuration of your machine?
Yep, the load spikes every 100 minutes. The graphs for lo and eth are flat, and the same goes for the requests processed by Apache, reads+writes on the hard drive, etc. If there's a spike there, it's at a different time that doesn't match the load spikes. As for the threads (Threads_running in the global status, right?), I added it an hour ago just before the spike, and it's pretty much the same all the time. Right now it's 2, and sometimes it jumps up to 3. The graph for the queries is also flat, and the ratio of select/insert/delete/update is pretty much the same all the time: ~90% selects, ~8% updates or inserts, and the rest deletes.
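The counter is polled by a trivial loop, roughly like this (simplified; in practice the value goes into the graph rather than to stdout):

# sample Threads_running (and Threads_connected for context) once a minute
while true; do
    mysql -N -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                 ('Threads_running','Threads_connected');" \
        | awk -v ts="$(date '+%F %T')" '{ print ts, $1, $2 }'
    sleep 60
done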
As for the configuration: 64G of RAM, Intel(R) Xeon(R) CPU E3-1275 v6 @ 3.80GHz (8 cores). For MariaDB I've set the query cache (128M) and the open tables limit (60k), and I think I've set the InnoDB buffer pool size to 1G. The rest of the settings are the defaults that came with Debian.
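Since I'm not 100% sure about that pool size, this is the kind of check I'd run to see what the server is actually using:

mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
          ('query_cache_size','query_cache_type',
           'table_open_cache','innodb_buffer_pool_size');"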
I've set up atop. The problem is I have no idea what to look for. The numbers are pretty much the same, except when there's a spike; then the avg1/5/15 go up.
That's what those spikes look like. I've marked the load that was caused by the marketing emails being sent; that one is correlated with the CPU usage, hard disk writes, network traffic, context switching and queries. (Btw, I think the scale isn't right. I probably messed something up when updating the RRD file, because a lot of those spikes were at 4.5, and some were at least 5 when my alert script sent me an email with the ps output.)
Hey @januzi, you really need to get off the ancient RRD/MRTG and move to something modern. I highly recommend that you scrap RRD/MRTG and just install Percona Monitoring and Management (PMM). Your current monitoring is missing so much useful information needed to correlate your issue. PMM will also gather all your query data as it happens, so you won't need to post-process with pt-query-digest anymore.
Filesort at 65% tells me your queries are not optimal, or you have indexes that don't match your WHERE and ORDER BY clauses.
18% of your queries are full table scans. Again, missing/bad indexes.
Turn off your query cache (a quick sketch of how is right after this list). Only a 45% hit rate? That's doing more harm than good.
68% of queries use a temporary table? Again, missing/bad indexes.
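For the query cache, something along these lines will do it; at runtime first, then persisted so a restart doesn't bring it back (the file name under Debian's include dir is just an example):

# turn the query cache off at runtime
mysql -e "SET GLOBAL query_cache_type = OFF;"
mysql -e "SET GLOBAL query_cache_size = 0;"

# persist it across restarts (Debian MariaDB include dir; file name is an example)
cat <<'EOF' | sudo tee /etc/mysql/mariadb.conf.d/60-no-query-cache.cnf
[mysqld]
query_cache_type = 0
query_cache_size = 0
EOF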
Hi @januzi ,
Thanks for your response. In addition to what Matthew has mentioned, I would also like to highlight innodb_buffer_pool_size, which has been set to only 1G. If my understanding is correct, kindly increase it. I am not sure what the size of your database is, but if 1G is too small relative to the database size, kindly resize it.
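A quick way to check the data size and then resize the pool online (MariaDB 10.2.2 and later can do this without a restart; the 32G below is only an example, pick a value that fits your data and leaves room for the OS and other services):

# rough size of all data + indexes, in GB
mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024/1024, 2) AS total_gb
          FROM information_schema.TABLES;"

# resize the buffer pool online; remember to update my.cnf as well
mysql -e "SET GLOBAL innodb_buffer_pool_size = 32 * 1024 * 1024 * 1024;"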
Hello
It took me a few days to check some things, fix the queries, rewrite the scripts, and so on. First of all, I'd like to thank @matthewb and @Ankit_Kapoor1 for your input. I've taken care of the issues you pointed out: I've changed the InnoDB buffer pool size and replaced the queries that were doing full table scans, ORDER BY RAND()/GROUP BY, etc.
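For the ORDER BY RAND() ones, the usual replacement is to pick a random id first and then fetch the row by primary key; roughly like this (the database, table and column names below are made up, just to show the shape of it):

mysql my_db <<'SQL'
-- before (scans and sorts the whole table):
--   SELECT * FROM articles ORDER BY RAND() LIMIT 1;

-- after: pick a random id in the id range, then fetch by primary key
-- (assumes an AUTO_INCREMENT primary key `id` without huge gaps;
--  `my_db`, `articles` and `id` are made-up names)
SELECT a.*
FROM articles AS a
JOIN (SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rand_id
      FROM articles) AS r
  ON a.id >= r.rand_id
ORDER BY a.id
LIMIT 1;
SQL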
I've also paid more attention to the article about unused servers having high load, because the monitoring software was hitting the same timestamps as the load sampling in the kernel. This gave me the thought that tiny bumps in the number of processes can have a huge impact on the result. I've created a new graph (I'll use PMM the next time I have a problem, I swear) and then compared the load graph with the number of cron tasks.
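The cron-task count itself can be sampled with something crude like this (assumes Debian's cron daemon; in my setup the value ends up in an RRD file rather than on stdout):

# count every process that currently hangs under the cron daemon
cron_pid=$(pgrep -o -x cron)
if [ -n "$cron_pid" ]; then
    count=$(pstree -p "$cron_pid" | grep -oE '\([0-9]+\)' | wc -l)
    echo "$(date '+%F %T') cron_tasks=$((count - 1))"   # minus cron itself
fi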
I've merged the bash scripts into single ones (where possible) and moved most of the php-cgi requests into proper poller scripts.
I deployed the query changes and the cron changes right after 16:00, and then some more cron fixes before 22:00.
Below is a graph of the load (the upper part) and the number of tasks being executed by cron.