After upgrading all my Percona servers from 8.0.39 to 8.0.41, two of them show very high CPU load. At night we sync data from external sources, which means a lot of deletes, inserts and some analytical queries. A peak in CPU usage was normal, but since the update it spikes to 100% for hours. The other servers, where the issue is not present, have the same hardware, the same Percona version, and a similar type of data and load, but don’t have the hours-long spiking.
System:
Debian bookworm - kernel 6.1.0-31-amd64 with percona 8.0.41-32
Using pt-config-diff I verified that the config is the same on the working servers and the troubled servers.
I’m a bit stuck on how to troubleshoot any further.
Additional info: I have PMM 2.44 to help me troubleshoot (although this has not really helped so far).
Thanks in advance for any help.
@alexanderdeprez
What is the role of the 2 servers where you observed high CPU load? I mean, are they part of a cross-region (DR) setup, are they the source (read/write), or do they only serve reads (replica)?
If everything (OS, DB) looks exactly the same, then reviewing the workload might be a good idea. You can check the slow query log or a pt-query-digest report to get down to the bottleneck queries.
Since you have PMM, you can check the QAN section as well to measure query performance and differences.
You can also check the various MySQL/InnoDB-specific dashboards for performance or other differences to find the hotspot.
High CPU usage is usually related to an unoptimized workload or unoptimized queries. Fixing such DMLs would for sure reduce the consumption.
This blog post, "A Simple Approach to Troubleshooting High CPU in MySQL", might come in handy to get down to the DMLs that consume the most CPU.
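As a rough illustration of thread-level CPU attribution (a generic sketch, not necessarily the exact steps from that post): sample the per-thread CPU time of the mysqld process from /proc, then look up the busiest OS thread IDs in performance_schema.threads through its THREAD_OS_ID column. The helper names below are made up for this example, and it assumes Linux and Python 3.9+.

```python
import sys
import time
from pathlib import Path

# Lookup to run in MySQL once a hot OS thread ID is known.
# THREAD_OS_ID and the PROCESSLIST_* columns exist in performance_schema.threads (8.0).
THREAD_LOOKUP_SQL = (
    "SELECT THREAD_ID, PROCESSLIST_ID, PROCESSLIST_USER, PROCESSLIST_DB, "
    "PROCESSLIST_TIME, PROCESSLIST_INFO "
    "FROM performance_schema.threads WHERE THREAD_OS_ID = %s"
)


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (clock ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenising the rest.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def sample_threads(pid: int) -> dict[int, int]:
    """Snapshot cumulative CPU ticks for every OS thread of the given process."""
    ticks = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            ticks[int(task.name)] = cpu_ticks((task / "stat").read_text())
        except OSError:
            continue  # thread exited between listing and reading
    return ticks


def hottest(before: dict[int, int], after: dict[int, int], top: int = 5):
    """Return (os_thread_id, ticks_used) pairs, busiest first, between two snapshots."""
    deltas = {tid: after[tid] - before[tid] for tid in after if tid in before}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    mysqld_pid = int(sys.argv[1])  # e.g. the output of `pidof mysqld`
    first = sample_threads(mysqld_pid)
    time.sleep(10)
    for tid, used in hottest(first, sample_threads(mysqld_pid)):
        print(tid, used)
```

Run it during the nightly spike with the mysqld PID as the only argument, then feed each printed thread ID into THREAD_LOOKUP_SQL to see which statement that thread is executing. Because this measures CPU actually consumed per thread, it should be less affected by everything else being slow.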
Hi, thank you for your reply. The servers are actually just standalone. It seems that some bad queries could be the cause, but I find it hard to troubleshoot using PMM since the load in QAN is influenced by the fact that the CPU is at 100% for hours. This causes even basic queries to show a high load in QAN: they now take a lot of time, not because they are heavy on the system, but because the system is overloaded. So at this point load, duration, etc. are not an accurate representation.
I just updated my PMM 2.44 to PMM 3 with the associated agents, maybe this will help me with better insight.
The blog post you refer to is one I have read, and I applied the techniques, but as with my statement above, lots of queries seem to have a high load, again because the system is overloaded.
Any additional tips to help figure out which queries cause the issue would be helpful.
To give you an idea of the scale: the total size of all databases is around 1T, with about 150 databases on each server, and at night thousands of queries are run. Until the update we would only see minor spikes that maxed out at around 70% utilization, while now it goes up to 100% and stays like that for about 6 hours.
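One possible way around the skew described above is to rank statements by work done rather than by elapsed time, since row counts are not inflated when the CPU is saturated. The sketch below assumes you fetch performance_schema.events_statements_summary_by_digest as a list of dicts (for example with a dictionary cursor) from both a healthy and a troubled server. The function names and the 2x threshold are made up for illustration, and the idea that a statement started examining more rows after the upgrade is only a guess to test, not a diagnosis.

```python
# Columns below exist in performance_schema.events_statements_summary_by_digest (8.0).
DIGEST_SQL = """
SELECT SCHEMA_NAME, DIGEST_TEXT, COUNT_STAR,
       SUM_ROWS_EXAMINED, SUM_ROWS_SENT, SUM_ROWS_AFFECTED,
       SUM_NO_INDEX_USED, SUM_CREATED_TMP_DISK_TABLES
FROM performance_schema.events_statements_summary_by_digest
"""


def rank_by_rows_examined(rows, top=10):
    """Rank statement digests by total rows examined, ignoring elapsed time."""
    ranked = []
    for r in rows:
        calls = r["COUNT_STAR"]
        if not calls:
            continue  # skip digests that were never executed
        ranked.append({
            "digest": r["DIGEST_TEXT"],
            "calls": calls,
            "rows_examined": r["SUM_ROWS_EXAMINED"],
            "examined_per_call": r["SUM_ROWS_EXAMINED"] / calls,
        })
    ranked.sort(key=lambda d: d["rows_examined"], reverse=True)
    return ranked[:top]


def per_call_growth(healthy, troubled, min_ratio=2.0):
    """List digests whose rows examined per call grew by min_ratio or more.

    Returns (digest_text, healthy_per_call, troubled_per_call) tuples,
    largest growth first. Digests are matched on DIGEST_TEXT.
    """
    base = {
        r["DIGEST_TEXT"]: r["SUM_ROWS_EXAMINED"] / r["COUNT_STAR"]
        for r in healthy if r["COUNT_STAR"]
    }
    grown = []
    for r in troubled:
        text = r["DIGEST_TEXT"]
        if not r["COUNT_STAR"] or text not in base or base[text] <= 0:
            continue
        now = r["SUM_ROWS_EXAMINED"] / r["COUNT_STAR"]
        if now / base[text] >= min_ratio:
            grown.append((text, base[text], now))
    return sorted(grown, key=lambda g: g[2] / g[1], reverse=True)
```

Any digest that per_call_growth reports is a candidate for running EXPLAIN on both servers. To limit the numbers to the nightly window, the table can be reset beforehand with TRUNCATE TABLE performance_schema.events_statements_summary_by_digest, keeping in mind that this clears the accumulated statistics for everyone using them.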