Context: Our application experienced a problem that caused the number of active threads to increase dramatically, to more than ten times the normal level. This led to an outage.
Impact: The number of active threads rose from the usual fewer than 10 to more than 100, and overall CPU utilization exceeded 95%.
Issue: Despite max_execution_time being set to 60 seconds and innodb_thread_concurrency to 40, these thresholds were not respected by the MySQL node during the outage. The number of active threads exceeded 100, and many slow queries took much longer than 60 seconds to complete.
Ask: Why were the innodb_thread_concurrency and max_execution_time values not honored even though they were configured? Also, how does one enable a query killer in such scenarios?
PFA
Setting innodb_thread_concurrency to a fixed value does not mean mysqld will always keep Threads_running under that value. Once it is set, other parameters also play a role in how threads are processed, such as innodb_concurrency_tickets and innodb_thread_sleep_delay. You can find more details in the reference manual: https://dev.mysql.com/doc/refman/8.0/en/innodb-performance-thread_concurrency.html
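For reference, here is a minimal sketch of how to check these settings against what the server is actually doing (the values in the SET statements are illustrative only, not recommendations):

```sql
-- Current InnoDB concurrency-related settings
SHOW GLOBAL VARIABLES LIKE 'innodb_thread_concurrency';
SHOW GLOBAL VARIABLES LIKE 'innodb_concurrency_tickets';
SHOW GLOBAL VARIABLES LIKE 'innodb_thread_sleep_delay';

-- How many threads are actually executing right now
SHOW GLOBAL STATUS LIKE 'Threads_running';

-- These are dynamic variables, so they can be adjusted at runtime
-- (example values only; tune them for your workload)
SET GLOBAL innodb_thread_concurrency = 40;
SET GLOBAL innodb_concurrency_tickets = 5000;
```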
About MAX_EXECUTION_TIME: this hint with SELECT is applicable only under the conditions below. Did you test your SELECTs with the MAX_EXECUTION_TIME hint earlier? Check whether the following points apply to your case (see the example after the list).
For statements with multiple SELECT keywords, such as unions or statements with subqueries, MAX_EXECUTION_TIME applies to the entire statement and must appear after the first SELECT.
It applies to read-only SELECT statements. Statements that are not read only are those that invoke a stored function that modifies data as a side effect.
It does not apply to SELECT statements in stored programs and is ignored.
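For illustration, a minimal sketch of how the hint is placed. Note that the value is given in milliseconds, so 60 seconds is 60000; the table and column names here are made up:

```sql
-- Per-statement limit via optimizer hint: 60000 ms = 60 seconds.
-- The hint must appear directly after the first SELECT keyword.
SELECT /*+ MAX_EXECUTION_TIME(60000) */ id, status
FROM orders                  -- hypothetical table, for illustration only
WHERE status = 'PENDING';

-- Server-wide default for read-only SELECT statements (also in milliseconds)
SET GLOBAL max_execution_time = 60000;
```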
Here you should focus on the time frame when Threads_running spiked, identify which queries were responsible for the spike and the CPU utilization, and try to optimize those queries or address the root causes. You could also use the pt-kill tool if you want to kill specific threads based on user, DB, state, command, execution time, etc. (a sketch of such an invocation is below).
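For example, a minimal sketch of a pt-kill invocation that targets queries busy for more than 60 seconds (the host and user are placeholders; verify the options against the pt-kill documentation and test before pointing this at production):

```bash
# Kill (and log) any query running longer than 60 seconds.
# Running with --print but without --kill first is a safe dry run.
pt-kill \
  --host=127.0.0.1 --user=monitor --ask-pass \
  --busy-time 60 \
  --match-command Query \
  --interval 10 \
  --print \
  --kill
```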
Thanks, Abhinav, for the update. We have observed that SELECT queries are terminated when they exceed the MAX_EXECUTION_TIME limit. However, this does not happen during an outage. We suspect that the database node becomes unresponsive during periods of high resource usage, causing it to deviate from its normal behaviour (such as terminating queries based on the MAX_EXECUTION_TIME setting).
In such situations, will pt-kill or a scheduled query-killer cron job be effective where the MAX_EXECUTION_TIME setting is not?
Yes, the pt-kill command may help you with such issues during an outage.
However, there are cases where pt-kill may be unsuccessful and the query may hang in a Killed state; you can check for such threads as sketched below.
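If you suspect this, here is a minimal sketch for spotting such threads, assuming the classic INFORMATION_SCHEMA.PROCESSLIST table is available on your version:

```sql
-- Threads whose KILL was issued but that are still cleaning up
-- show up with Command = 'Killed' in the processlist.
SELECT id, user, db, command, time, state, LEFT(info, 120) AS query_snippet
FROM information_schema.processlist
WHERE command = 'Killed'
ORDER BY time DESC;
```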
I encourage you to read the pt-kill page in the Percona Toolkit documentation and to try pt-kill in a test environment first, so you can evaluate whether it helps in your particular case.