Help troubleshooting a behavior with max_connections and max threads.

Hi people, this is my first question here, asked after searching in a lot of places. Any help will be welcome!
Percona Server for MySQL 5.7.24-27 running on a private on-premise VMware VM, with external HP SAS 15k RPM disks, 32 cores and 32 GB of RAM. Only MySQL runs here, no other apps or software. Linux CentOS 7.
MySQL Conf
max_allowed_packet=4M
innodb_log_file_size=2G
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
max_connections=950
innodb_buffer_pool_size=15589934592
innodb_buffer_pool_chunk_size=2073741824
innodb_buffer_pool_instances=20
thread_handling=pool-of-threads
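
For completeness, this is how I double-check the values actually in effect at runtime (Percona Server's thread pool also exposes thread_pool_% variables such as thread_pool_size):

mysql> show global variables like 'max_connections';
mysql> show global variables like 'innodb_buffer_pool%';
mysql> show global variables like 'thread_handling';
mysql> show global variables like 'thread_pool%';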

I cannot solve a behavior I see when high load comes from the app server. During this high-load period, the app server becomes very unstable and very slow, impossible to work with, because the DB is not returning data. The worst part: during that same period, the DB server answers flawlessly through the CLI.
So, after a lot of testing… I do not understand why, with my max connection limit near 1000 connections, MySQL stops at around 250 concurrent connections and climbs no further. (And I'm sure there are a lot more waiting…)
So when this limit is reached, the app server stops responding…
Disk seems normal, with lots of IOPS available… the processor is not idle, but plenty of capacity remains during this high-load period. Attaching a graph to be more precise: you can see it stops at 280 connections and cannot climb to the max.
This is really driving me nuts…
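
In case it helps when it happens again, these are the counters I intend to capture to separate connected clients from threads actually doing work (just a checklist; the Connection_errors_% counters are available in 5.7):

mysql> show global status like 'Threads_%';
mysql> show global status like 'Max_used_connections';
mysql> show global status like 'Connection_errors_%';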


    1) Share the processlist and CPU utilisation during the peak/incident (see the capture sketch after this list).
    Commands:
    mysql> select * from information_schema.processlist order by time desc;
    shell> sar, or a graph (select a few hours before and after the peak/incident)

    2) Share the app server resource utilisation during the peak.
    CPU and memory at least; good to have IOPS as well.

    3) Share the output of the MySQL command below:
    mysql> show engine innodb status \G

    5) Any error messages on the app side while connecting to the DB during the peak/issue? Please check; there can be a bottleneck on the app side as well.

    6) Share the output below:
    app –> shell> ulimit -a
    db –> shell> ulimit -a
    db –> mysql> show global variables;
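
    In case catching the peak by hand is difficult, a loop along these lines can snapshot the processlist and CPU every few seconds during the exam window (a rough sketch; it assumes client credentials in ~/.my.cnf, and /tmp/peak.log is just a placeholder path):

    shell> while true; do date; mysql -e "select * from information_schema.processlist order by time desc"; sar -u 1 1; sleep 10; done >> /tmp/peak.log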

    PS: nothing to do with the above issue, but:
    * Round off the innodb_buffer_pool_size if possible.
    * For best efficiency, specify a combination of innodb_buffer_pool_instances and innodb_buffer_pool_size so that each buffer pool instance is at least 1GB. E.g. if the buffer pool is 15GB, then innodb_buffer_pool_instances should not be greater than 15.
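
    For example, a combination along these lines would satisfy both points (a sketch only; 16G assumes the host can spare it out of 32GB, and MySQL rounds the pool up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, so aligned numbers avoid surprises):

    innodb_buffer_pool_size=16G
    innodb_buffer_pool_chunk_size=128M
    innodb_buffer_pool_instances=8

    That gives 2GB per instance (at least 1GB each), and 16G is an exact multiple of 128M * 8.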

Thanks a lot for your time, vaibhav_upadhyay40, I really appreciate it.

Questions 1 and 2 are not easy to answer, because I did not record that info at the time and I cannot reproduce it. Let me explain better.
I'm running a Moodle site with 15,000 students. The site runs well with lots of students using the platform, 1000 or 1500 logged in, no problem… but the problem starts during an exam, when nearly 2000 need to access the app server at the same time; then it hangs after login. The error is the typical 503, because SQL cannot answer fast enough and the app server ends the connections due to timeouts.
In front of the MySQL server we have NGINX + Varnish acting as SSL offload and reverse proxy/cache. On another server run Apache with php-fpm and Moodle… and the last one is the database. All of it runs on a cluster of 3 VMware bare-metal hosts with plenty of capacity.
The 3 VMs use 100 cores and 100 GB of RAM. The storage is an HP 2040 SAS array with SSD disks in RAID 1.

I cannot reproduce the behavior because it's a live environment, and when it hangs, a lot of students want to kill me… so for now we are running smaller exams with fewer students… until we find some kind of solution to test.

So, for questions 1 and 2, I have no historic data to show, but trust me: processlist, CPU utilisation, memory, IOPS and Internet bandwidth were all pretty free, not even near 20% used… which is why I'm so surprised by the graph in my first post showing a clear ceiling on connections. I'm also chasing a possible problem on the app server, but everything there seems normal…

About 3 and 6, I'm uploading the results in two files. About ulimit, I think the limits are enough…
prlimit --pid 11943  (the values are the same for all the apps involved: nginx, apache, varnish, etc.; this one is from the MySQL process)

RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0 unlimited blocks
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited blocks
LOCKS      max number of file locks held      unlimited unlimited
MEMLOCK    max locked-in-memory address space     65536     65536 bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0
NOFILE     max number of open files              100240    150240
NPROC      max number of processes               127946    127946
RSS        max resident set size              unlimited unlimited pages
RTPRIO     max real-time priority                     0         0
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         127946    127946
STACK      max stack size                       8388608 unlimited bytes
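
For what it's worth, the limit mysqld itself ended up with can also be double-checked from inside MySQL, since mysqld may recalculate open_files_limit at startup:

mysql> show global variables like 'open_files_limit';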

So… any advice or trick will be welcome… because this is driving me nuts…


By the way, I also changed this on the three servers:

echo 2000 65000 > /proc/sys/net/ipv4/ip_local_port_range

Who knows… this is a change I made after the last problem; I have not had the chance to test it on the live system yet.
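
One thing worth noting for anyone copying this: the echo does not survive a reboot; the persistent equivalent (plain sysctl, nothing exotic) would be:

shell> echo "net.ipv4.ip_local_port_range = 2000 65000" >> /etc/sysctl.conf
shell> sysctl -p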

MySQL
 “because SQL cannot answer fast enough and the app server ends the connections due to timeouts”

If the slow query log is enabled, please check the slow queries and identify which query (or queries) is not fast enough.
If not, then I suggest you enable it. Set long_query_time according to your expected/standard SLA for the database response.
Since you are using PMM for monitoring, you can refer to the “MySQL Overview” dashboard –> graph named “MySQL Slow Queries”.
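
If it is not enabled yet, it can be switched on at runtime without a restart, for example (the 1-second threshold is just a placeholder, set it to your SLA):

mysql> set global slow_query_log = ON;
mysql> set global long_query_time = 1;
mysql> show global variables like 'slow_query_log_file';

Note that long_query_time changed with set global only applies to new connections, so the app's existing connections keep the old threshold until they reconnect.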

php-fpm
Also check the php-fpm logs, and verify that max children and max requests are sufficient and efficient.
Isolate which call/code is experiencing slowness or timeouts during the incident.
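
For reference, these are the pool directives to review, typically in /etc/php-fpm.d/www.conf (the numbers below are placeholders; size pm.max_children against the RAM each PHP worker actually uses):

pm = dynamic
pm.max_children = 200
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 40
pm.max_requests = 500

With ~2000 students hitting the app at once, pm.max_children is often the first ceiling reached; requests queue behind it before the database is even asked.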

Nginx

Check the logs for any errors/warnings, and verify that worker_processes and worker_connections are tuned efficiently.
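
As a starting point (values illustrative only, tune to your traffic):

worker_processes auto;
worker_rlimit_nofile 65535;
events {
    worker_connections 4096;
}

worker_processes auto sizes the workers to the CPU count, and the effective connection limit is worker_processes * worker_connections.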