
Help troubleshooting a behavior with max connections and max threads.

albeda1 Contributor
Hi people, it's my first question here, after looking in a lot of places before. Any help will be welcome!
Percona MySQL 5.7.24-27 running on a private on-premise VMware VM, with external HP SAS 15k rpm disks, 32 cores and 32 GB of RAM. Only MySQL runs here, no other apps or software. Linux CentOS 7.
MySQL Conf
max_allowed_packet=4M
innodb_log_file_size=2G
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
max_connections = 950   
innodb_buffer_pool_size=15589934592         
innodb_buffer_pool_chunk_size=2073741824
innodb_buffer_pool_instances=20
thread_handling=pool-of-threads
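As a side note on these numbers: per the MySQL docs, the server keeps innodb_buffer_pool_size aligned to a multiple of innodb_buffer_pool_chunk_size × innodb_buffer_pool_instances (either rounding the pool size or truncating the chunk size at startup so the values fit). A quick shell sketch of what the configured values above imply:

```shell
# Rough sketch: what chunk size * instances implies for the values above.
# All sizes in bytes, taken from the config shown.
pool=15589934592        # innodb_buffer_pool_size (~14.5 GiB)
chunk=2073741824        # innodb_buffer_pool_chunk_size (~1.9 GiB)
instances=20            # innodb_buffer_pool_instances

unit=$((chunk * instances))
echo "chunk * instances = $unit bytes (~$((unit / 1024 / 1024 / 1024)) GiB)"
# That alignment unit is far larger than the configured pool (and larger
# than the 32 GB of RAM on the box), so the server cannot honor both
# settings as written and will adjust one of them at startup.
```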

I cannot solve a behavior I see when high load comes from the app server. During this high-load period, the app server becomes very unstable and very slow, impossible to work with because the DB is not returning data. The worst part: during that period, the DB server answers flawlessly through the CLI.

So, after a lot of testing... I do not understand why, when my max_connections limit is set near 1000, MySQL stops at around 250 concurrent connections and no more. (And I'm sure there are a lot more waiting...)

So when this limit is reached, the app server stops responding...

Disk seems normal, lots of IOPS available... the processor is not idle, but plenty of capacity remains during this high-load period.
Attaching a graph to be more precise. You can see it stops at 280 connections and cannot climb to the max.
I'm really going nuts...
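Since thread_handling=pool-of-threads is set, the thread pool's own limits may matter here independently of max_connections. A sketch of what could be inspected during the next incident (variable and status names from Percona Server's thread-pool implementation; verify against your version):

```sql
-- Sketch: thread pool configuration vs. live counters
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';  -- thread_pool_size, thread_pool_max_threads, ...
SHOW GLOBAL STATUS LIKE 'Threadpool%';      -- Threadpool_threads, Threadpool_idle_threads
SHOW GLOBAL STATUS LIKE 'Threads%';         -- Threads_connected vs. Threads_running
```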

Comments

  • vaibhav_upadhyay40 Patron
    1) Share the processlist and CPU utilisation during the peak/incident.
    Commands:
    mysql> select * from information_schema.processlist order by time desc;
    shell> sar, or a graph (select a few hours before and after the peak/incident)

    2) Share the app server resource utilisation during the peak:
    CPU and memory at least; good to have IOPS also.

    3) Share the output of the MySQL command below:
    mysql> show engine innodb status \G 

    4) Any error messages on the app side while connecting to the DB during the peak/issue? Please check; there can be a bottleneck on the app side as well.

    5) Share the output below:
    app --> shell> ulimit -a
    db  --> shell> ulimit -a
    db  --> mysql> show global variables;

    PS: nothing to do with the above issue:
    *Round off innodb_buffer_pool_size if possible.
    *For best efficiency, choose a combination of innodb_buffer_pool_instances and innodb_buffer_pool_size so that each buffer pool instance is at least 1 GB. E.g. if the buffer pool is 15 GB, then innodb_buffer_pool_instances should not be > 15.
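    A sketch of what "rounded" values could look like on a 32 GB box (illustrative numbers only, not a tested recommendation):

    ```ini
    # Illustrative my.cnf fragment: keep innodb_buffer_pool_size a clean
    # multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances.
    innodb_buffer_pool_size = 16G
    innodb_buffer_pool_instances = 16     # 1G per instance
    innodb_buffer_pool_chunk_size = 128M  # the default; 128M * 16 = 2G divides 16G
    ```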
  • albeda1 Contributor
    Thanks a lot for your time, vaibhav_upadhyay40, I really appreciate it.

    1 and 2 are not easy to answer, because I did not record that info at the time, and I cannot reproduce it. Let me explain better.
    I'm running a Moodle site with 15,000 students. The site runs well with lots of students using the platform, 1,000 or 1,500 logged in, no problem... but the problem starts during an exam, when nearly 2,000 need to access the app server at the same time. It hangs after login. The error is the typical 503, because SQL cannot answer fast enough and the app server ends the connections due to timeouts.
    In front of the MySQL server we have NGINX + Varnish acting as SSL offload and reverse proxy/cache; on another server run Apache with PHP-FPM and Moodle; and the last one is the database. All of this runs on a cluster of 3 VMware bare-metal hosts with plenty of capacity.
    The 3 VMs use 100 cores and 100 GB of RAM. The storage is an HP 2040 SAS array with SSD disks in RAID 1.

    I cannot reproduce the behavior because it's a live environment, and when it hangs, a lot of students want to kill me... so for now we are running smaller exams with fewer students, until we find some kind of solution to test.

    So, for questions 1 and 2, I have no historic data to show, but trust me: processlist, CPU utilisation, memory, IOPS and Internet bandwidth were all pretty free, not even near 20%. This is why I'm surprised by the graph in my first post, which shows a clear limit on connections. I'm also chasing a problem on the app server, but everything there seems normal...

    For 3 and 6, I'm uploading the results in two files. About ulimit, I think the values are enough...
    prlimit --pid 11943 (the values are the same for all the apps involved, nginx, apache, varnish, etc.; this one is from MySQL)
    RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
    AS         address space limit                unlimited unlimited bytes
    CORE       max core file size                         0 unlimited blocks
    CPU        CPU time                           unlimited unlimited seconds
    DATA       max data size                      unlimited unlimited bytes
    FSIZE      max file size                      unlimited unlimited blocks
    LOCKS      max number of file locks held      unlimited unlimited
    MEMLOCK    max locked-in-memory address space     65536     65536 bytes
    MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
    NICE       max nice prio allowed to raise             0         0
    NOFILE     max number of open files              100240    150240
    NPROC      max number of processes               127946    127946
    RSS        max resident set size              unlimited unlimited pages
    RTPRIO     max real-time priority                     0         0
    RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
    SIGPENDING max number of pending signals         127946    127946
    STACK      max stack size                       8388608 unlimited bytes

    So... any advice or trick will be welcome, because this is driving me nuts...



  • albeda1 Contributor
    By the way, I also changed this on the three servers:

    echo 2000 65000 > /proc/sys/net/ipv4/ip_local_port_range

    Who knows... this is a change I made after the last problem; I haven't had the chance to test it on the live system yet.
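    For what it's worth, an echo into /proc does not survive a reboot; the usual way to make it persistent on CentOS 7 is a sysctl drop-in (the path below is the stock convention; treat this as a sketch):

    ```ini
    # /etc/sysctl.d/99-local-ports.conf -- persist the widened ephemeral port range
    net.ipv4.ip_local_port_range = 2000 65000
    ```

    Then `sysctl --system` applies it without rebooting.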
  • vaibhav_upadhyay40 Patron
    MySQL
     "cause the SQL cannot answer fast enough, and the appserver ends the connections cause timeout"

    If the slow query log is enabled, please check the slow queries and identify which query(s) are not fast enough.
    If not, then I suggest you enable it. Set long_query_time according to your expected/standard SLA for database response.
    Since you are using PMM for monitoring, you can refer to the "MySQL Overview" dashboard, graph name "MySQL Slow Queries".
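    If it is not enabled yet, it can be turned on at runtime without a restart; a sketch (the 1-second threshold is an assumed SLA, adjust to yours):

    ```sql
    -- Sketch: enable the slow query log on a running server
    SET GLOBAL slow_query_log = ON;
    SET GLOBAL long_query_time = 1;          -- seconds; pick your SLA
    SET GLOBAL log_slow_verbosity = 'full';  -- Percona Server extra detail
    ```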

    php-fpm
    Also check the php-fpm logs; verify that max_children and max_requests are sufficient and efficient.
    Isolate which call/code is experiencing slowness or timeouts during the incident.

    Nginx

    Check the logs for any errors/warnings; verify that worker_processes and worker_connections are efficiently tuned.
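    A sketch of the relevant nginx directives (the numbers are illustrative, not a recommendation):

    ```nginx
    # Illustrative nginx.conf fragment only
    worker_processes auto;          # one worker per CPU core
    events {
        worker_connections 4096;    # per worker; keep below the NOFILE ulimit
    }
    ```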

MySQL, InnoDB, MariaDB and MongoDB are trademarks of their respective owners.
Copyright ©2005 - 2020 Percona LLC. All rights reserved.