my name is Roman and I need help.
In a nutshell: for the last 2 months I’ve been trying to fix issue with my application which is using Percona XtraDB Cluster 5.7. From time to time (from 1 to 10 days) mysql becomes unavailable: all new connections get error message “Too many connections”, but it looks like that mysql does nothing (low CPU usage and disk i/o). Only “kill -9” and restart of mysql server can make it available.
Below I’ll provide more details. I’ll be pleasant for any ideas what can I try to fix the issue.
Some details about my environment. I’m using Ubuntu 14.04 with Percona XtraDB Cluster 5.7:
uname -a Linux **** 4.14.90-37 #1 SMP Tue Dec 25 17:20:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
lsb_release -a Distributor ID: Ubuntu Description: Ubuntu 14.04.5 LTS Release: 14.04 Codename: trusty
mysql --version mysql Ver 14.14 Distrib 5.7.23-23, for debian-linux-gnu (x86_64) using readline 6.3
dpkg -l | grep percona ii percona-repo-config 1.4 all Configures Percona mirror repo ii percona-toolkit 3.0.12-1.trusty amd64 Advanced MySQL and system command-line tools ii percona-xtrabackup-24 2.4.12-1.trusty amd64 Open source backup tool for InnoDB and XtraDB ii percona-xtradb-cluster-client-5.7 5.7.23-31.31-2.trusty amd64 Percona XtraDB Cluster database client binaries ii percona-xtradb-cluster-common-5.7 5.7.23-31.31-2.trusty amd64 Percona XtraDB Cluster database common files (e.g. /etc/mysql/my.cnf) ii percona-xtradb-cluster-server-5.7 5.7.23-31.31-2.trusty amd64 Percona XtraDB Cluster database server binaries
ps uax | grep mysqld mysql 861896 894 85.4 338018320 225641360 ? S<l Feb08 15960:03 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql-3307 --plugin-dir=/usr/lib/mysql/plugin --user=mysql --tmpdir=/var/tmp --lc-messages-dir=/usr/share/mysql --skip-external-locking --bind-address=:: --extra-port=3306 --character-set-server=utf8 --collation-server=utf8_general_ci --explicit-defaults-for-timestamp=1 --innodb-file-per-table=1 --innodb-flush-log-at-trx-commit=2 --innodb-flush-method=O_DIRECT --innodb-log-file-size=10G --slave-sql-verify-checksum=NONE --transaction-isolation=READ-COMMITTED --innodb-buffer-pool-size=192G --innodb-buffer-pool-instances=8 --innodb-checksum-algorithm=crc32 --innodb-io-capacity=5000 --innodb-io-capacity-max=5500 --innodb-log-compressed-pages=OFF --innodb-thread-concurrency=120 --innodb-flush-neighbors=0 --innodb-lru-scan-depth=256 --innodb-purge-threads=8 --innodb-page-cleaners=8 --innodb-buffer-pool-dump-at-shutdown=ON --innodb-buffer-pool-load-at-startup=ON --interactive-timeout=28800 --wait-timeout=28800 --max-allowed-packet=900M --max-connections=1000 --extra-max-connections=1000 --net-read-timeout=3600 --net-write-timeout=3600 --performance-schema-max-digest-length=10240 --thread-cache-size=200 --ft-min-word-len=1 --ft-stopword-file= --read-buffer-size=512K --read-rnd-buffer-size=1M --sort-buffer-size=1M --key-buffer-size=12G --slow-query-log=OFF --slow-query-log-file=/var/log/mysql/mysql-3307-slow.log --long-query-time=15 --binlog-cache-size=4096M --general-log=1 --general-log-file=/var/log/mysql/mysql-3307-general.log --sql-mode=NO_ENGINE_SUBSTITUTION,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ZERO_DATE,NO_ZERO_IN_DATE --replicate-ignore-db=performance_schema --replicate-ignore-db=information_schema --binlog-format=ROW --binlog-ignore-db=mysql --expire-logs-days=1 --enforce-gtid-consistency=ON --gtid-mode=Off --log-bin=/var/log/mysql/mysql-3307-bin.log --max-binlog-files=500 --relay-log=/var/log/mysql/mysqld-3307-relay-bin --server-id=307 --log-error=/var/log/mysql/mysql-3307-error.log --pid-file=/var/run/mysqld/mysqld-3307.pid --socket=/var/run/mysqld/mysqld-3307.sock --port=3307 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
SELECT VERSION(); 5.7.23-23-57-log
Several words about my application and data. I have PHP based application which is working with 5 mysql shards. Shards are independent on each other. Each shard have master-slave replication to backup server. So, I have 10 servers: 5 masters and 5 replicas. Application is working only with master servers, once a day special cron task takes a backup from each replica by using Percona XtraBackup.
My mysql-servers are very powerful:
lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 56 ...
cat /proc/meminfo MemTotal: 264019652 kB MemFree: 2619616 kB MemAvailable: 32820228 kB Buffers: 920 kB Cached: 31100868 kB SwapCached: 0 kB ...
df -h Filesystem Size Used Avail Use% Mounted on udev 126G 4.0K 126G 1% /dev tmpfs 26G 16M 26G 1% /run /dev/md0 32G 5.2G 27G 17% / none 4.0K 0 4.0K 0% /sys/fs/cgroup none 5.0M 0 5.0M 0% /run/lock none 126G 28K 126G 1% /run/shm none 100M 0 100M 0% /run/user /dev/md2 1.8T 1.3T 505G 72% /srv/local tmpfs 128M 4.0K 128M 1% /var/tmp/skynet/cqudp
Data is located in /srv/local partition, there are enough room for the data on this partition.
My databases are big: each shard have size ~500 Gb. There are about 50 tables in my DBs, but only several of them contains almost all data. All shards have roughly speaking the same size, the same number of requests, the same number of clients. One important fact: I faced my issue only on one shard.
My application is using a concept of “workers”. Worker is a short php-program which usually working in the following way:
- start and establish connection to mysql-server,
- get some data from DB,
- get some data from external API,
- do some calculations and put data back to DB,
- close connection to DB.
In common case I have not more than 200 workers working with each shard and I have max_connections variable in mysql-server set to value 1000. In common case it’s much more then enough to work wihout any issues. Workers are usually working very fast: some of them can do their job during several seconds, but some of them can work during minutes or even dozens of minutes. Usually I have hundreds workers starts and finishes per every minute.
Most part of queries made by workers are UPDATEs. I have a monitoring based on Percona’s script https://github.com/percona/percona-monitoring-plugins/blob/master/cacti/scripts/ss_get_mysql_stats.php, here is screenshot from it:
When I faced issue with “Too many connections error” I checked that I didn’t have more workers then usual. No one besides workers can connect to DB. First of all I tried to optimize my application and separate queries which update a lot of data to smaller queries and distribute them in time. It didn’t help.
Then I tried to compare settings between all my DB servers and found, that problem server has default value of variable binlog_cache_size - 32Kb, whereas on other servers this variable have value 100Mb. Then I also found that value of binlog cache usage is enormous on problem server - millions per day, whereas on other servers it’s something like couple of thousands each day.
When I set binlog_cache_size to 100 Mb on problem server it was working without problems for 10 days, before that it was working not longer then 2-3 days. But after 10 days it again becames unavailable with the message “Too many connections”. I also noticed that when mysql-server starts serving error “Too many connections” it’s extensively using CPU and doing lot of writes on disk. But after couple of minutes CPU usage and i/o goes down almost to 0, but mysq still serving error “Too many connections”.
I have logs from atop utility, which allows to log system state every minute. Below you can find a couple of screenshots made at the moment of error starts at 10:33.
I didn’y have general log enabled on my problem server and I decided to enable it (writing data to file) in hope to find something interesting in it at the moment of error “Too many connections” begins. I haven’t found anything yet in this log, but my mysql server broke down twise during 2 days I enabled this log. Looks like there too many disk i/o operation and this might be a reason of error. I have these SSDs in my server: https://www.micron.com/products/solid-state-drives/product-lines/5100 and I found that they can deal with up to 500 Mb/s of writes and 43000 iops. Look like I didn’t reach these limits in my case.
Does anyone has any ideas what should I try to find and fix the reason of issues with my DB server?