We’re running MySQL server (Percona) and we’re experiencing crashes because of the infamous OOMkiller. This has been going for over a year and we’ve tried a lot of things, but we don’t have a definitive solution yet. I’ve read all topics I could find about the subject but none of them contained the solution.
Let me explain the situation:
There are multiple servers running our software and all servers experience the same issue. The servers are virtual, they have 4 cores, 16 GB memory, SSD disks and they run Apache and PHP besides MySQL. The memory usage of apache and php is small (about 1 GB), most of the work is done by the MySQL server.
There are 20 to 23 databases on each server, each database contains about 350 tables (and the number of tables grows with every release). The tables are all InnoDB and the total database size varies for every database, the smallest one is about 360 MB and the largest one about 34 GB. The large size is partly due tot BLOBs.
The databases are heavily used. There are ofcourse a lot of SELECT queries (nearly all of them using JOINS) but there is also a very high number of INSERTs because of audit logging. Furthermore, triggers are used on most INSERT, UPDATE and DELETE queries.
It is possible to create a stable situation without crashes, but it’s a thin line. When another database is added to the server, it starts crashing again. We’re using Munin for monitoring and you can see the memory usage building up towards the point where the memory is full. It then continues using swap and when that is exhausted the OOMkiller steps in and kills the MySQL server process, so it can start all over again.
Depending on the global usage (workdays vs. weekends) there might be multiple crashes a day or none at all. A full backup using mysqldump can cause a crash as well when the memory usage is already high after a day’s hard work.
So far, we’ve deducted that there is not one single cause for the high memory usage. It seems to be a combination of the number of databases (not so much the size of the data), the number of connections and the usage.
We’ve tried many things and we’ve come up with a config with a very small InnoDB buffer pool: 1 GB and 4 instances. Furthermore we’ve limited the max connections to 150 and the thread cache size is 10. A larger thread cache (as calculated by the percona config generator) also causes quickly rising memory usage. We figured that our threads can use a lot of memory because of queries with a lot of JOINs. If all this memory is kept in cache, it will eventually fill up the system’s memory.
Then we’ve changed the application to use persistent connections, this also helps in keeping the memory usage down (but persistent connections can’t always be used, if multiple processes connect to the same database). And finally we’ve disabled the performance schema to save memory as well.
These settings can keep a server running, but it’s still the thin line I’ve talked about before. When more databases are added (which involves more connections and a higher workload), the crashes start coming again.
We were wondering if somebody has an idea where all the memory goes. We think it should be possible to run more than 20 databases on one server and maybe increase the InnoDB buffer pool (although we are not experiencing performance issues at the moment).
Or could it be that this amount of databases, connections an workload is the max of what one of these servers can handle? So we’ve reached the limit?
At the moment we’ve managed to keep one server running for 19 days, I guess because of the quieter holiday period. Because of the long uptime, I can provide some statistics (from ‘SHOW GLOBAL STATUS’)
uptime: 1722760 sec = 19,9 days
selects: 33146297 = 19.2 per sec
inserts: 5152891 = 3 per sec
updates: 4850987 = 2.8 per sec
deletes: 24471 = 0.01 per sec
queries: 97821785 = 56,8 per sec
slow queries: 8196 = 0.005 per sec
connections: 3667483 = 2,12 per sec
threads_created = 784776 = 0,45 per sec
I’ve thought about running a binary with debug symbols to see where the memory is going, but these are production machines and since the crashes only occur during busy times, I don’t think this is a good idea. Als setting up an extra server for load testing doesn’t do the trick because the exact usage is hard to replicate. We do have some tests which can make the server crash (running a large data import with a lot of post-processing produces a lot of queries in a short time and can crash the server with some settings), but this is also not 100% perfect.
Attached you’ll find the config and the innodb and global status. I hope there is somebody who might have an idea.
Version: Percona server 5.6.40-84.0-1.jessie on Debian Jessie
20180814-my-cnf.txt (1.8 KB)
20180814-global-status.csv.txt (12.4 KB)
20180814-innodb-status.txt (11.1 KB)