A few weeks ago we upgraded our slaves from the last 5.6 release candidate to 5.6.13. This resolved one of the issues we were seeing in the RC version, but it appears to have introduced a memory leak. The leak is very small, but it slowly used all available memory on these boxes (128GB of RAM, with a 72GB buffer pool).
Once the servers had used all of the available memory, Linux reaped the process (oom-killer). It took a few weeks for this to happen, since it is a small leak. We might not have even noticed the issue, because mysqld is restarted afterwards. In our case we don’t start the slave on restart, since we want to verify data before we start replication again. Our monitoring triggered an alert, and that is when we reviewed the memory use and system logs and found the oom-killer entry in /var/log/messages. This happened on all of the slaves that were on 5.6.13.
We have masters that have not been upgraded yet, and they don’t appear to be having the problem.
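For anyone who wants to check their own slaves for the same symptom, here is a minimal Python sketch of the log check described above. The match patterns are an assumption: kernel OOM messages usually contain "oom-killer" or "Out of memory", but the exact wording varies between kernel versions. The function name `find_oom_events` is made up for this example.

```python
import re

# Assumed patterns: kernel OOM messages usually contain "oom-killer" or
# "Out of memory"; the exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_events(lines, process_name="mysqld"):
    """Return log lines that look like OOM events and mention process_name."""
    return [
        line.rstrip("\n")
        for line in lines
        if OOM_PATTERN.search(line) and process_name in line
    ]
```

It accepts any iterable of lines, so it can be fed an open file handle for /var/log/messages (reading that file normally requires root) or lines from a rotated copy.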
I found a couple of notes on 5.6.14 that suggest there was a known memory leak, but they seem to suggest it is a continual issue:
Has anyone else seen this? Could the bugs listed above be the problem? If they are, does that mean that each thread created leaks memory?
Workload varies across the slaves (some have read traffic while others do not), but they all leaked memory at nearly the same rate.
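To compare leak rates between servers, one option is to sample mysqld's resident memory periodically and fit a line through the samples. The sketch below is only an illustration of that idea, not what our monitoring actually does. It assumes the samples come from the `VmRSS` line of /proc/<pid>/status on Linux, and the helper names `parse_vmrss_kb` and `growth_rate` are made up for this example.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract VmRSS (resident set size, in kB) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def growth_rate(samples):
    """Least-squares slope of (time, value) pairs: value units per time unit."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one timestamp")
    return numerator / denominator
```

With timestamps in days and values in kB, `growth_rate` returns kB per day, which makes it easy to see whether two slaves with different workloads really grow at the same rate.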
UPDATE - Tried two things in an attempt to resolve/reduce the memory leak.
1 - Turned off performance_schema. This did not change the memory leak pattern, but it did change (improve?) the overall memory usage. I made this change since users had reported a difference in memory usage when they migrated from 5.5 to 5.6.
2 - Upgraded to 5.6.14 on some of the servers - This did not resolve the problem, but it might have slowed it.
The next test is to upgrade and disable performance_schema on the same server; the initial tests were done in isolation.
I also forgot to mention that we use tcmalloc, so after the above test we will try alternate mallocs to rule that out. However, all servers are using tcmalloc, so the only difference originally was the MySQL version and role.
The only other thing I can think of is that the leak is in the replication code, since it is only happening on slaves.