Memory Leak - 5.6.13

A few weeks ago we upgraded our slaves to 5.6.13 from the last 5.6 release candidate version. This resolved one of the issues we were seeing in the RC version [1], but it appears to have introduced a memory leak. The leak is very small, but it slowly used all available memory (these are 128GB boxes; the buffer pool is 72GB).

So the servers were using all of the available memory and Linux would reap the process (oom-killer). It took a few weeks for this to happen; it is a small leak. We might not have even noticed the issue, because mysqld is restarted after the kill, but in our case we don't start the slave on restart since we want to verify the data before we start replication again. Our monitoring triggered an alert, and that is when we reviewed the memory use and system logs and found the oom-killer entry in /var/log/messages. This happened on all of the slaves that were on 5.6.13.
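For reference, the post-restart routine looks roughly like this (a sketch only; it assumes skip_slave_start / skip-slave-start is set in my.cnf so that replication does not resume on its own, and the data verification step itself depends on the tables involved):

    -- After the data has been verified, resume replication manually
    START SLAVE;

    -- Confirm that both the IO and SQL threads are running again
    SHOW SLAVE STATUS\G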

We have masters that have not been upgraded yet, and they don't appear to be having this problem.

I found a couple of notes related to 5.6.14 that suggest there was a known memory leak, but they seem to describe a continual issue:

https://bugs.launchpad.net/percona-server/+bug/1233294

https://bugs.launchpad.net/percona-server/+bug/1167487

Has anyone else seen this? Could the bugs listed above be the problem? If they are, does that mean that each thread created leaks memory?

Workload varies for the slaves; some have read traffic while others do not, but they all leaked memory at nearly the same rate.

UPDATE - Tried two things in an attempt to resolve/reduce the memory leak.

1 - Turned off performance_schema. This did not change the memory leak pattern, but it did change (improve?) the overall memory usage. I made this change since users had reported a difference in memory usage when they migrated from 5.5 to 5.6 (a quick check of the resulting state is sketched after item 2).

2 - Upgraded to 5.6.14 on some of the servers - This did not resolve the problem, but it might have slowed it.
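For anyone repeating the performance_schema test, checking the state and its memory accounting looks roughly like this (a sketch against a 5.6 server; performance_schema can only be changed at startup, not at runtime):

    -- Confirm whether performance_schema actually ended up disabled
    SHOW GLOBAL VARIABLES LIKE 'performance_schema';

    -- While it is still enabled, this reports the memory the instrumentation has
    -- allocated; the performance_schema.memory row near the end should show the total
    SHOW ENGINE PERFORMANCE_SCHEMA STATUS;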

The next test is to upgrade and disable performance_schema on the same server; the initial tests were done in isolation.

I also forgot to mention that we use tcmalloc, so after the above test we will try alternate mallocs to rule that out. However, all servers are using tcmalloc, so the only differences originally were the MySQL version and the role.

The only other thing I can think of is that the leak is in the replication code, since it is only happening on slaves.
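One thing that is easy to eyeball while comparing boxes (a rough sketch, nothing more): the replication threads show up in the processlist as 'system user' connections, and with slave_parallel_workers > 0 there is a coordinator plus the worker threads:

    -- Replication threads appear with User = 'system user'
    -- (coordinator/SQL thread, IO thread, and any parallel workers)
    SHOW PROCESSLIST;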

  • Aaron

[1] - http://www.percona.com/forums/questions-discussions/mysql-and-percona-server/percona-server-5-6/10987-5-6-12rc-error-1756-slave-coordinator-replication

I have come across the same problem on mysql-5.6.15, and I'm sure it is caused by slave replication. When you execute STOP SLAVE SQL_THREAD, the memory comes back, so it is caused by the SQL thread. I am working on it right now. Do you have any other information?
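The test itself is roughly the following (a sketch; the memory observation has to happen at the OS level, for example by watching the mysqld RSS):

    -- Stop only the applier (SQL) thread; the IO thread stays connected
    STOP SLAVE SQL_THREAD;

    -- ...watch whether the mysqld RSS drops...

    -- Resume applying relay-log events and watch whether memory grows again
    START SLAVE SQL_THREAD;
    SHOW SLAVE STATUS\G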

I have isolated one server that isn’t showing the same pattern. That server has two differences from other servers:

  • it uses slave_parallel_workers rather than slave-parallel-workers (underscores vs. hyphens), which could be related to the leak mentioned in the release notes for 5.6.14 (a check of the effective settings is sketched after this list)
  • the query cache is disabled
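A quick way to compare the effective settings between the leaking and non-leaking boxes (just a sketch; hyphen and underscore spellings in my.cnf should resolve to the same server variables, so this shows what each server actually ended up with):

    -- Value of the multi-threaded slave setting, however it was spelled in my.cnf
    SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers';

    -- Confirm the query cache really is off on the box that is not leaking
    SHOW GLOBAL VARIABLES LIKE 'query_cache_type';
    SHOW GLOBAL VARIABLES LIKE 'query_cache_size';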

guyue.zql, so you have been able to verify that stopping the sql_thread frees up the memory, but does it then leak again, or does the restart stop it from happening again as well?

I am going to test this on our servers and see if I see the same results.

UPDATE - Stopping and restarting sql_thread did not free memory in my case. I have changed the config on another server to match the underscores on the server that doesn't seem to have the problem. I am also going to upgrade another server to 5.6.15, with no change to its config, and see if the same behavior exists there.

  • Aaron

Hi, 5.6.15 has the same problem in my case. Besides, stopping the slave thread only frees a little memory, not all of it.

Hi, it has been verified as a bug; please see MySQL Bugs #71197, "mysql slave sql thread memory leak" (https://bugs.mysql.com/bug.php?id=71197), for detailed information.

5.6.17 includes the bug fix that resolves this issue. We have upgraded, and early indicators suggest that for our use case the issue has been resolved.

  • Aaron