Memory Leak in master & slave

Hey there,

We had a crush of traffic for a customer of ours during Black Friday and spun up a cluster to help deal with it. However, there is a slow and steady memory leak that forces us to reboot the cluster about once a day to clear the RAM, since the memory is not released even when the mysqld process is terminated. Is there something we’re not doing properly in the configuration? It affects both the master and the slave. Everything else works perfectly; the memory leak is the only pain point. The database we’re running on the system is 17.4GB.

Any help would be appreciated! :smiley:
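For anyone who wants to watch the same behaviour, mysqld’s resident size can be logged over time with something like this (a rough sketch; the log path and interval are arbitrary):

# append mysqld's resident set size (in kB) to a log once a minute
while true; do
    echo "$(date '+%F %T') $(ps -C mysqld -o rss=)" >> /tmp/mysqld-rss.log
    sleep 60
done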

Here’s our my.cnf on the master:

[mysqld]
datadir = /data/mysql
thread_cache_size = 50
tmp_table_size = 32M
max_heap_table_size = 32M
max_allowed_packet=24M
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://
wsrep_slave_threads=64
wsrep_sst_method=rsync
wsrep_cluster_name=percona_cluster
binlog_format=ROW
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
innodb_buffer_pool_size = 20000M

character-set-server = utf8
skip-external-locking
log_warnings
skip_name_resolv

innodb_additional_mem_pool_size = 20M
#innodb_log_file_size = 400M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 2

innodb_lock_wait_timeout = 50
innodb_io_capacity = 1500
innodb_read_io_threads = 8
innodb_write_io_threads = 8

innodb_buffer_pool_restore_at_startup = 500
innodb_locks_unsafe_for_binlog=1

table_cache = 512
thread_cache_size = 1000

query_cache_type = 0
back_log = 128
thread_concurrency = 50

tmpdir = /tmp
max_connections = 5000
max_allowed_packet = 24M
max_join_size = 4294967295

net_buffer_length = 2K
thread_stack = 128K
tmp_table_size = 64M
max_heap_table_size = 64M

log-error = /data/log/mysql/error.log

##new stuff on reboot

innodb_use_sys_malloc =0

#innodb_flush_method = O_DIRECT
#innodb_log_files_in_group = 2
#innodb_file_per_table = 1

log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /data/mysql/mysql-slow.log
table_definition_cache = 4096
table_open_cache = 4096

#new stuff for reboot to solve ram gobbling issue

log-bin=mysql-bin
log_slave_updates=1
max_binlog_size=10G

#end new stuff

[mysqldump]
quick
max_allowed_packet = 24M

[mysql]
no-auto-rehash

# Remove the next comment character if you are not familiar with SQL

#safe-updates

Here’s the my.cnf on the slave:

[mysqld]
datadir = /data/mysql
thread_cache_size = 50
tmp_table_size = 32M
max_heap_table_size = 32M
max_allowed_packet=24M
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://(removed to protect the innocent)
wsrep_slave_threads=64
wsrep_sst_method=rsync
wsrep_cluster_name=percona_cluster
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
innodb_buffer_pool_size = 20000M

character-set-server = utf8
skip-external-locking
log_warnings
skip_name_resolv

innodb_additional_mem_pool_size = 20M
#innodb_log_file_size = 400M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 2

innodb_lock_wait_timeout = 50
innodb_file_per_table
innodb_io_capacity = 1500
innodb_read_io_threads = 8
innodb_write_io_threads = 8

innodb_buffer_pool_restore_at_startup = 500
innodb_locks_unsafe_for_binlog=1

table_cache = 512
thread_cache_size = 1000

query_cache_type = 0
back_log = 128
thread_concurrency = 50

tmpdir = /tmp
max_connections = 5000
max_allowed_packet = 24M
max_join_size = 4294967295

net_buffer_length = 2K
thread_stack = 128K
tmp_table_size = 64M
max_heap_table_size = 64M

log-error = /data/log/mysql/error.log

innodb_use_sys_malloc =0
log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /data/mysql/mysql-slow.log
table_definition_cache = 4096
table_open_cache = 4096

#new stuff for reboot to solve ram gobbling issue

log-bin=mysql-bin
#log_slave_updates=1
max_binlog_size=10G

#end new stuff

[mysqldump]
quick
max_allowed_packet = 24M

[mysql]
no-auto-rehash

# Remove the next comment character if you are not familiar with SQL

#safe-updates

Anybody have any thoughts here? This exact same config (minus the wsrep stuff) works just fine on a stand-alone Percona Server (non-XtraDB Cluster) without a memory leak. Do any of the Percona engineers have any insight?

Each node (both master and slave) is running CentOS 6.3 (with all the latest updates) on a dedicated server over at Joyent.com.

Any thoughts anyone?

I don’t have a solution or workaround, but I created a topic on this forum some time ago with my observations on this memory issue, without any resolution so far. I have also filed a bug report on Launchpad: bugs.launchpad.net/percona-xtradb-cluster/+bug/1078759. No progress yet. (Maybe voting could get their attention.)

I have also evaluated MariaDB Galera Cluster and I’m getting identical results. I have created a ticket there as well (mariadb.atlassian.net/browse/MDEV-3848?page=com.atlassian.streams.streams-jira-plugin:activity-stream-issue-tab). Since a growing number of people seem to be affected by this bug, I would say it’s pretty severe.

Couple of things:

a) Does your leak match the conditions/constraints described in Bug #1078759, “Excessive Memory usage” (the Percona XtraDB Cluster bug tracker has since moved to https://jira.percona.com/projects/PXC)?
If yes, then you can follow it there.
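If you’re not sure whether it matches, the server state can be dumped for comparison with plain SHOW commands (nothing bug-specific, just the usual starting point):

mysql> SHOW STATUS LIKE 'wsrep%';
mysql> SHOW VARIABLES LIKE 'wsrep_provider_options';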

b) If not, then I see a couple of things in the config:

innodb_use_sys_malloc =0

#innodb_flush_method = O_DIRECT
#innodb_log_files_in_group = 2
#innodb_file_per_table = 1
#innodb_log_file_size = 400M

Why are you not using sys malloc? Unless you are on a system with a very old, broken glibc, turning it off is not recommended, which is why the default is 1.

Also, I see that innodb_flush_method is commented out; any reason for this? O_DIRECT is recommended for precisely this reason: since the memory is already being managed by InnoDB’s buffer pool, it keeps the ibdata/ibd files from also being cached in the OS page cache.
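Putting those two together, a minimal sketch of the suggested settings (values assume a 5.5-era build):

[mysqld]
# let InnoDB use the system allocator (this is the default)
innodb_use_sys_malloc = 1
# bypass the OS page cache so data pages are cached only once, in the buffer pool
innodb_flush_method = O_DIRECT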

Also, even though it is not directly relevant here, innodb_log_file_size is 5M by default; is that fine for your workload?
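If you do raise it on a 5.5-era server, the redo log files have to be recreated by hand, roughly like this (paths assume the datadir above):

# in my.cnf
innodb_log_file_size = 400M

# then:
# 1. make sure innodb_fast_shutdown is not set to 2 and stop mysqld cleanly
# 2. move /data/mysql/ib_logfile0 and ib_logfile1 out of the way
# 3. start mysqld again; it will recreate the log files at the new size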

c) Finally, you can also check if you are affected by Bug #1112514, “DML on temporary table tries to append keys for pr...” (Bugs: MySQL patches by Codership), which has been fixed in the latest PXC.
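A quick way to see which server and provider versions a node is actually running, so you can tell whether that fix is in your build:

mysql> SHOW VARIABLES LIKE 'version%';
mysql> SHOW STATUS LIKE 'wsrep_provider_version';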