random freeze after memory upgrade (5.0.87, 5.1.34) [SOLVED]

Hi, I tried to solve this issue for few weeks, searched everywhere and cannot find the cause of this issue.

We have two nearly equal mysql/innodb boxes in master-master replication. The setup used to be:
32GB RAM , innodb_buffer_pool=20G

Then we upgraded the memory:
64GB RAM , innodb_buffer_pool=40G

The upgrade became a problem.

After the buffer pool fills up to about 25GB, we see random freeze for 1-2 seconds. It appears about 20-100 times per hour, or so, I don’t see any regular fashion in this. The frequency of freezing gets worse while the workload becomes more write-intensive (batch data importing). I think it definitely must be some locking issue, because the box load (4CPU cores, AMD) is ususally lower than 1.0 and iowait something between 0 and 5 percent, we don’t se any spikes in the cpu usage.

I don’t think it’s an hardware issue, both machines are affected in the same way. I suspect some backround job like innodb checkpointing, but I’m not sure.

What we tried and didn’t help:

  • disable query cache
  • set another innodb_thread_concurrency (0, 8, 12)
  • upgrade from 5.0.87 to 5.1.34
  • increase innodb_log_size from 200M to 600M
  • change innodb_max_dirty_pages_pct to 20 or 30
  • change thread_cache_size between 0 and 20
  • change binlog_format between statement and mixed
  • optimize table, analyze table

What helps:

  • decreasing innodb_buffer_pool to 15G (of course, it decreases overall performance and raises disk usage)

30GB buffer pool seems to produce less freezing than 40GB - I assume it might be connected with random innodb index dives in the optimizer?? There has been a bug concerning this when SHOW STATUS was issued, but It has been already fixed.

Our config:
skip-name-resolve
character-set-server=UTF8
collation_server=utf8_general_ci

slave-skip-errors=1210

server-id=4
log-bin=/var/lib/mysql/server4-bin
relay-log=/var/lib/mysql/server4-relay-bin
max_binlog_size=1000M
expire_logs_days=1
binlog-format=MIXED
binlog-ignore-db=temp
replicate-wild-ignore-table=main_db.tempStorage%
replicate-wild-ignore-table=test.%
replicate-wild-ignore-table=mysql.%
sync_binlog=0

slow_query_log = /var/log/mysql/slow-query.log
long_query_time = 1

skip-external-locking
max_connections = 1000
max_connect_errors = 10000
max_user_connections = 500

myisam_sort_buffer_size = 5M
thread_cache_size = 20
read_rnd_buffer_size = 5M
read_buffer_size = 2M
join_buffer_size = 2M
key_buffer_size = 100M
table_cache = 1000
sort_buffer_size = 2M

innodb_file_per_table
innodb_buffer_pool_size = 40G
innodb_log_file_size = 600M
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=2
innodb_open_files=1000
innodb_thread_concurrency=12

tmp_table_size = 100M
max_heap_table_size = 100M

query_cache_size = 20M
query_cache_limit = 100K
group_concat_max_len=512

Database type:
60GB InnoDB + 5GB MyISAM for logging (INSERT DELAYED)

HW and SW setup
SuperMicro 1U
2CPU AMD 2800MHz, 2 cores each, 4 cores in total
64GB RAM 667 MHz
RAID controller with write back cache and BBU
4 HDDs 15K RPM
CentOS 5.3
Ext3 filesystem, LVM
mysql dir (/var/lib/mysql) is a soft link to /data/mysql

Examples of the processlist and innodb status are enclosed

Thanks a lot for any ideas!

Vojtech

Today I ran some query profiling, here is an example (query fetches one row from an innodb table, using secondary index):

1 row in set (1.13 sec)±-------------------±---------+| Status | Duration |±-------------------±---------+| starting | 0.000089 || Opening tables | 0.000018 || System lock | 0.000009 || Table lock | 0.000014 || init | 0.000027 || optimizing | 0.000013 || statistics | 1.135322 || preparing | 0.000038 || executing | 0.000006 || Sending data | 0.000027 || end | 0.000009 || query end | 0.000006 || freeing items | 0.000036 || logging slow query | 0.000007 || logging slow query | 0.000047 || cleaning up | 0.000008 |±-------------------±---------+16 rows in set (0.01 sec)

“statistics” status is the winner. I’m looking for the innodb plugin 1.0.4, compatible with 5.1.34 - hope it solves this…

Turning adaptive hash index off - didn’t help. Still looking forward to innodb plugin 1.0.4 …

Tested 5.1.37 + innodb plugin 1.0.4 - slight improvement. Pileups last about 1.5 seconds instead of 2.3 seconds on 5.1.35

It’s definitely connected with innodb checkpointing, the rhythm of checkpoints and pileups is equal. Maybe I should try the adaptive checkpointing in XtraDB.

From the ‘bible’ (High Performance MySQL, 2nd ed.):

[B]Quote:[/B]
In most cases, you should make the InnoDB buffer pool as large as your available memory allows. However, in rare circumstances, very large buffer pools (say, 50 GB) can cause long stalls. For example, a large buffer pool may become slow during checkpoints or insert buffer merge operations, and concurrency can drop as a result of locking. If you experience these problems, you may have to reduce the buffer pool size.

Thanks!

It’s funny, I already own this book, but have not came across this paragraph yet )

This info makes me little bit disappointed, because it definitely describes our issue and it seems to be insolvable today.
I’ll try XtraDB 1.0.3-7, that will be the last try before lowering the buffer pool and rethinking the app architecture.

XtraDB 7, Revision 43 performs little bit better than innodb plugin 1.0.4, the adaptive checkpointing is set to ‘estimate’.

However, the problem still occurs.

Thanks for posting your new findings. I will probably run into this issue some day so I will keep reading.

Upgraded to XtraDB 9 (MySQL 5.1.42), problem still exists. However, I haven’t done any benchmarks yet to see any improvement.

I think I have identified the source of these lockups. I have found a simple testcase, that immediately causes a lockup:

1.) open a connection (mysql cli)
2.) execute few hundred UPDATEs, that cause invocation of some triggers with stored procedures
3.) close the connection

The connection termination causes the lockup. I think this behaviour is similar to this bug: http://bugs.mysql.com/bug.php?id=41804

Konstantin wrote a post, where he claims this and other related bugs were fixed last year and these changes are pushed to next release tree: http://kostja-osipov.livejournal.com/29104.html.
I cannot wait to see them in GA release )

I don’t know, why this problem gets worse when I increase the size of the buffer pool - maybe the connection termination causes some buffers to be scanned. I’m not mysql-insider, there might be a lot of other dependencies.

Now I’m going to track all trigger/SP combinations in our application and remove them if possible.

So the final conclusion:

mysql> CREATE TEMPORARY TABLE temp_test (col1 INT) ENGINE=INNODB;
Query OK, 0 rows affected (0.02 sec)

mysql> Bye

No triggers/SP needed, the problem is INNODB temporary table, which is dropped automatically when the connection terminates.

Does anyone now any bug report concerning temporary innodb tables?

The problem exists while dropping any InnoDB table.
I have filed a bug report:

http://bugs.mysql.com/bug.php?id=51325&thanks=sub

If the connection with innodb temporary table terminates, SHOW PROCESSLIST lists many stucked connections in “statistics” state.
However, if you drop the table before closing the connection, stucked queries in SHOW PROCESSIST are in “opening table” state.

Today, I have removed most these problematic tmp tables from our application and now HTTP responses are the best we’ve ever had.