I have recently started to experience very poor performance from our MariaDB installation once some time has passed after a restart. We run MariaDB 5.2.9 on a CentOS release 5.7 (Final) server with 32 GB of memory and just short of 200 GB of total database size. After a lot of investigation without finding any problem of significance, I am a bit out of ideas.
As the problem only occurs after some time has passed, I suspect it is caused by memory or buffer pool problems. All of our tables are InnoDB. These are the InnoDB settings from the my.cnf file.
If I restart MySQL and immediately run a full (schema by schema) backup with the mysqldump command, or a Maatkit mk-table-checksum command (to compare multiple replicas) over the complete database, the backup completes in approximately 5 hours, as does the checksum. If the backup or checksum is run again the following day, it can take more than 15 hours to complete.
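For reference, the two operations are of roughly this form (host names, users and exact options here are placeholders rather than our real command lines):

    # full backup, one schema at a time (illustrative options)
    mysqldump --single-transaction --routines some_schema > some_schema.sql

    # Maatkit checksum across the replicas (placeholder DSNs)
    mk-table-checksum h=master.example.com,u=checksum_user h=replica.example.com,u=checksum_user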
Everything runs in a VMware 5.1 environment. I can migrate systems from one host to another and then back to the original host, but results vary from test to test and I suspect that this relates to which tables are in use at the particular time.
The only strange status values I can find are that pages_data is actually greater than pages_total, which might explain the ridiculous pages_misc value.
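These are the counters I am referring to; I read them with a plain status query, for example:

    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%'"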
I forgot to add that we have 4 compressed tables (KEY_BLOCK_SIZE=4), and it is when these tables start to be processed by the backup that the buffer pool pages status figures go bad and performance drops to a standstill. Until these tables start to be processed, the buffer pool is full but performance is OK and pages are loading into and evicting from the buffer pool as expected.
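For completeness, this is roughly how I check which tables are the compressed ones (the filter is just illustrative):

    # list the tables that are using compression (ours are KEY_BLOCK_SIZE=4)
    mysql -e "SELECT table_schema, table_name, row_format, create_options
              FROM information_schema.tables
              WHERE create_options LIKE '%KEY_BLOCK_SIZE%'"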
Hi, I have some more ‘findings’. On a test system I have upgraded MariaDB to the latest stable version (5.5.31) and the server still has the same problem.
I restart the server and immediately start an --all-databases mysqldump. As the backup runs, it first fills pages_data up to basically equal pages_total and then processes away quite happily. As soon as a compressed table is reached, the status figures go ‘bad’. The backup still continues (very slowly) and does complete. As it happens, the very last table to be backed up on our server is a compressed table, so at completion of the backup the status figures remain ‘bad’ but static. If I then start another backup that does not include any compressed tables, the pages_data figure gradually reduces to the point where it matches pages_total, and when this happens pages_misc instantly reverts to a normal value. See the figures below. I am still testing performance/throughput at the various stages.
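For anyone interested, this is roughly how I capture those figures while the dump runs (the interval and log file name are just what I use on the test box):

    # sample the buffer pool counters once a minute while the dump is running
    while true; do
        date >> bufferpool.log
        mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%'" >> bufferpool.log
        sleep 60
    done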
Any suggestions on how to fix this would be much appreciated. Or is this a MySQL bug?
peterf, if you checksum or dump only the compressed tables after a cold start, does performance degrade in the same way as when you run those procedures on the non-compressed tables first? If the total size of the compressed tables relative to the non-compressed tables is large enough to cause buffer pool eviction during the checksum or dump, it can definitely produce a significant amount of reads and writes, and that is where you will start to notice the degradation. Also note that for compressed tables, both the compressed and uncompressed copies of a page can be held in the buffer pool.
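For example, something along these lines would exercise only the compressed tables after a cold start, and the information_schema compression tables (exposed by the InnoDB plugin/XtraDB, so they should be present in your builds) show how much compress/uncompress work is happening; the schema and table names below are placeholders:

    # dump only the compressed tables after a cold start (placeholder names)
    mysqldump --single-transaction some_schema compressed_table_1 compressed_table_2 > compressed_only.sql

    # compression activity and compressed-page memory usage
    mysql -e "SELECT * FROM information_schema.INNODB_CMP"
    mysql -e "SELECT * FROM information_schema.INNODB_CMPMEM"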
Hi, thanks for your interest. I have not tested dumping only the compressed tables. The size of the tables is always far in excess of any buffer pool size that I can set. My test system has only 4 GB of total memory and I switch the buffer pool size around from 1 GB up to 3 GB for different tests. The compressed table is 2.5 GB compressed and 20 GB uncompressed, so the buffer pool always fills up quickly and starts evicting data. I have run so many tests lately, and I get different results from time to time, that I am not sure what is happening. My most recent tests suggest that the stats above relating to pages_data and pages_misc, apart from having strange values, are in fact not impacting performance at all.
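To be clear about how I switch the pool size between tests: I just edit my.cnf and restart, since the buffer pool cannot be resized online on these versions (the path and service name below are from my test box):

    # change innodb_buffer_pool_size in /etc/my.cnf (1G / 2G / 3G), then:
    service mysql restart    # service name may be mysqld depending on the package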
About the only hard fact I can state is that when performance is bad (say when I run a checksum test over two systems), the mysqld process has one of the system's two CPUs maxed out at 100%, and even after 30 minutes or more this system does not respond with the checksum result, while the second system sends its checksum result back in just a few seconds. The system with the problem shows no other stress at all (no swapping, no disk I/O pressure, or anything else that I can find).
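For the next round of testing I plan to capture something like the following while the checksum is stuck at 100% CPU (the output file names are just placeholders):

    # what is the server actually doing while one CPU is pegged?
    mysql -e "SHOW FULL PROCESSLIST\G" > processlist.txt
    mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb_status.txt

    # one top sample of the mysqld process for comparison
    top -b -n 1 -p $(pidof mysqld) > mysqld_top.txt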
I think I will be doing a lot more testing and troubleshooting to pin down what the problem is.