MariaDB 10.3 performance degrades over time but returns to normal after a restart

Hi everyone,

I have a MariaDB server with this setup:

Master
Version: 10.3.39
Memory: 64GB
CPU: 16vCPU
Disk: Network attached 500GB

I also have two slaves connected to this master, with lower specs:

Slave
Version: 10.3.39
Memory: 16GB
CPU: 8vCPU
Disk: Network attached 500GB

Slave-2
Version: 10.3.39
Memory: 8GB
CPU: 4vCPU
Disk: Network attached 500GB

However, I found an issue with my master. We have a very important process that is single-threaded, and whenever database performance degrades, the whole process is blocked.

As you can see in the latency graph of our process, the red box marks the times when we restarted the MariaDB service, either by rebooting the server or by restarting the service.

However, this has become a recurring issue, and we are stuck on how to debug it: the latency has spiked again only a few days after the last restart.

This is our .cnf configuration, including the settings that we adjusted on the fly:

#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql

# this is read by the standalone daemon and embedded servers
[server]

# this is only for the mysqld standalone daemon
[mysqld]

#
# * Basic Settings
#
user                    = mysql
pid-file                = /run/mysqld/mysqld.pid
socket                  = /run/mysqld/mysqld.sock
#port                   = 3306
basedir                 = /usr
#datadir                 = /var/lib/mysql
datadir                 = /mnt/data/mysql
tmpdir                  = /tmp
lc-messages-dir         = /usr/share/mysql
performance-schema

sql_mode                = ""
#skip-external-locking

# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address            = <REDACTED>

#
# * Fine Tuning
#
#key_buffer_size        = 16M
max_allowed_packet     = 16M
#thread_stack           = 192K
#thread_cache_size      = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
#myisam_recover_options = BACKUP
#max_connections        = 100
#table_cache            = 64
#thread_concurrency     = 10

#
# * Query Cache Configuration
#
#query_cache_limit      = 1M
query_cache_size        = 32M
key_buffer_size        = 16M
max_allowed_packet     = 100M
thread_stack           = 192K
thread_cache_size      = 8
max_connections        = 5000
open_files_limit        = 100000
table_definition_cache  = 10000
table_open_cache        = 10000
tmp_table_size = 32M
max_heap_table_size = 32M

#
# * Logging and Replication
#
# Both location gets rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file       = /var/log/mysql/mysql.log
#general_log            = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Enable the slow query log to see queries with especially long duration
#slow_query_log_file    = /var/log/mysql/mariadb-slow.log
#long_query_time        = 10
#log_slow_rate_limit    = 1000
#log_slow_verbosity     = query_plan
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
#       other settings you may need to change.
#server-id              = 1
#log_bin                = /var/log/mysql/mysql-bin.log
#expire_logs_days        = 10
#max_binlog_size        = 100M
#binlog_do_db           = include_database_name
#binlog_ignore_db       = exclude_database_name

#
# * Security Features
#
# Read the manual, too, if you want chroot!
#chroot = /var/lib/mysql/
#
# For generating SSL certificates you can use for example the GUI tool "tinyca".
#
#ssl-ca = /etc/mysql/cacert.pem
#ssl-cert = /etc/mysql/server-cert.pem
#ssl-key = /etc/mysql/server-key.pem
#
# Accept only connections using the latest and most secure TLS protocol version.
# ..when MariaDB is compiled with OpenSSL:
#ssl-cipher = TLSv1.2
# ..when MariaDB is compiled with YaSSL (default in Debian):
#ssl = on

#
# * Character sets
#
# MySQL/MariaDB default is Latin1, but in Debian we rather default to the full
# utf8 4-byte character set. See also client.cnf
#
character-set-server  = utf8mb4
collation-server      = utf8mb4_general_ci

#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!

#
# * Unix socket authentication plugin is built-in since 10.0.22-6
#
# Needed so the root database user can authenticate without a password but
# only when running as the unix root user.
#
# Also available for other users if required.
# See https://mariadb.com/kb/en/unix_socket-authentication-plugin/

# this is only for embedded server
[embedded]

# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]
server_id              = 1
log-bin                = /var/log/mysql/log-bin/mysql-bin.log
expire_logs_days       = 7
binlog-format          = ROW
sync_binlog            = 1

gtid-domain-id         = 1
gtid_strict_mode       = 1


slave_parallel_threads = 4

# This group is only read by MariaDB-10.3 servers.
# If you use the same .cnf file for MariaDB of different versions,
# use this group for options that older servers don't understand
[mariadb-10.3]
thread_handling = pool-of-threads
thread_pool_size = 128
thread_pool_max_threads = 65536
thread_pool_stall_limit = 30

innodb_read_io_threads = 100
innodb_write_io_threads = 100
innodb_thread_concurrency = 100
innodb_doublewrite = 0
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 128M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 48G
innodb_buffer_pool_chunk_size = 375M
innodb_buffer_pool_instances = 64

slow_query_log = 1
long_query_time = 0.1
slow_query_log_file = /var/log/mysql/slow-query.log

use_stat_tables = PREFERABLY
wait_timeout = 300
innodb_max_dirty_pages_pct = 90
innodb_max_dirty_pages_pct_lwm = 10
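
One thing that may be worth checking about the buffer pool settings above: the InnoDB documentation describes innodb_buffer_pool_size as being adjusted up to a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances. The sketch below only does that arithmetic for the values in our config. Whether 10.3.39 applies exactly this rounding is an assumption we have not verified on the server, and the function name is made up for illustration.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def aligned_buffer_pool_size(requested, chunk_size, instances):
    """Round `requested` up to a multiple of chunk_size * instances.

    ASSUMPTION: this mirrors the adjustment rule described in the InnoDB
    documentation; it is not read from the running server.
    """
    unit = chunk_size * instances
    # Integer ceiling division, then scale back up to bytes.
    return ((requested + unit - 1) // unit) * unit


# Values taken from the [mariadb-10.3] section of the config above.
requested = 48 * GIB      # innodb_buffer_pool_size = 48G
chunk_size = 375 * MIB    # innodb_buffer_pool_chunk_size = 375M
instances = 64            # innodb_buffer_pool_instances = 64

effective = aligned_buffer_pool_size(requested, chunk_size, instances)
print(f"alignment unit : {chunk_size * instances // MIB} MiB")
print(f"requested size : {requested // MIB} MiB")
print(f"aligned size   : {effective // MIB} MiB ({effective / GIB:.2f} GiB)")
```

If the assumption holds, the aligned size comes out at 72000 MiB (about 70.3 GiB) rather than 48 GiB, which would be more than the 64GB of memory on the master. Running SELECT @@innodb_buffer_pool_size on the server would show the value actually in use.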

This is a screenshot of the PMM metrics from the time when the latency suddenly increased:

InnoDB stats

We also noticed several things when we ran SHOW ENGINE INNODB STATUS:

  • High number of idle transactions:
      • A large number of transactions (identified by ---TRANSACTION entries) are listed as “not started,” indicating they are idle.
      • Each transaction has 0 lock struct(s), which means they are not holding any locks, and heap size 1128, which is the memory allocation for transaction metadata.
  • Buffer pool and memory:
      • Buffer pool usage seems efficient, with a high buffer pool hit rate of 999/1000, indicating that most data requests are being served from memory rather than disk.
      • However, there is a significant number of free buffers in the buffer pools, indicating that the pool size might be larger than necessary for the workload.
  • I/O threads and pending operations:
      • The I/O threads are predominantly in a waiting state, suggesting that I/O operations are not the bottleneck.
      • No pending I/O operations were observed, indicating efficient disk I/O handling.
  • Log and checkpoint activity:
      • Log sequence numbers and flushed positions are up to date, with no pending log flushes or checkpoint writes, indicating efficient log management.
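
To put a number on the idle transactions instead of eyeballing the output, something like the following sketch could be run against a saved copy of the SHOW ENGINE INNODB STATUS text. The sample text and the function name are hypothetical; the header format ("---TRANSACTION <id>, <state>") is assumed from the entries described above.

```python
import re
from collections import Counter

# Matches header lines such as "---TRANSACTION 4212..., not started"
# or "---TRANSACTION 12345, ACTIVE 3 sec".
TRX_HEADER = re.compile(r"^---TRANSACTION\s+[^,]+,\s+(.*)$")


def count_transaction_states(status_text):
    """Count transaction headers by state in InnoDB status output."""
    counts = Counter()
    for line in status_text.splitlines():
        match = TRX_HEADER.match(line.strip())
        if not match:
            continue  # not a transaction header line
        state = match.group(1)
        if state.startswith("not started"):
            counts["not started"] += 1
        elif state.startswith("ACTIVE"):
            counts["active"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical excerpt, made up for illustration (not from our server).
SAMPLE = """\
---TRANSACTION 421234567890, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 421234567891, not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION 987654, ACTIVE 3 sec
2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
"""

print(dict(count_transaction_states(SAMPLE)))
```

Taking such a count shortly after a restart and again when the latency spikes would show whether the number of "not started" entries actually grows along with the degradation.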

We have also tried a few things:

  1. Increased the buffer pool from 24GB to 48GB (the performance degradation still occurs)
  2. Reduced wait_timeout (hoping to reduce the number of "not started" transactions shown in the InnoDB status output)
  3. Changed innodb_max_dirty_pages_pct and innodb_max_dirty_pages_pct_lwm to 90 and 10 respectively, from 0 (not set)
  4. Set use_stat_tables to PREFERABLY

Next, we plan to:

  1. Upgrade MariaDB from 10.3 to 11.4, because several threads report a problem with the SQL optimizer, for example "mysql - MariaDB 10.4 random performance degradation - Stack Overflow"
  2. Run OPTIMIZE TABLE on the tables that show higher latency and affect our critical process

Thanks!

This is the graph of the latency of our process:

These are the VM stats:

VM stats