Issues with synchronization after upgrading from 5.7 to 8.0

Hi there,

We just upgraded a 3-node cluster from 5.7 to 8.0 in a write-intensive environment. Right after the upgrade, we noticed that disk IO went from about 40% to 100% all the time.

It seems to be generated by InnoDB, but the total amount of data written is the same before and after the disk usage increase.


I’m not sure if this is an expected change from 5.7 to 8.0, but it seems to be triggering an issue that was not present in 5.7. A few times a day the cluster pauses due to flow control from one of the servers (a random one) and there is no resume message; the cluster gets stuck in paused status.

As soon as we stop mysql on the server that originated the pause, the other 2 servers catch up in just a second. We then restart mysql on the offending server, it does IST, and everything works again for some time until we get stuck again (disk usage is at 100% the whole time).

I’m not a MySQL expert, but I wonder if that 100% disk usage could be the reason for the issue (and what could be causing such high disk usage). The cluster is receiving the exact same workload as before the upgrade.

All 3 servers run Ubuntu 20.04 with 32 cores, 192 GB of RAM, and 3 TB SSD disks.
All 3 servers have similar config files:

server-id=1
datadir=/var/lib/percona-xtradb-cluster
socket=/var/run/mysqld/mysqld.sock
log-error=/var/log/mysql/error.log
pid-file=/var/run/mysqld/mysqld.pid
log-bin=galera1-bin
log_slave_updates=1
expire_logs_days=7
symbolic-links=0
wsrep_provider=/usr/lib/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://galera1,galera2,galera3
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_node_address=192.168.212.162
wsrep_cluster_name=gorilla-cluster
wsrep_node_name=galera1
pxc_strict_mode=ENFORCING
wsrep_sst_method=xtrabackup-v2
innodb_flush_log_at_trx_commit=0
innodb_buffer_pool_size=52224M
bind-address=192.168.212.162
innodb_lock_wait_timeout = 120
character_set_server=utf8mb4
collation-server = utf8mb4_unicode_ci
sql_mode=NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
gtid_mode=ON
enforce_gtid_consistency
log_output=file
slow_query_log=ON
long_query_time=0
log_slow_rate_limit=50
log_slow_rate_type=query
log_slow_verbosity=full
log_slow_admin_statements=ON
log_slow_slave_statements=ON
slow_query_log_always_write_time=1
slow_query_log_use_global_control=all
innodb_monitor_enable=all
userstat=1
max_allowed_packet   = 64M
innodb_log_file_size = 10G
binlog_space_limit = 70G
wsrep_retry_autocommit=4
skip_name_resolve=ON
thread_pool_size=24
join_buffer_size=8M
wsrep_notify_cmd=/var/www/gorilla/server-scripts/galeracheck.py
innodb_flush_method = O_DIRECT
pxc-encrypt-cluster-traffic=ON
wait_timeout=10800
wsrep_log_conflicts
wsrep_slave_threads=80
wsrep_provider_options="gcache.size=20G;gcs.fc_limit=400;gcs.fc_factor=0.8;gcache.recover=yes"
max_connections=3000
innodb_buffer_pool_instances=51
open_files_limit=20480
innodb_open_files=10240
table_open_cache=10240

Again, this is the config migrated from 5.7, and it was working just fine. Any pointers in the right direction would be greatly appreciated.

Sorry for my English :slight_smile:
Sergio

Hi Sergio,
Since I do not have much detail about your servers or current workload, I can’t give you the exact reason.
But there are gazillions of changes between MySQL 5.7 and 8.0, so one possible reason for the cluster getting stuck is query performance.
I would start by optimizing queries since some of them may be slower after the upgrade due to changes in the optimizer.
The Handler_read_* status counters will give you insight into whether your queries now need to read more rows to reach the same data.
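For example, something like this (a generic check, comparing the counters over a representative workload window) will show whether queries are scanning more rows than before:

SHOW GLOBAL STATUS LIKE 'Handler_read%';
-- Handler_read_rnd_next or Handler_read_next growing much faster than on 5.7
-- usually points to full table scans or wider index range scans.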

Thanks for your response. In fact, the read handlers went down after the upgrade. What has gone up about 4x is the number of data writes (count) for the same total amount of data (MB). I wonder if it could be related to the doublewrite buffer.

You have a 51 GB buffer pool but 192 GB of RAM? What are you doing with the extra RAM? Also, 51 buffer pool instances is well outside best practices. I would lower that to 16.
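As a rough sketch (the sizes here are just illustrative, tune them to your workload), on 8.0 you can resize the buffer pool online and persist the instance count for the next restart:

SET PERSIST innodb_buffer_pool_size = 128 * 1024 * 1024 * 1024;  -- online resize; rounded to a multiple of chunk size * instances
SET PERSIST_ONLY innodb_buffer_pool_instances = 16;              -- not dynamic, takes effect after a restart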

The doublewrite buffer is there for InnoDB safety; disable it at your own risk of data loss. But if you lower the number of buffer pool instances, you will also decrease the number of doublewrite buffers. :+1:
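If you want to see how much the doublewrite mechanism itself is doing, a generic check (the innodb_doublewrite_files variable only exists from 8.0.20 on, where it defaults to 2 files per buffer pool instance):

SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite%';
SHOW GLOBAL STATUS LIKE 'Innodb_dblwr%';   -- Innodb_dblwr_pages_written vs Innodb_dblwr_writes shows how well writes are batched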

You also have just about every logging option enabled: slow log full, userstat, innodb monitor, etc. That’s a lot of additional disk writing. Turn all those off and see what happens to your disk IO.
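For example (these are all dynamic; you would also want to remove the corresponding lines from my.cnf so they don’t come back on restart):

SET GLOBAL slow_query_log = OFF;             -- or raise long_query_time instead of logging every query
SET GLOBAL userstat = OFF;                   -- Percona user/table statistics collection
SET GLOBAL innodb_monitor_disable = 'all';   -- turns off the counters enabled by innodb_monitor_enable=all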

Use PMM and look for the Performance Schema Wait graphs to see specifically what subsystem is using all the disk IO.
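If you want to query it directly as well, something along these lines (timers are in picoseconds) shows which InnoDB file types are taking the write time:

SELECT event_name,
       count_write,
       ROUND(sum_number_of_bytes_write/1024/1024) AS mb_written,
       ROUND(sum_timer_write/1e12, 1) AS write_seconds
FROM performance_schema.file_summary_by_event_name
WHERE event_name LIKE 'wait/io/file/innodb/%'
ORDER BY sum_timer_write DESC;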

Hi, thanks for the response. After applying some of your recommendations, we don’t get the servers locking up anymore. Disk I/O is still at 100% (maybe because of the increased performance after the upgrade), but the cluster doesn’t enter the paused state anymore.