Upgraded replica to 8.0 resulted in replica lag

We have a S-M-M-S replication with 5.7 and we planned to migrate to 8.0.
I’ve started with a replica that is used only as a fallback and backups (for obvious reasons) and the decrease in performance was appalling. It couldn’t catch up with the master and the lag kept increasing.

Only when I set sync-binlog = 0 it caught up and runs fairly ok. It ran fine with the sync set to 1 on 5.7 and I wouldn’t trade consistency for performance if it’s possible.

Here’s the my.cnf ast it was before and after migration

[mysqld]
sql_mode=NO_ENGINE_SUBSTITUTION
# MyISAM #
key-buffer-size                = 32M

# SAFETY #
max-allowed-packet             = 16M
max-connect-errors             = 1000000
skip-name-resolve

# DATA STORAGE #
datadir                        = /var/lib/mysql/

# BINARY LOGGING ########## modified 4 slave replication
server-id                      = 4
binlog_do_db                   = dbname
log-bin                        = /var/lib/mysql/mysql-bin
#expire-logs-days               = 14    deprecated in 8.0
binlog_expire_logs_seconds      = 1360800  # 8.0 - that's 14 days in seconds
sync-binlog                    = 0    # was 1 on 5.7
binlog_format                  = ROW
#relay_log_info_repository=TABLE       deprecated in 8.0
relay_log_recovery=ON

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
#query-cache-type               = 0       deprecated in 8.0
#query-cache-size               = 0       deprecated in 8.0
max-connections                = 2000
thread-cache-size              = 50
open-files-limit               = 10000  #max 10k in 8.0
table-definition-cache         = 1024
table-open-cache               = 2048

# INNODB #
innodb-flush-method            = O_DIRECT
#innodb-log-files-in-group      = 2       deprecated in 8.0
#innodb-log-file-size           = 100M    deprecated in 8.0
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 40G
innodb_buffer_pool_instances   = 3
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity             = 1000

# Parallel Replication #
#slave_parallel_type=LOGICAL_CLOCK
#slave_parallel_workers=20
#slave_preserve_commit_order=1


# LOGGING #
log-error                      = /var/lib/mysql/mysql-error.log
slow-query-log                 = 1
slow-query-log-file            = /var/log/mysql/mysql-slow.log

Do you see something that’s obviously wrong so I can try changing that instead of sync-binlog?
I’m rather reluctant to upgrade the masters in these circumstances.

I see a couple things, but nothing night/day.

innodb_io_capacity should really never be changed from default of 200.

Why have you disabled parallel replication? Your master (now called “source”) is multi-threaded, so should your replicas. I would enable parallel replication workers at equal to 1/2 number of CPUs/cores.

Examine your replicated queries. Are they INSERTs or UPDATEs? If UPDATEs, is there a WHERE clause with the primary key? How does the disk IO look on the 8.0 replica compared to 5.7 source?

Hi Matthew, thank you for your response.

Parallel replication was commented in the 5.7 setup. I tried it at the time, did not see any performance improvement, and I think i read that it wasn’t doing much in versions below 8.0, when the whole parallelism was improved.
Uncommenting those was actually what helped plateau the lag and made it not increase continuously. So it’s enabled now.

Changing innodb_io_capacity back to 200 hasn’t improved performance.

I don’t know how relevant is looking at the IO comparing the replica and the primary, they are different machines, source is under heavy load, replica runs other services as well.
I’ll try to go through all the settings again, see if I’m missing something.

Hi @Add

I tried it at the time, did not see any performance improvement
Can you share the configuration options you used while setting it up? Parallel replication is working well for many of the customers in 5.7.

For sure your primary and replica configuration difference is also very important parameter that may affect the lag. I usually choose pt-config-diff to verify the config differences.

I’d suggest to evaluate the effectiveness of the config you should read through this blog and see how is it performing before tweaking the configuration further.

UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_transactions%' limit 10;
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction' limit 10 ;
select performance_schema.events_transactions_summary_by_thread_by_event_name.THREAD_ID AS THREAD_ID,
performance_schema.events_transactions_summary_by_thread_by_event_name.COUNT_STAR AS COUNT_STAR
from performance_schema.events_transactions_summary_by_thread_by_event_name
where performance_schema.events_transactions_summary_by_thread_by_event_name.THREAD_ID
in (select performance_schema.replication_applier_status_by_worker.THREAD_ID
  from performance_schema.replication_applier_status_by_worker);

Following are the sample parameters that you might want to consider tweaking.

SELECT @@slave_parallel_workers, @@slave_exec_mode, @@slave_parallel_type, @@slave_preserve_commit_order, @@slave_rows_search_algorithms,@@binlog_transaction_dependency_tracking\G

Thanks,
K

These are the settings on the Source (running fine on 5.7):

[mysqld]
sql_mode=NO_ENGINE_SUBSTITUTION
explicit_defaults_for_timestamp=OFF


# MyISAM #
key-buffer-size                = 32M

# SAFETY #
max-allowed-packet             = 16M
max-connect-errors             = 1000000
skip-name-resolve

# DATA STORAGE #
 datadir                        = /var/lib/mysql/

# BINARY LOGGING ########## modified 4 master-master replication
server-id                      = 11
binlog_do_db                   = dbname
log-bin                        = /var/lib/mysql/mysql-bin
log-slave-updates
expire-logs-days               = 7
sync-binlog                    = 1
binlog_format                  = ROW
auto_increment_increment       = 2
auto_increment_offset          = 1
relay_log_info_repository=TABLE
relay_log_recovery=ON

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
query-cache-type               = 0
query-cache-size               = 0
max-connections                = 2000
thread-cache-size              = 50
open-files-limit               = 65535
table-definition-cache         = 1024
table-open-cache               = 2048

# INNODB #
innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 100M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 50G
innodb_buffer_pool_instances   = 3
innodb_io_capacity             = 1000

# LOGGING #
log-error                      = /var/lib/mysql/mysql-error.log
slow-query-log                 = 1
slow-query-log-file            = /var/log/mysql/mysql-slow.log
long_query_time                = 4

What baffles me is the difference in performance from sync-binlog=1 to sync-binlog=0. From the graphs of the queries, it’s severals orders of magnitude faster:

I understand why there would be a difference, but I don’t understand why it’s so massive.

I’ve been keeping an eye on this server and noticed that Writing on the disk seems to be the bottleneck. Every time the Replica lags behind, it’s because disks are busy. They are some consumer grade SSDs in a raid 10 array. Nothing changed from before upgrade, so I imagine there’s a change in the way writing or caching or flushing is done in the new version.