MySQL 8.0.37 with 2 Replicas, one with replica lag, the other isn't

Sorry this forum must be full of this.

But I have been struggling with this for a long time now. I have one primary instance with 2 replicas, and one is having delay every morning when the load is heavy.

The Data is a mix of InnoDB and MyISAM tables each on separated disks

Here are the config files (of both replicas)

binlog_format   = MIXED
sync_binlog     = 0
max_allowed_packet      = 128M
thread_stack            = 192K
thread_cache_size       = 8

table_definition_cache  = 8192
table_open_cache        = 1500
open_files_limit        = 2500
replica_parallel_workers = 8
relay_log_recovery      = ON
relay_log_space_limit   = 50G
max_heap_table_size     = 128M
tmp_table_size          = 128M

innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2

The replicas are on 2 differents data centers, but I don’t think the hardware is the issue there. The replication was working fine before upgrading to MySQL 8.

On PMM, I noticed that the hourly written relay log is much more heavy on the laggy replica than on the functionnal one (<40G vs 8G ).

As the replicas have the same parameters for replication, I don’t understand this difference in replicated logs.

If the config files are exactly the same, and the data being replicated is the same (ie: no replication filters), then we need to examine the hardware. Please setup Percona Monitoring and Management (PMM) to monitor all 3 hosts. We need to see CPU/Disk/Memory/Network for each host in order to make a proper diagnosis.

Thank you for your response,

The Primary instance (32vCPU 128GB RAM) and the Problematic replica (16vCPU 64GB Ram) are on AMD EPYC 7571 inside VMs (in the same DC).

The Replica that doesn’t show issues is on bare metal 32vCPU 128GB RAM on AMD EPYC 7371 16-Core Processor

Here are some summary graphs and replication status on the machines :

Primary :

Replica Working correctly :

Replica with Lag :

The replication lag starts at 7:30 when the load is heavy.
and you can see that the relay logs written/h are heavy on the problematic replica

Please tell me if you need other data.

Ok. The replica which is lagging behind is 1/2 as powerful as the source. Best practice is that source and replica should have the same hardware/configuration.

Also, looking at your source graph, ‘MySQL Handlers’. You have 2 million op/s for READ_RND_NEXT which represents FULL TABLE SCAN. The next highest operations are Index Reads at 674K op/s.

You have a much bigger problem with bad queries than replication lag. Improving queries will not only reduce your near-max-cpu problem, but will help replication lag as well.