We have a S-M-M-S replication with 5.7 and we planned to migrate to 8.0.
I’ve started with a replica that is used only as a fallback and backups (for obvious reasons) and the decrease in performance was appalling. It couldn’t catch up with the master and the lag kept increasing.
Only when I set sync-binlog = 0 it caught up and runs fairly ok. It ran fine with the sync set to 1 on 5.7 and I wouldn’t trade consistency for performance if it’s possible.
Here’s the my.cnf ast it was before and after migration
Do you see something that’s obviously wrong so I can try changing that instead of sync-binlog?
I’m rather reluctant to upgrade the masters in these circumstances.
innodb_io_capacity should really never be changed from default of 200.
Why have you disabled parallel replication? Your master (now called “source”) is multi-threaded, so should your replicas. I would enable parallel replication workers at equal to 1/2 number of CPUs/cores.
Examine your replicated queries. Are they INSERTs or UPDATEs? If UPDATEs, is there a WHERE clause with the primary key? How does the disk IO look on the 8.0 replica compared to 5.7 source?
Parallel replication was commented in the 5.7 setup. I tried it at the time, did not see any performance improvement, and I think i read that it wasn’t doing much in versions below 8.0, when the whole parallelism was improved.
Uncommenting those was actually what helped plateau the lag and made it not increase continuously. So it’s enabled now.
Changing innodb_io_capacity back to 200 hasn’t improved performance.
I don’t know how relevant is looking at the IO comparing the replica and the primary, they are different machines, source is under heavy load, replica runs other services as well.
I’ll try to go through all the settings again, see if I’m missing something.
I tried it at the time, did not see any performance improvement
Can you share the configuration options you used while setting it up? Parallel replication is working well for many of the customers in 5.7.
For sure your primary and replica configuration difference is also very important parameter that may affect the lag. I usually choose pt-config-diff to verify the config differences.
I’d suggest to evaluate the effectiveness of the config you should read through this blog and see how is it performing before tweaking the configuration further.
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_transactions%' limit 10;
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction' limit 10 ;
select performance_schema.events_transactions_summary_by_thread_by_event_name.THREAD_ID AS THREAD_ID,
performance_schema.events_transactions_summary_by_thread_by_event_name.COUNT_STAR AS COUNT_STAR
from performance_schema.events_transactions_summary_by_thread_by_event_name
where performance_schema.events_transactions_summary_by_thread_by_event_name.THREAD_ID
in (select performance_schema.replication_applier_status_by_worker.THREAD_ID
from performance_schema.replication_applier_status_by_worker);
Following are the sample parameters that you might want to consider tweaking.
What baffles me is the difference in performance from sync-binlog=1 to sync-binlog=0. From the graphs of the queries, it’s severals orders of magnitude faster:
I’ve been keeping an eye on this server and noticed that Writing on the disk seems to be the bottleneck. Every time the Replica lags behind, it’s because disks are busy. They are some consumer grade SSDs in a raid 10 array. Nothing changed from before upgrade, so I imagine there’s a change in the way writing or caching or flushing is done in the new version.