Replication lag on MySQL server after increasing the number of inserts on the master

I’ve ran recently to a problem. We have a MySQL 5.7 Innodb replica with RBR replication (8 slave parallel workers). Now, until this Monday, our master executed 2500 inserts per second (in average), and the replica seemed to be fine with that. But recently the number of inserts went to 2800 inserts per second, and the replica does not seem to catch on (at evenings maybe). I checked the slave status. It seems like the IO thread seems to work just fine (there is no problem with getting events from the master), but the SQL threads don’t run that fast. So the question is - how can I improve that - increase of parallel workers does not seem to help at all.

By the way, here is the result of sql show status:

I have a MySQL 5.7 innodb replica with row based replication. I’ve noticed something curious. The master has about 3000 inserts per second, and the replica seems to catch up fine with that. But if I run a long-time query (let’s say for a minute or so) scanning the big table with 300 million rows, the replica starts to lag. I’ve checked the slave status, the IO thread seems to be reading from master just fine, but the slave sql thread seems not to be doing so well. Not sure why is that. Now zabbix shows me, that my select query seems to be causing high disk utilization (since it needs parts of the table that are on disk), I guess that could slow down the reading of relay log file or applying the transactions. Not sure what I have to tune here in order to get rid of this - how come a single select query causes such a heavy replication log (the number of inserts on the replica falls down to 1500 inserts per second instead of 3000).

I finally found the solution myself. First, I increased the number of slave_parallel_workers to 4, but that gave me nothing (even with LOGICAL_CLOCK), because the transactions were highly dependent. However, it turned out, that after I increased on master binlog_group_commit_sync_delay to 10000 (that is, 10 milliseconds), the lag disappeared. This setting is very important, since it is the setting that actually allows the replication servers to execute something in parallel.

Hi there, thank your for posting your solution it may well help someone else. Sorry we didn’t get to you more quickly but really pleased that you found an answer. I’ll highlight your solution to the team here too. :slight_smile:

I actually suggest two things. One - someone from the mysql community should really fix the documentation for slave_parallel_workers giving at least some hint for binlog_group_commit_sync_delay. Second - I would like to write something in the blog about our experience, since it was a really curious one.

Well I would love that, it could be very suitable for our new Community Blog https://www.percona.com/community-blog/
Could you email me? lorraine.pocklington@percona.com reaches me. Thank you!!