Replication lag after upgrade to Percona Server 8.0

Since we upgraded from 5.7 to 8.0 (8.0.30-22), the servers don’t appear to be performing as well with replication. They frequently show 1 second of replication lag, sometimes shooting up to 30 seconds. On the 5.7 servers we were running a few days ago, replication lag would rarely go above 1 second. CPU usage is also generally higher than before. Our configuration is almost identical to what it was on 5.7, so I am unsure what to check here.

Servers:
1 x write master and 5 x read slaves, all on Rackspace 15GB IO servers, with the database stored on a 150GB RAID 10-protected SSD. The database is around 50GB.

my.cnf settings (slave):

user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /data/var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
sql-mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION'

bind-address = x.x.x.x
key_buffer_size = 16M
max_allowed_packet = 16M
lower_case_table_names = 1
thread_stack = 192K
thread_cache_size = 8

max_connections = 525
max-connect-errors = 1000000
skip-name-resolve
log_error = /var/log/mysql/error.log

server-id = 1670426130
disable_log_bin

innodb_file_per_table
innodb_buffer_pool_size = 8G
innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
innodb_flush_method = O_DIRECT
slave_skip_errors = 1062
slave_compressed_protocol = 1
innodb_io_capacity_max=4000
innodb_io_capacity=400
innodb_log_file_size = 128M
slave_type_conversions = ALL_NON_LOSSY

relay_log = /var/lib/mysql-relay-logs/mysql-relay-bin
relay_log_index = /var/lib/mysql-relay-logs/mysql-relay-index

iostat output (xvde is the data drive):

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
xvda              2.39        43.10        12.91         0.00    3611678    1081510          0
xvdd              0.20         8.58         0.00         0.00     718620          0          0
xvde            183.25       123.03      3116.60         0.00   10309873  261177252          0

show slave status output:

*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: x.x.x.x
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.013007
          Read_Master_Log_Pos: 5121445
               Relay_Log_File: mysql-relay-bin.025340
                Relay_Log_Pos: 218802
        Relay_Master_Log_File: mysql-bin.013007
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 218586
              Relay_Log_Space: 5121918
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 12
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1670425806
                  Master_UUID: 9ac7164c-7625-11ed-8536-bc764e080bec
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
       Master_public_key_path: 
        Get_master_public_key: 0
            Network_Namespace: 
1 row in set, 1 warning (0.01 sec)

Any help would be appreciated.

Sometimes Slave_SQL_Running_State shows “Waiting for replica workers to process their queues”


First of all, looking at your slave's my.cnf, you should remove slave_skip_errors = 1062.

This is not the cause of the lag by itself, but it can make your database inconsistent. You should be aware when you hit a replication error so that you can fix it; if you keep skipping errors without knowing, your slave data may not stay consistent. It is also possible that the replica spends time skipping the error, adding a few seconds of lag.

The other point is that in MySQL 5.7 replication was not multi-threaded by default, while from 8.0.27 replica_parallel_workers defaults to 4. So your replication is now multi-threaded, and the multiple applier threads may be running into contention or deadlock-like situations with each other.
If replication was fine without multi-threading on MySQL 5.7, I would suggest switching back to single-threaded replication.
Or, if you need to keep multi-threaded replication, you can set the variables below as well so it behaves properly (you might need to stop replication before making these changes):
replica_preserve_commit_order=ON
replica_parallel_type=LOGICAL_CLOCK
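
A minimal sketch of applying those two settings on the replica (both can only be changed while the applier/SQL thread is stopped):

stop replica sql_thread;
set global replica_preserve_commit_order = ON;
set global replica_parallel_type = 'LOGICAL_CLOCK';
start replica sql_thread;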

On the other hand, if you want to switch the replica to single-threaded, you can stop replication and set it like below:
stop replica;
start replica until SQL_AFTER_MTS_GAPS;
stop replica;
set global replica_parallel_workers=1;
start replica;

Make sure that whatever variables you change dynamically, you also add to my.cnf so the changes persist across restarts.
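
For example, if you stay single-threaded, the replica's my.cnf could carry (a sketch, placed with your other replication settings):

replica_parallel_workers = 1

or, if you keep multi-threading:

replica_parallel_workers = 4
replica_preserve_commit_order = ON
replica_parallel_type = LOGICAL_CLOCK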


Thanks. I’ve changed the slaves to single-threaded, though it doesn’t seem to make a massive difference. Perhaps slightly better.

I’m monitoring the databases with New Relic, and the most obvious difference compared to last week on 5.7 is that Handler write operations have gone way up. Is this expected on MySQL 8?


Hi Tom,

Default values change between major versions. Even if you reuse the same configuration as in 5.7, some other values might have changed in 8.0.

One example is internal_tmp_mem_storage_engine. This variable did not exist in 5.7, and in 8.0 it defaults to TempTable (a new storage engine). To revert to the 5.7 behaviour you should execute:
SET GLOBAL internal_tmp_mem_storage_engine = MEMORY;
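
To check what the server is currently using, and to keep the change after a restart, something like this should work (the my.cnf line is a sketch, placed alongside your other settings):

show global variables like 'internal_tmp_mem_storage_engine';

and in my.cnf:

internal_tmp_mem_storage_engine = MEMORY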

Apart from checking and comparing the 8.0 configuration, can you also check that all OS-related configuration stays the same and the hardware is similar or better? For example, certain disk schedulers can be faster than others, as shown in Linux schedulers in tpcc like benchmark - Percona Database Performance Blog. Is the server swapping, or is any subsystem saturated?

Lastly, replication lag can have multiple causes, e.g. hardware saturation, bugs, changes in configuration, misconfiguration, workload, contention, etc. Maybe you can get similar performance by tweaking some variables, but getting to the root of the replication lag might require extensive troubleshooting, so please start by confirming the basics as mentioned above.

Regards


Can you try disabling innodb_log_writer_threads?
It's a dynamic variable introduced in 8.0.22.
See if that helps:

set global innodb_log_writer_threads = OFF;
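
If it does help, you can also keep it in my.cnf so it survives a restart, e.g. (a sketch):

innodb_log_writer_threads = OFF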

I agree with Carlos as well. There can be many reasons.


innodb_log_file_size = 128M

This is a very small redo log. I would have this at least at 1G. I also suggest you hook up PMM and look at the wait/io graphs, which show you which files have the most locking/wait time. That will help pinpoint where the slowness is happening.
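
If you can't set up PMM right away, a rough ad-hoc equivalent (assuming the default performance_schema file instrumentation is enabled) is something like:

select file_name, event_name, count_star,
       round(sum_timer_wait/1000000000000, 2) as total_wait_seconds
from performance_schema.file_summary_by_instance
order by sum_timer_wait desc
limit 10;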


Thanks. I’ve made a few of the suggested changes (I think the single threaded replication helped a bit) though it isn’t quite where I would like it. We have moved from Ubuntu 18.04 to Ubuntu 20.04, but the systems were configured basically the same. I will suggest to the team we try PMM.


Hey Tom,
Some time ago I came across a similar situation with 8.0.30. I noticed that there is a new variable introduced in 8.0.30, innodb_redo_log_capacity, and its default value is 100M, which is well under half of the redo log size you effectively had before.
Previous versions used innodb_log_file_size and innodb_log_files_in_group, which in your configuration is approximately 128M * 2 = 256M. So consider increasing innodb_redo_log_capacity, as the old variables are deprecated as of 8.0.30.

And yes, the same was suggested by Mathew as well. Increasing it to a reasonably large value will make your writes behave better.

Set innodb_redo_log_capacity to at least around 1G. InnoDB splits that capacity across 32 redo log files, so if you set 1G, each redo log file will be about 1G/32 = 32M.
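
For example (a sketch; the variable is dynamic so it can be changed online, and the my.cnf line keeps it across restarts):

set global innodb_redo_log_capacity = 1073741824;

and in my.cnf:

innodb_redo_log_capacity = 1G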
PMM again is the best tool for performance monitoring, and you should definitely consider it for monitoring MySQL. Let us know how it goes.


So, based on what I’ve found from PMM and New Relic, it seems that a user is using some of our APIs in a bad way, and the reads from them are maxing out the CPU on the slaves, which is slowing down the replication. I just need to figure out why the reads are performing worse than they were in MySQL 5.7.


@Tom-sl In a general sense, yes, single query execution is slower in 8.0 than 5.7.
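
If you want to pin down which reads regressed, the statement digest summary gives a quick ranking of the heaviest queries (a sketch using the standard performance_schema table):

select schema_name, digest_text, count_star,
       round(sum_timer_wait/1000000000000, 2) as total_exec_seconds
from performance_schema.events_statements_summary_by_digest
order by sum_timer_wait desc
limit 10;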