Since the MySQL 8 upgrade the replica constantly falls behind and eventually times out, leading to a crash

We have a MySQL 8.0.33-25.1 Percona wsrep cluster with 6 nodes. The Percona cluster itself is stable and works as expected.

One of the nodes is also a binlog-enabled replication primary.
On the other side is a single replica.
These are AWS instances of comparable size; the replica has more provisioned IOPS than the primary, due to the nature of its workload.

This same setup was used on MySQL 5.7, which we migrated away from a few months ago.

Since the migration the replica has been behaving very strangely. I understand that there are probably some under-the-hood changes in MySQL replication, but what we are experiencing is not stable at all.
The workload on that replica is mostly reads. The replica also hosts databases that are not replicated and do receive writes, but as far as the replicated databases are concerned, the replica only performs reads.

Here are 2 example workflows that are running on the replica:

  • DB1 is replicated from the primary and the replica reads data from there, creates “snapshots” of the data and writes to DB2 on the replica (which is not replicated in any server).
  • Tableau reads data from the replica sequentially once per day on a schedule; that data is then used for reports throughout the day.

There are other workloads of smaller scale but of the same type that happen on that replica.

The problem we have is that during any significant read query, replication pauses and waits for the read query to finish before continuing. This often results in timeouts, as those read queries can sometimes run for many hours, producing the following error:

2024-03-21T09:49:22.030113Z 2386083 [Warning] [MY-010584] [Repl] Replica SQL for channel '': Worker 1 failed executing transaction 'ANONYMOUS' at source log mysql-bin.000639, end_log_pos 853032011; Could not execute Update_rows event on table DB1.table1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's source log mysql-bin.000639, end_log_pos 853032011, Error_code: MY-001205

We can increase the timeout, but that's not what we are after. Prior to the migration, the 5.7 replica kept up with replication throughout the day. During these past 2 months we have tried tweaking various variables in an effort to get back to the same state as on 5.7, but with no success. These are the variables we are currently using that differ from our previous defaults:

replica_preserve_commit_order = OFF
replica_parallel_type         = DATABASE
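
For reference, the values actually in effect on the replica can be confirmed at runtime (a sketch; run on the replica):

```sql
-- Confirm the applier settings currently in effect on the replica
SHOW VARIABLES LIKE 'replica_parallel%';
SHOW VARIABLES LIKE 'replica_preserve_commit_order';
```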

And this is our default cnf (some values are edited out):

# Template my.cnf for PXC
# Edit to your requirements.

open-files-limit                = 100000
table_open_cache                = 20000
innodb_open_files               = 20000

character-set-server            = utf8
collation-server                = utf8_unicode_ci

datadir                         = XXX
long_query_time                 = 5
slow_query_log                  = 1
slow_query_log_file             = XXX
log_bin                         = mysql-bin
log-error                       = XXX
pid-file                        = XXX
default_storage_engine          = InnoDB
pxc_strict_mode                 = ENFORCING
early-plugin-load               =
keyring-file-data               = XXX
pxc-encrypt-cluster-traffic     = ON
socket                          = XXX

server-id                       = 9
binlog-ignore-db                = information_schema
binlog-ignore-db                = mysql
binlog-ignore-db                = performance_schema
binlog-ignore-db                = XXX
binlog_format                   = ROW
log_slave_updates               = ON

binlog_expire_logs_seconds      = 604800
ft_min_word_len                 = 1
ft_stopword_file                = ''
join_buffer_size                = 256K
key_buffer_size                 = 16M
lc-messages-dir                 = /usr/share/mysql
lower_case_table_names          = 0
max_allowed_packet              = 512M
max_connections                 = 400
max_heap_table_size             = 16M
memlock                         = 0
read_buffer_size                = 256K
read_rnd_buffer_size            = 512K
replica_preserve_commit_order   = OFF
replica_parallel_type           = DATABASE
replica_parallel_workers        = 1
sort_buffer_size                = 256K
sysdate_is_now                  = 1
table_definition_cache          = 20000
thread_cache_size               = 20
thread_stack                    = 256K
tmp_table_size                  = 16M
tmpdir                          = XXX
net_read_timeout                = 60

# 5.6 backwards compatibility
explicit_defaults_for_timestamp = 1

# INNODB OPTIONS - You may want to tune the below depending on number of cores and disk sub
innodb_autoinc_lock_mode        = 1
innodb_adaptive_hash_index      = OFF
innodb_buffer_pool_instances    = 8
innodb_buffer_pool_size         = 12G
innodb_doublewrite              = 1
innodb_file_per_table           = 1
innodb_flush_log_at_trx_commit  = 3
innodb_flush_method             = O_DIRECT
innodb_io_capacity              = 200
innodb_log_buffer_size          = 8M
innodb_log_files_in_group       = 2
innodb_log_file_size            = 128M
innodb_read_io_threads          = 4
innodb_thread_concurrency       = 0
innodb_write_io_threads         = 4

# avoid statistics update when doing e.g show tables
innodb_ft_min_token_size        = 0
innodb_stats_on_metadata        = 0

pid-file                        = XXX
malloc-lib                      = /usr/lib64/

I am assuming that we are missing a key piece here, but so far we have been unable to find it. As a last resort we are considering changing the default transaction isolation level from REPEATABLE READ to READ COMMITTED (or READ UNCOMMITTED), which may help with the locking, but that will require significantly more research into the implications for our use cases, so we would like to avoid it if at all possible.

Any suggestions are welcome.

`replica_parallel_type = DATABASE` is deprecated. You should switch to LOGICAL_CLOCK and increase the number of replica worker threads. This also means you have to re-enable replica_preserve_commit_order.
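
A minimal sketch of that suggestion (the worker count of 4 is an example value, not a recommendation; tune it to your workload):

```sql
-- The SQL thread must be stopped before changing the applier settings
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type         = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers      = 4;   -- example value; tune per workload
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA SQL_THREAD;
```

Mirror the same values in my.cnf so they survive a restart.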

`innodb_flush_log_at_trx_commit = 3` is an invalid value. I'm surprised MySQL even starts. Change this to 2.

ROW-based replication should not impose any locks on replicas, unless FOREIGN KEYs are involved. Do you have any of those?

It would be helpful to understand the locks. Find a SELECT query that runs on the replica and reproduces the issue. Run the query in session #1. While it is blocked, in session #2 select from `sys.innodb_lock_waits` and see which query is blocking. This should help determine what is happening.
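
For example, something along these lines in session #2 (column selection is illustrative; the view exposes more):

```sql
-- Inspect current lock waits while session #1 is blocked
SELECT waiting_pid, waiting_query, waiting_lock_mode,
       blocking_pid, blocking_query, blocking_lock_mode,
       sql_kill_blocking_connection
FROM sys.innodb_lock_waits;
```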

These were all in an effort to get back to MySQL 5.7-style replication and performance, because with LOGICAL_CLOCK and replica_preserve_commit_order we were getting even worse performance and locking.

innodb_flush_log_at_trx_commit reverts to a value of 2 at runtime, but yes, I was surprised it works too.

There are no foreign keys in any database, we removed them years ago due to other performance issues.

As for the locks, we know which table is getting locked, because replication needs to update some of its records. Here is an example from sys.innodb_lock_waits:

                wait_started: 2024-04-09 08:30:33
                    wait_age: 00:00:09
               wait_age_secs: 9
                locked_table: `DB1`.`table1`
         locked_table_schema: DB1
           locked_table_name: table1
      locked_table_partition: NULL
   locked_table_subpartition: NULL
                locked_index: idx_contact
                 locked_type: RECORD
              waiting_trx_id: 117850381337
         waiting_trx_started: 2024-04-09 08:30:33
             waiting_trx_age: 00:00:09
     waiting_trx_rows_locked: 1
   waiting_trx_rows_modified: 1
                 waiting_pid: 1697121
               waiting_query: NULL
             waiting_lock_id: 139791340341016:665006:958203:570:139774144774368
           waiting_lock_mode: X,GAP,INSERT_INTENTION
             blocking_trx_id: 421266317044760
                blocking_pid: 2287080
              blocking_query: INSERT INTO `DB2`.`table2`.`Value`
            blocking_lock_id: 139791340334104:665006:958203:570:139784041354192
          blocking_lock_mode: S,GAP
        blocking_trx_started: 2024-04-09 08:29:52
            blocking_trx_age: 00:00:50
    blocking_trx_rows_locked: 16580219
  blocking_trx_rows_modified: 0
     sql_kill_blocking_query: KILL QUERY 2287080
sql_kill_blocking_connection: KILL 2287080

The blocking thread in question is 2287080:

2287080	XXX	XXX	DB1	Query	58	executing	INSERT INTO `DB2`.`table2`........ SELECT ...... COUNT(DISTINCT `table1`.`ContactId`) ............ FROM `table3` ........ ON DUPLICATE KEY UPDATE `XXXX` = VALUES(`XXXX`)	57595	0	0

The blocking query is an INSERT ... SELECT: it selects from DB1 (which is replicated from the primary to the replica) and inserts into DB2 (which is not replicated anywhere, exists only on the replica, and is used for aggregation).

While DB1.table1 is locked, replication starts lagging behind with:

1697121	system user		DB1	Query	24	Applying batch of row changes (write)	NULL	23867	0	0
1697120	system user		NULL	Query	21	Reading event from the relay log	NULL	20856	0	0

Some of those queries can run for several thousand seconds, which then causes replication to stop with the timeout error I shared in the first post.

Since LOGICAL_CLOCK will soon be the only option, it's better to start using this value now and work through the issues, rather than being forced to when the parameter changes.

Because 3 is invalid, MySQL must simply coerce it to 2; there's probably a message about it in the error log. You should set it to 2 explicitly.
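
The value actually in effect can be confirmed at runtime:

```sql
-- Check the effective (runtime) value, regardless of what my.cnf says
SELECT @@GLOBAL.innodb_flush_log_at_trx_commit;
```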

This, naturally, imposes Shared (S) locks on DB1.table1 as it is being SELECTed from. This prevents table1 from being modified while the SELECT part is running. Without this lock, table1 could be modified mid-statement and the INSERT would no longer be Consistent nor Isolated (i.e. ACID compliant). This lock also blocks replication.

Replication always operates in READ COMMITTED mode and you can't change that. I would suggest changing your INSERT ... SELECT to use READ COMMITTED, as I can see the query is also gap-locking: it is locking not only the actual existing rows, but also the "gaps" between rows and the "gap" at the end of the table. READ COMMITTED does not take gap locks.
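
That change can be scoped to just the aggregation session, leaving the global default at REPEATABLE READ (a sketch; DB2.table2 and the query body stand in for your real statement):

```sql
-- Affects only this session; the global default stays REPEATABLE READ
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- ... then run the reporting INSERT ... SELECT in the same session,
-- so its SELECT side takes no gap locks on the replicated table ...
```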

We will look to apply the default again, change the transaction isolation on the report queries to READ COMMITTED, and report back when we have more news.

Changing the transaction isolation to READ COMMITTED for that specific session seems to have helped reduce the locks. We will leave it running like that for a few more days on our test bed, but so far so good. Now to verify this didn't impact any of our underlying report data.