PXC configured as an async replica cannot keep up with replication lag

We have a 3-node PXC 8.0.36 setup that was used as the Primary DB for our workload.

The specs for each node are: 40 CPUs, 768 GB RAM, 10 Gb network. Dataset size is ~8 TB.

We were using it as a single writer and it was performing well, reaching 100K QPS at peak times.

Our workload is write-intensive, generating 150M+ new records per day, with row writes (inserts/updates/deletes) reaching ~15k changes per second at peak times.

We decided to move the workload to another physical server running Percona Server 8.0.45, and we wanted to use the PXC as an async replica playing the role of a hot backup.

We configured PXC node1 as an async replica of the new server and switched the applications to the new DB server (a single Percona Server instance).

The problem is that in this setup the PXC could not catch up with the changes happening on the Primary; it had ever-increasing replication lag. We then decided to shut down two nodes (node2 and node3) and leave only the node configured for replication (node1). Once node2 and node3 were down, node1 started processing incoming transactions 3-4 times faster and was able to catch up and keep the replication lag at 0s. When I try to join another node (say node2) back into the cluster, from the moment the node informs the cluster that it will join and starts the Galera buffer checks and related steps, I observe immediate throttling of transaction processing on node1, from ~140k handlers to ~40k handlers per second.
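For reference, the handler throughput above can be sampled directly from the global status counters; a minimal sketch (any sampling interval works, 10 seconds assumed here):

```sql
-- Sample the row-handler counters twice, 10 seconds apart; the delta
-- divided by 10 gives handler operations per second on this node.
SHOW GLOBAL STATUS LIKE 'Handler%';
SELECT SLEEP(10);
SHOW GLOBAL STATUS LIKE 'Handler%';
```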

The question is: why can the PXC, which once processed the same amount of transactions without issues, no longer keep up with them in a replica setup?

We don’t have big transactions - all are small and quick. Also, we have intentionally disabled the doublewrite buffer and use relaxed logging settings to reduce disk syncs. The metrics show that each of the 96 replication threads is assigned ~1% of the total transactions, so multi-threaded replication is distributing the load evenly. The WRITESET setup for async replication had been working fine for years between the PXC (as Primary) and three other replicas in different physical locations.
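A sketch of how the per-worker distribution can be verified (assuming performance_schema transaction instrumentation is enabled; table and column names as in MySQL 8.0):

```sql
-- Approximate per-worker applied-transaction counts by joining the applier
-- worker table with the transaction event summary.
SELECT w.worker_id,
       t.count_star AS transactions_applied
FROM performance_schema.replication_applier_status_by_worker AS w
JOIN performance_schema.events_transactions_summary_by_thread_by_event_name AS t
  ON t.thread_id = w.thread_id
WHERE t.event_name = 'transaction'
ORDER BY w.worker_id;
```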

There is no cluster flow control kicking in to slow things down, no high CPU or disk utilisation, and nothing obvious at the HW/OS level.
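For completeness, these are the standard wsrep counters checked to rule out flow control (a sketch; all are stock Galera status variables):

```sql
-- If flow control were the cause, these counters would be non-zero / growing:
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused',
   'wsrep_flow_control_sent',
   'wsrep_flow_control_recv',
   'wsrep_local_recv_queue_avg',
   'wsrep_local_send_queue_avg');
```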

Here are some settings from the PXC that I think might be important for this post:

binlog-format                     = ROW
binlog-row-image                  = MINIMAL
binlog-cache-size                 = 64K
binlog-expire-logs-seconds        = 172800
sync-binlog                       = 0
source-verify-checksum            = ON

gtid-mode                         = ON
enforce-gtid-consistency          = ON
log-replica-updates               = ON
replica-parallel-workers          = 96
replica-parallel-type             = LOGICAL_CLOCK
replica-preserve-commit-order     = ON
relay-log-info-repository         = TABLE
relay-log-recovery                = ON

innodb-write-io-threads           = 48
innodb-read-io-threads            = 24
innodb-io-capacity                = 10000
innodb-io-capacity-max            = 30000
innodb-doublewrite                = OFF
innodb_flush_log_at_timeout       = 1800

binlog-transaction-dependency-tracking      = WRITESET
binlog-transaction-dependency-history-size  = 250000
transaction-write-set-extraction            = XXHASH64

wsrep-provider-options            = "gcache.size=48G;gcache.page_size=4G;gcs.fc_limit=120;gcs.fc_master_slave=NO;gcache.recover=YES;"
wsrep-applier-threads             = 128
pxc-strict-mode                   = ENFORCING
pxc-encrypt-cluster-traffic       = OFF
wsrep-applier-FK-checks           = OFF

It’s difficult to determine exactly what went wrong without historical metrics. However, you can review your redo log capacity and buffer pool configuration relative to your workload and ensure they are properly tuned. Since no redo log or buffer pool configuration details were shared above, I assume they are running with default settings.

I was thinking those were not relevant to this specific problem, but here they are:

innodb-redo-log-capacity          = 40G
innodb-log-buffer-size            = 256M
innodb-buffer-pool-size           = 512G
innodb-buffer-pool-chunk-size     = 512M
innodb-buffer-pool-instances      = 32
innodb-page-cleaners              = 32

Hello @Perica_Jakimov,
I have several tables for you to inspect on the PXC nodes when you have 2 members of the cluster:

  • sys.io_by_thread_by_latency
  • sys.io_global_by_file_by_latency
  • sys.io_global_by_wait_by_latency

Perhaps this can give some insight as to what is happening. Anything in the logs of either PXC node?

Thanks for this proposal @matthewb, I needed some time to rebuild a cluster of 2 nodes, so here is the output of those tables.

For io_global_by_file_by_latency I have changed some schema and table names in order to anonymise them, and because I had more than 1k records I’ve limited the output to the top 100.

These stats are taken from the node that actually runs the async replication.

Meanwhile, I’ve tried changing replica_parallel_workers from 96 to 40 (as I have 40 CPUs) - no change in processing speed; then I tried 200 - still no change.
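For anyone repeating this test, the worker count can be changed online; a sketch of the sequence (the applier thread must be restarted for the new value to take effect):

```sql
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = 40;  -- also tested with 96 and 200
START REPLICA SQL_THREAD;
```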

Can you advise something else that I could try on the cluster?

Thanks

io_by_thread_by_latency.txt (26.9 KB)

io_global_by_file_by_latency.txt (18.0 KB)

io_global_by_wait_by_latency.txt (4.8 KB)

Hi Perica_Jakimov,

You can use the following blog post to tune replica_parallel_workers. Having 40 sounds like too much, since most of the load is usually handled by just a few threads:

If most of the load is done by just a few threads, the problem might be contention and not resource saturation.

You can also use the following blog post to tune the redo log:

That said, every new write must be certified and applied by all nodes before the next one can be committed. The more nodes and the higher the latency between them, the longer it takes for writes to go through.

You should take a look at both replication latency and flow control.
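A minimal way to watch both at once (standard commands, nothing PXC-specific assumed):

```sql
-- Async replication lag as seen by the replica:
SHOW REPLICA STATUS\G   -- check Seconds_Behind_Source and the gap between
                        -- Retrieved_Gtid_Set and Executed_Gtid_Set
-- Galera-side pauses (fraction of time paused since the last FLUSH STATUS):
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
```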

Regards

Thanks @CTutte, all of these things are already known to me.
And as I’ve already stated, the workload running with 96 parallel threads distributes transactions very well between them, so it is not the case that only a few of them are doing all the work.
The short story is:

  • The same PXC cluster was serving the same workload as the Primary DB (apps connected to one of its nodes) without issues.
  • When I want to use the PXC as a Replica, it lags and cannot apply transactions quickly enough.
  • When I use just one node from the PXC, running standalone, it can catch up and replicate everything without issues.
  • There is no flow control kicking in to slow down the cluster.

So, the problem we are trying to solve is:
Why does the PXC throttle the applying of replicated transactions by 3-4x when it is in a cluster setup?

On the other hand, we have used PXC as a replica on different workloads (with far fewer transactions) without any issues.

One main difference when a PXC acts as a replica is that you now need to have the binary log enabled (did you have it before?) with log-replica-updates (https://www.percona.com/blog/percona-xtradb-cluster-mysql-asynchronous-replication-and-log-slave-updates/), whereas it was not mandatory before.

Did you have binary log enabled before?
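To confirm, something like this on the replica node should show both enabled:

```sql
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('log_bin', 'log_replica_updates', 'binlog_format', 'binlog_row_image');
```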

First you need to figure out whether the problem is contention or resource saturation. Have you checked SHOW ENGINE INNODB STATUS for any contention?
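Besides the SEMAPHORES section of SHOW ENGINE INNODB STATUS, a more granular sketch (assuming wait instrumentation is enabled in performance_schema):

```sql
-- Top synchronization waits; heavy wait/synch/% events point to contention
-- rather than resource saturation.
SELECT event_name, count_star, sum_timer_wait
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE event_name LIKE 'wait/synch/%'
ORDER BY sum_timer_wait DESC
LIMIT 10;
```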

How much latency is there between the nodes?

Did anything change (resources, config, network, latency) on the PXC cluster compared to when it was the primary?

Regards

Sure, I had the binary log enabled, and log_replica_updates enabled.
When it was a Primary we had binlog-row-image=FULL, now when it is used as Replica it is set to MINIMAL.
Also, we are using binlog-transaction-dependency-tracking = WRITESET.

Here is a graph of the latency (node2 and 3 are in the cluster):

When PXC operates as the Primary, writesets are generated locally and replicated synchronously via Galera. However, when acting as an async replica, each binlog event must be reprocessed into a writeset, followed by certification and group communication. This introduces additional write amplification and synchronization overhead.
When multiple nodes are present in the cluster, each transaction must be delivered to and acknowledged by all nodes before commit, which significantly reduces throughput, especially in high-frequency, small-transaction workloads.
You may need to reconsider this architecture design in light of these requirements.
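If you want to quantify that overhead on the cluster, these are the usual starting points (a sketch; all are standard Galera status counters):

```sql
-- Group-communication round-trip and certification statistics:
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_evs_repl_latency',     -- min/avg/max/stddev of group comm latency
   'wsrep_cert_deps_distance',   -- avg parallelism seen by certification
   'wsrep_apply_window',
   'wsrep_commit_window');
```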