We have a 3-node PXC 8.0.36 setup that was used as the primary DB for our workload.
The specs for each node are: 40 CPUs, 768 GB RAM, 10 Gbit network. Dataset size is ~8 TB.
We were using it as a single-writer and it was performing well, reaching 100K QPS at peak times.
Our workload is write-intensive, generating 150M+ new records per day, with row writes (inserts/updates/deletes) reaching ~15k changes per second at peak time.
We decided to move the workload to another physical server running Percona Server 8.0.45, and we wanted to keep the PXC as an async replica playing the role of a hot backup.
We configured PXC node1 as an async replica of the new server and switched the applications over to the new DB server (a single Percona Server instance).
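For context, the replica channel on node1 was set up with standard GTID auto-positioning, roughly like the sketch below (host, user, and password are placeholders, not our actual values):

```sql
-- Hedged sketch of the async replica setup on PXC node1
-- (SOURCE_HOST/SOURCE_USER/SOURCE_PASSWORD are placeholders)
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'new-primary.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '***',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;
```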
The problem is that in this setup the PXC could not keep up with the changes happening on the primary; it had ever-increasing replication lag. We then decided to shut down two nodes (node2 and node3) and leave only the node configured for replication (node1). Once node2 and node3 were down, node1 started to process incoming transactions 3-4 times faster and is able to catch up and keep the replication lag at 0s. When I try to join another node (say node2) back into the cluster, from the moment the node informs the cluster that it will join and starts the joining sequence (Galera cache checks and so on), I can observe immediate throttling of transaction processing on node1, from ~140k handlers to ~40k handlers per second.
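The handler rates above come from sampling the global `Handler_%` counters; roughly, we take two snapshots a second apart and diff them:

```sql
-- Approximate applier throughput: sum of Handler_% counters.
-- Run twice, ~1s apart, and subtract to get handlers/second.
SELECT SUM(VARIABLE_VALUE) AS handlers_total
FROM performance_schema.global_status
WHERE VARIABLE_NAME LIKE 'Handler%';
```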
The question is: why can the PXC, which was once processing the same volume of transactions without issues, no longer process them in a replica setup?
We don't have big transactions; all are small and quick. Also, we have intentionally disabled the doublewrite buffer and use relaxed logging settings to reduce disk syncs. The metrics show that each of the 96 replication worker threads is assigned ~1% of the total transactions, so we have near-perfect multi-threaded replication. The WRITESET setup for async replication worked fine for years between the PXC (as primary) and 3 other replicas in different physical locations.
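This is (one way) how we check the per-worker distribution, assuming performance_schema transaction instrumentation is enabled; the thread name `thread/sql/replica_worker` is the 8.0.26+ naming:

```sql
-- Applied-transaction count per replication worker thread
SELECT t.THREAD_ID, e.COUNT_STAR AS transactions_applied
FROM performance_schema.events_transactions_summary_by_thread_by_event_name e
JOIN performance_schema.threads t USING (THREAD_ID)
WHERE t.NAME = 'thread/sql/replica_worker'
ORDER BY e.COUNT_STAR DESC;
```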
There is no cluster flow control kicking in to slow things down, no high CPU or disk utilisation, and nothing obvious at the HW/OS level.
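For completeness, this is what we check to rule out flow control; all the `wsrep_flow_control%` counters stay flat and the receive queue stays near zero while the throttling happens:

```sql
-- Galera flow control and receive-queue status on node1
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
```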
Here are some settings from the PXC that I think might be relevant here:
binlog-format = ROW
binlog-row-image = MINIMAL
binlog-cache-size = 64K
binlog-expire-logs-seconds = 172800
sync-binlog = 0
source-verify-checksum = ON
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON
replica-parallel-workers = 96
replica-parallel-type = LOGICAL_CLOCK
replica-preserve-commit-order = ON
relay-log-info-repository = TABLE
relay-log-recovery = ON
innodb-write-io-threads = 48
innodb-read-io-threads = 24
innodb-io-capacity = 10000
innodb-io-capacity-max = 30000
innodb-doublewrite = OFF
innodb_flush_log_at_timeout = 1800
binlog-transaction-dependency-tracking = WRITESET
binlog-transaction-dependency-history-size = 250000
transaction-write-set-extraction = XXHASH64
wsrep-provider-options = "gcache.size=48G;gcache.page_size=4G;gcs.fc_limit=120;gcs.fc_master_slave=NO;gcache.recover=YES;"
wsrep-applier-threads = 128
pxc-strict-mode = ENFORCING
pxc-encrypt-cluster-traffic = OFF
wsrep-applier-FK-checks = OFF
