Performance of the asynchronous replication

Hello there.
Over the last few days I have been reviewing GR on 8.0.36-28.
The setup is quite simple:
source: 5.7 production cluster → async rep →
destination 8.0.36-28 + anonymous GTID → async rep → destination GR 8.0.36-28, 4 nodes.
The GR consists of 4 nodes:
two nodes co-located and two far away.
The RTT between the nodes can be up to 62ms (mtr statistics).
GR uses the MYSQL communication stack.

The rep8 server is co-located with two other GR nodes (it is not part of the GR, just in the same geographical location).
Let’s name them for simplicity:
prd57 → rep8 → GR8
All GR8 nodes are configured with:

sync_binlog = 0
sync_source_info = 0
sync_relay_log = 0
sync_relay_log_info = 0
innodb_flush_log_at_trx_commit = 0
write concurrency set to 200 via group_replication_set_write_concurrency()
group_replication_paxos_single_leader set to ON (see the sketch below)
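
For reference, a minimal sketch of how these settings could be applied on a running 8.0 member; note that the write concurrency is set through the group_replication_set_write_concurrency() UDF rather than a system variable, and group_replication_paxos_single_leader only takes effect the next time Group Replication starts:

SET PERSIST sync_binlog = 0;
SET PERSIST sync_source_info = 0;
SET PERSIST sync_relay_log = 0;
SET PERSIST sync_relay_log_info = 0;
SET PERSIST innodb_flush_log_at_trx_commit = 0;
SET PERSIST group_replication_paxos_single_leader = ON;  # applied on the next GR start
SELECT group_replication_set_write_concurrency(200);     # UDF, run while the group is running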

At prd57 my system generates an average of 24GB of binlogs per day.
On rep8 (replica_parallel_workers = 64) I see the following statistics in mysqld.log:

Multi-threaded replica statistics for channel 'prd57': seconds elapsed = 120; 
events assigned = 1560289281; 
worker queues filled over overrun level = 0; 
waited due a Worker queue full = 0; 
waited due the total size = 0; 
waited at clock conflicts = 28294655133800 
waited (count) when Workers occupied = 107021 
waited when Workers occupied = 100201337200

Assuming each event is a transaction (at least that is what I understood from the documentation), my source produces an average of 13M trx per second.
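
A way to cross-check that rate independently of the mysqld.log counters is to sample the executed GTID set on the replica over an interval; this is a sketch, and it assumes a single source UUID so the set stays in the simple 'uuid:1-N' form:

SELECT @@GLOBAL.gtid_executed\G   # note the high end N of 'uuid:1-N'
DO SLEEP(60);
SELECT @@GLOBAL.gtid_executed\G   # note the new high end M
# (M - N) / 60 ≈ transactions applied per second on this replica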

On the GR8 (replica_parallel_workers = 132, all 4 nodes are primaries, one of the faraway nodes receives the replication stream):
Multi-threaded replica statistics for channel 'rep8':

seconds elapsed = 131; 
events assigned = 15285249; 
worker queues filled over overrun level = 0; 
waited due a Worker queue full = 0; 
waited due the total size = 0; 
waited at clock conflicts = 61831187482500 
waited (count) when Workers occupied = 15089 
waited when Workers occupied = 172320096500

i.e. an average of 116K trx per second.

Both replicas spent over 100 seconds waiting for the workers to be free.
The lag builds up slowly but constantly.

Interestingly, on the GR nodes the statistics for the group_replication_applier channel show a stable 30M+ events:
events assigned = 33902593, i.e. ~250K trx per second, and no waits or other activity.

Here is a typical picture of the waits for one of the worker threads on the GR node that receives the replication stream:

| PROCESSLIST_ID | THREAD_ID | THREAD_NAME        | STATE                      | OBJ                               | LOCK_TYPE | LOCK_MODE     | LOCK_STATUS | MD_LOCK_OBJ | MD_LOCK_TYPE        |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | NONE                              | TABLE     | IX            | GRANTED     | m1          | SHARED_WRITE        |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | db1.m1:1972269072                 | RECORD    | X,REC_NOT_GAP | GRANTED     | m1          | SHARED_WRITE        |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | mysql.slave_worker_info:'rep8', 4 | RECORD    | X,REC_NOT_GAP | GRANTED     | m1          | SHARED_WRITE        |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | NONE                              | TABLE     | IX            | GRANTED     | NULL        | INTENTION_EXCLUSIVE |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | db1.m1:1972269072                 | RECORD    | X,REC_NOT_GAP | GRANTED     | NULL        | INTENTION_EXCLUSIVE |
| 20110          | 20198     | sql/replica_worker | waiting for handler commit | mysql.slave_worker_info:'rep8', 4 | RECORD    | X,REC_NOT_GAP | GRANTED     | NULL        | INTENTION_EXCLUSIVE |

As we can see, each worker:

  1. locks the relevant row in mysql.slave_worker_info to update its position after the trx;
  2. locks the relevant row in the changed table;
  3. waits in “waiting for handler commit”, which is understandable since it needs to go through the 2PC and the local commit as well.
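
The exact query behind the lock picture above is not shown; a sketch that produces a similar view from performance_schema (8.0 column names; the join conditions are assumptions) could look like this:

SELECT t.PROCESSLIST_ID, t.THREAD_ID, t.NAME AS THREAD_NAME,
       t.PROCESSLIST_STATE AS STATE,
       dl.OBJECT_SCHEMA, dl.OBJECT_NAME, dl.LOCK_DATA,
       dl.LOCK_TYPE, dl.LOCK_MODE, dl.LOCK_STATUS,
       ml.OBJECT_NAME AS MD_LOCK_OBJ, ml.LOCK_TYPE AS MD_LOCK_TYPE
  FROM performance_schema.threads t
  JOIN performance_schema.data_locks dl ON dl.THREAD_ID = t.THREAD_ID
  LEFT JOIN performance_schema.metadata_locks ml ON ml.OWNER_THREAD_ID = t.THREAD_ID
 WHERE t.NAME LIKE '%replica_worker%';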

When I switch the GR to single-primary mode, the statistics do not change much.
However, when I shut down one of the faraway nodes and move the replication stream to one of the co-located nodes, the picture changes and the lag gets processed very fast.
I’ve traced the replica worker (while it slows down and the lag grows) using:

call ps_trace_thread(<thread>,'<output file>',NULL,NULL,TRUE,TRUE,TRUE);
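
The <thread> placeholder above is a performance_schema thread id; a sketch of how the worker threads of a channel can be located before picking one to trace (channel name as in this setup):

SELECT w.CHANNEL_NAME, w.WORKER_ID, w.THREAD_ID, t.PROCESSLIST_STATE
  FROM performance_schema.replication_applier_status_by_worker w
  JOIN performance_schema.threads t USING (THREAD_ID)
 WHERE w.CHANNEL_NAME = 'rep8';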

According to the trace, in the 60 seconds (the default for ps_trace_thread) one thread executed 792 transactions (GTID: XXXX), i.e. over 10 trx per second, which seems to be good.
The most significant waits:

sql/Waiting for an event from Coordinator: 7.2795 sec
sql/Worker_info::jobs_cond-wait in rpl_rli_pdb.cc:2367: 7.5108 sec
sql/waiting for handler commit: 40.3549 sec
group_rpl/COND_count_down_latch-wait in plugin_utils.h:438: 45.6882 sec

I believe that the wait for “COND_count_down_latch” includes the wait for “sql/waiting for handler commit”.
However, as we can see, the waits are not bounded by the 60-second trace window (together they add up to more than that, so they overlap), while the actual work is.
So maybe not. I am not sure how to improve the situation above.
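
A sketch of how this latch could be inspected through performance_schema; the instrument name is inferred from the trace output above ('group_rpl/COND_count_down_latch'), so it may need adjusting, and synchronization instruments are not timed by default:

UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'wait/synch/cond/group_rpl/%';

SELECT EVENT_NAME, COUNT_STAR,
       ROUND(SUM_TIMER_WAIT / 1e12, 2) AS wait_seconds   # SUM_TIMER_WAIT is in picoseconds
  FROM performance_schema.events_waits_summary_global_by_event_name
 WHERE EVENT_NAME LIKE '%COND_count_down_latch%';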

The most interesting questions are:

  1. Is the slowdown above related to the replication workers failing to keep up with the stream, and nothing to do with GR?
  2. Is the slowdown above proof that GR can’t keep up with the load and I need to consider other solutions (currently Galera works just fine)?
  3. Is there any configuration change I need to make to get this to work?
  4. What does COND_count_down_latch mean and how can I improve it? How can I investigate it? (I didn’t find anything in the documentation or the code.)

Many thanks.
Evgeni

This is not ideal. Loss of connectivity between the networks will result in the entire cluster going offline, or in a split-brain. Additionally, the MySQL manual says that GR is not ideal for WANs.

This is also not ideal; you should switch to the native XCom protocol.

This is highly unlikely. The fastest MySQL server I’ve ever seen using $15,000 FusionIO cards benchmarked 144,000 tps. A typical high-performance MySQL server will average 20-40K tps. Certainly not millions per second.

Unless each server has 256 CPU cores, this parameter is very oversized. You should not have more than half the CPU count as parallel workers. You need to leave CPU resources for all the other things MySQL needs to do, including disk access.

That is expected. Each trx must be synchronously replicated to a majority of members. The fewer members, the less time this action takes.
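
A rough back-of-the-envelope with the numbers from this topology, assuming consensus messages are not batched:

majority of a 4-node group   = 3 acknowledgements
acks available locally       = 2 (the co-located members)
=> every consensus round needs at least one cross-site round trip ≈ 62 ms
serialized upper bound       ≈ 1 / 0.062 s ≈ 16 rounds per second

Any higher commit rate has to come from batching and pipelining in the group communication layer.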

Thanks for your time.

  1. The setup with 4 nodes is for testing purposes; thanks, I am aware of the potential issues associated with this configuration.
  2. According to the documentation, the MYSQL communication stack is newer and is also used by NDB Cluster. I switched to it in an attempt to resolve the issue; XCOM behaved identically in my situation.
  3. The documentation states that “events assigned” is the transaction count. The source is a busy 6-node Galera cluster, so it may well produce these numbers. Do you think I misunderstood the documentation and my interpretation of “events assigned” is wrong? (I do see a correlation between this statistic and the cluster behaviour.)
  4. I have 256 cores per server, so there should be no issue with the number of threads. It’s worth mentioning that there is no resource starvation on the server side (although I had hoped for that, it’s not the case).
  5. While some delay is expected, it shouldn’t be significant (62ms RTT is pretty good even for local networks). If I accumulate a few hours’ worth of changes, then shut down replication and start up the GR, the overall recovery is almost immediate (well, not immediate, but reasonably fast).
  6. It seems that regular replication (not GR) is struggling because it needs to update mysql.slave_worker_info very often. I suspect some internal contention, but unfortunately I’m unable to find any evidence of that (one way to look for it is sketched after this list). Therefore, I am trying to understand what the COND_count_down_latch-wait is.
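
A sketch of one place such contention could show up; this only covers table I/O waits on mysql.slave_worker_info and will not surface latch-level contention:

SELECT OBJECT_SCHEMA, OBJECT_NAME, COUNT_STAR,
       ROUND(SUM_TIMER_WAIT / 1e12, 2) AS wait_seconds
  FROM performance_schema.table_io_waits_summary_by_table
 WHERE OBJECT_SCHEMA = 'mysql' AND OBJECT_NAME = 'slave_worker_info';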

Best Regards

I’m feeling quite lost at the moment, so any help or direction would be greatly appreciated.
I understand that an RTT below 100ms is considered quite good.
Despite this, I’m still experiencing lag. Could it be that one replication channel, even with 128 threads, simply isn’t able to keep up with the load? (Although on a single server it works well.)
Perhaps in GR all nodes will get the writes and the load would be “spread” over the nodes? Then the additional latency would not be so noticeable?

Sorry,
it seems I misinterpreted the documentation:
The ‘events assigned’ metric indicates the total number of events assigned by the coordinator since it started. Therefore, the difference between two reported values reflects the actual number of events, not the reported number itself.
It seems my system receives an average of 3K-4K tps, which is pretty low.
Now I am confused even more:
~23 tps per thread (128 threads, ~3K tps) and I still have such a lag?
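
A rough sanity check on those numbers, assuming each commit on the receiving member spends most of its time in “waiting for handler commit” on roughly one cross-site round trip:

~23 trx per second per worker  =>  ~1 / 23 ≈ 43 ms per transaction
cross-site RTT                 ≈   62 ms

The two figures are the same order of magnitude, which would at least be consistent with the commit path, rather than the row-apply work, being the bottleneck.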

Thanks!

All nodes do get the writes. But that doesn’t “spread” anything. If you insert 30K rows per second into the primary, that primary must replicate 30Krps to each other member, and each other member must process 30Krps and write 30Krps to disk. GR is not a write scaling solution; it’s an HA solution. NDB, on the other hand, is a true write scaling solution because the data is fragmented/sharded among the data nodes.

You could log into MySQL and run \s to easily verify this (below is my dummy server):

mysql> \s
...
Threads: 2  Questions: 10043  Slow queries: 0  Opens: 745  Flush tables: 3  Open tables: 664  Queries per second avg: 0.061

Agreed on confusion. Without saying their name, I have a well-known credit-card company as one of my customers. They use 5-node GRs for each ‘shard’ of data. Each shard runs a consistent 20K tps with no lag/flow control on any node. I can also tell you that the specs of their machines are far less than yours (32vCPU, 128GB RAM).

So, there’s either a major config difference (haven’t seen your entire config) or there’s an issue with your hardware on that lagging host. Do you have PMM monitoring all the nodes?

Have you benchmarked a standalone GR? (i.e. not fed via async replication) I would attempt this. Create a local 3-node group and blast it with sysbench. Then add a 4th remote node and blast it. See how it holds up. Keep the remote 4th node, but stop one of the local 3. Repeat. How does that go? Add a 5th remote node. Repeat.
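
For example, a sketch of such a sysbench run (hosts, credentials, table counts, and durations are placeholders):

# prepare the test schema on the current primary
sysbench oltp_write_only --mysql-host=<primary> --mysql-user=<user> --mysql-password=<pwd> \
    --mysql-db=sbtest --tables=16 --table-size=1000000 prepare

# write-heavy load for 5 minutes with 64 client threads, reporting every 10s
sysbench oltp_write_only --mysql-host=<primary> --mysql-user=<user> --mysql-password=<pwd> \
    --mysql-db=sbtest --tables=16 --table-size=1000000 --threads=64 --time=300 \
    --report-interval=10 run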

Gather some baseline metrics and see where things start to fail. Do yourself a huge favor and install PMM on each node to collect all the things. You can’t begin to correct what you’re not observing.

Thanks for your time.
The GR standalone works just fine. I don’t see any issues or lag.
btw, here is an example of \s output from the GR host that receives the replication stream:

Threads: 75 Questions: 861251340 Slow queries: 381641316 Opens: 4978 Flush tables: 4 Open tables: 2307 Queries per second avg: 2004.807

And here is a prod server (right now, Saturday morning, the traffic is not high):

Threads: 237 Questions: 30018857203 Slow queries: 287301524 Opens: 28680 Flush tables: 1 Open tables: 8192 Queries per second avg: 16685.142

The \s command does not report tps if I am not mistaken, and the average is cumulative, so it flattens spikes, doesn’t it? (my long_query_time=0)
btw, does your client use async replication together with GR, and what is the RTT between the GR nodes (if you can share)?
I think that the issue is async + GR + network latency.
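
For what it’s worth, a sketch of measuring the rate over an interval instead of the cumulative average (swapping 'Questions' for 'Com_commit' would approximate commits per second):

SELECT VARIABLE_VALUE INTO @q1
  FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Questions';
DO SLEEP(60);
SELECT VARIABLE_VALUE INTO @q2
  FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Questions';
SELECT (@q2 - @q1) / 60 AS questions_per_second;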

I do have PMM and I really don’t see anything there that can explain this (on either the Linux or the MySQL side).
My config is pretty simple (I have 500GB of memory on each server):


binlog_transaction_dependency_tracking              = 'WRITESET'
group_replication_recovery_compression_algorithms   = 'zstd'
group_replication_compression_threshold             = 102400
replica_pending_jobs_size_max                       = 536870912
replica_parallel_workers                            = 64
group_replication_paxos_single_leader               = 1
group_replication_transaction_size_limit            = 0
group_replication_member_expel_timeout              = 3600
sync_binlog                                         = 0
binlog_row_image                                    = minimal
binlog_transaction_compression                      = ON
max_binlog_size                                     = 1G
sync_source_info                                    = 0 
sync_relay_log                                      = 0 
innodb_dedicated_server                              = 1 # let the server choose memory settings
innodb_log_buffer_size                              = 104857600
innodb_write_io_threads                             = 64
innodb_read_io_threads                              = 64
innodb_monitor_enable                               = all
innodb_flush_method                                 = O_DIRECT_NO_FSYNC
innodb_io_capacity                                  = 2000
innodb_flush_log_at_trx_commit                      = 0 
innodb_doublewrite                                  = OFF
sort_buffer_size                                    = 128M
read_rnd_buffer_size                                = 128M
max_heap_table_size                                 = 2G
tmp_table_size                                      = 2G
transaction_isolation                               = READ-COMMITTED

Setting sync_source_info and sync_relay_log to 0 increased the performance of the async replication.

I really appreciate your help.
Evgeni