Hello there.
In recent days I am reviewing GR on 8.0.36-28.
The setup is quite simple:
source: 5.7 production cluster → asynch rep →
destination 8.0.36-28 + anonymous GTID → asynch rep → destination GR 8.0.36-28, 4 nodes.
GR is consist 4 nodes:
Two nodes co-located and two faraway.
When the RTT between the nodes can be up to 62ms (mttr statistics).
GR use MYSQL communication stack.
rep8 server is co-located with two another nodes in GR (not part of GR, same geographical location)
Let’s name it for simplicity:
prd57 → rep8 → GR8
All GR8 configured with:
sync_binlog = 0
sync_source_info = 0
sync_relay_log = 0
sync_relay_log_info = 0
innodb_flush_log_at_trx_commit = 0
group_replication_get_write_concurrency set to 200
group_replication_paxos_single_leader set to ON
At prd57 my system generates an average of 24GB binlog’s per day.
On the rep8 (replica_parallel_workers = 64):
I see following statistics in mysqld.log:
Multi-threaded replica statistics for channel 'prd57': seconds elapsed = 120;
events assigned = 1560289281;
worker queues filled over overrun level = 0;
waited due a Worker queue full = 0;
waited due the total size = 0;
waited at clock conflicts = 28294655133800
waited (count) when Workers occupied = 107021
waited when Workers occupied = 100201337200
Considering each event is transaction (at least it what I understood from documentation) my source produce an average of 13M trx per second.
On the GR8 (replica_parallel_workers = 132, all 4 nodes primary, one of the faraway nodes receives the replication stream):
Multi-threaded replica statistics for channel ‘rep8’:
seconds elapsed = 131;
events assigned = 15285249;
worker queues filled over overrun level = 0;
waited due a Worker queue full = 0;
waited due the total size = 0;
waited at clock conflicts = 61831187482500
waited (count) when Workers occupied = 15089
waited when Workers occupied = 172320096500
i.e. average of 116K trx per second.
Both replicas spent over 100 seconds waiting for the workers to be free.
The lag build up slowly but constantly.
Interestingly on the GR nodes statistics for the group_replication_applier showing stable over 30M events
events assigned = 33902593, i.e ~250K trx per second. And no waits or other activity.
There is typical picture of waits for one of the worker threads on the GR node that receives the replication stream:
| PROCESSLIST_ID | THREAD_ID | THREAD_NAME | STATE | OBJ | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | MD_LOCK_OBJ | MD_LOCK_TYPE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | NONE | TABLE | IX | GRANTED | m1 | SHARED_WRITE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | db1.m1:1972269072 | RECORD | X,REC_NOT_GAP | GRANTED | m1 | SHARED_WRITE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | mysql.slave_worker_info:'rep8', 4 | RECORD | X,REC_NOT_GAP | GRANTED | m1 | SHARED_WRITE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | NONE | TABLE | IX | GRANTED | NULL | INTENTION_EXCLUSIVE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | db1.m1:1972269072 | RECORD| X,REC_NOT_GAP | GRANTED | NULL | INTENTION_EXCLUSIVE |
| 20110 | 20198 | sql/replica_worker | waiting for handler commit | mysql.slave_worker_info:'rep8', 4 | RECORD| X,REC_NOT_GAP | GRANTED | NULL | INTENTION_EXCLUSIVE |
As we can see each worker:
- lock relevant row in the table mysql.slave_worker_info to update position after trx.
- lock relevant row in the changed table
wait for “waiting for handler commit” which is understandable as it need to execute 2pc and local commit as well.
When I am switching the GR to be single-primary the statistics not changed much.
However, when I am shutdown one of the faraway nodes and move the replication stream to one of the co-located nodes the picture changed and lag get processed very fast.
I’ve traced the replica worker (when it slows down and lag growth) using:
call ps_trace_thread(<thread>,'<output file>',NULL,NULL,TRUE,TRUE,TRUE);
According to the trace in the 60 seconds (default for ps_trace) , one thread executed 792 transactions (GTID: XXXX) (over 10 trx per second, seems to be good).
The most significant waits:
sql/Waiting for an event from Coordinator: 7.2795 sec
sql/Worker_info::jobs_cond-wait in rpl_rli_pdb.cc:2367: 7.5108 sec
sql/waiting for handler commit: 40.3549 sec
group_rpl/COND_count_down_latch-wait in plugin_utils.h:438: 45.6882 sec
I believe that the wait for “COND_count_down_latch” includes the wait for “sql/waiting for handler commit”.
However, as we know, the wait is not bounded, but the work is.
So maybe not. I am not sure how to improve the situation above.
The most interesting questions is:
Is above slowdown is related to the fact that replication workers failing to keep with the stream and nothing to do with the GR.
Is above slowdown proof that GR can’t keep up with the load and I need to consider another solutions (currently galera working just fine)
Is there any configuration change that I need to do to make it work?
What it means and how improve: COND_count_down_latch ? How I can investigate it? (didn’t found any thing in documentation or code).
Many thanks.
Evgeni