Sudden transaction rate slowdown during sysbench test

Hi there!

I’m testing PXC 8 with Sysbench and ran into a weird problem: after about an hour of testing, the query rate dropped from approx. 10000 qps to approx. 2000 qps. From what I’ve observed so far, the only suspicious value is the TCP retransmission rate, which (at the time of the slowdown) went from approx. 0 ops on all nodes to approx. 3 ops on node1 and node2 and approx. 13 ops on node3 (the writer node). At 8:47 UTC node1 sent a couple of FC (flow control) messages, and at 8:49 UTC the cluster latency went up from around 1.5 ms to around 20 ms and the query rate dropped to approx. 2000 qps. After 8:49 UTC no further FC messages were sent, yet the cluster almost stalled.
This actually happened several times, but now I can document it. Can you please point me to the cause of this problem?

There are 3 PXC nodes in one AWS region, each residing in a separate AZ. A fourth node runs ProxySQL, and I’m running Sysbench locally on that node. ProxySQL has no query rules defined (no R/W split), only a simple mysql_galera_hostgroups entry with max_writers=1 (node3 is the writer here; node1 and node2 are only applying wsrep writesets). PMM is running on a fifth node. All of them are t3.medium instances with 4 GB of RAM.

sysbench /usr/share/sysbench/oltp_read_write.lua --db-driver=mysql --mysql-host= --mysql-user='sbuser' --mysql-password='sbpass' --mysql-port=6033 --mysql-db=sbtest --tables=1 --table_size=1000000 --db-ps-mode=disable --threads=16 --report-interval=1 --time=3600 --skip-trx=off --mysql-ignore-errors=all run

sysbench_report_from_the_time_of_slowndown.txt (13.1 KB)

node1-myq_status_from_the_time_of_slowndown…txt (48.7 KB)

I’m testing it again, now with only 2 nodes in the cluster. After some time there is a surge in the receive/send queues and the cluster hangs due to FC. mysqld.log shows some gcache pages being created and deleted. Should I increase the gcache size?
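
In case it helps anyone reproduce this, below is a minimal Python sketch of how the flow-control and queue counters could be pulled out of the status output on each node. It assumes the text comes from something like `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"` (tab-separated name/value pairs). The helper names (`parse_status`, `flow_control_summary`, `counter_delta`) are made up for illustration and are not part of any Percona tool.

```python
# Hypothetical helpers for summarising Galera flow-control counters.
# Input is assumed to be tab-separated "name<TAB>value" lines, e.g. from:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"

# wsrep status variables related to flow control and replication queues.
FC_KEYS = (
    "wsrep_flow_control_paused",
    "wsrep_flow_control_sent",
    "wsrep_flow_control_recv",
    "wsrep_local_recv_queue",
    "wsrep_local_recv_queue_avg",
    "wsrep_local_send_queue",
    "wsrep_local_send_queue_avg",
)


def parse_status(text):
    """Turn tab-separated status output into a {name: value} dict of strings."""
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) == 2:  # skip blank or malformed lines
            status[parts[0]] = parts[1]
    return status


def flow_control_summary(text):
    """Return only the flow-control/queue counters, converted to floats."""
    status = parse_status(text)
    return {key: float(status[key]) for key in FC_KEYS if key in status}


def counter_delta(before, after):
    """Difference between two summaries taken at different times."""
    return {key: after[key] - before[key] for key in after if key in before}
```

Taking one summary before the slowdown and one after, then comparing them with `counter_delta`, would show which node is sending FC messages and whether its receive queue keeps growing.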