Simple select causing flow control pause of whole cluster

Hello,
we are running Percona XtraDB Cluster 5.7.44 (MySQL 5.7) with 5 nodes (4 multi-master, 1 garbd arbitrator).

When we run a simple query, just SELECT COUNT(*) FROM SomeTable,
on a node that is dedicated to read requests only (there are usually no writeset operations on it apart from replication), it slows down the whole cluster. The node starts to send flow control messages, which lead to a flow control pause of the whole cluster.
However, if the same query is executed on a different node, where both writeset and SELECT operations run, it does not affect cluster performance. All nodes have the same HW and MySQL configuration. A few months ago these queries did not affect flow control. Since then we have migrated from MySQL 5.7.37 to 5.7.44 and moved the OS from CentOS to Ubuntu.

What could be the reason and how can we fix the situation?

Thank you in advance

Although the node is intended for read-only queries, the SELECT COUNT(*) query, especially when SomeTable is large, may cause significant disk I/O. This could result in the node lagging behind in replication, which triggers flow control.
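One way to check whether that is happening is to look at the Galera counters on the node itself; a minimal sketch (run it on each node for comparison):

-- A persistently non-zero receive queue means the node cannot apply
-- replicated writesets fast enough, which is what triggers flow control.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';

-- wsrep_flow_control_sent increasing on only one node identifies it as
-- the node asking the rest of the cluster to pause.
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';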

Since all nodes are running on identical hardware, it’s important to ensure that the read-only node isn’t encountering any resource bottlenecks such as CPU, memory, or disk I/O. To diagnose this, monitor the following:

  • CPU usage ( top or htop )
  • Disk I/O ( iostat , vmstat )
  • Memory usage ( free -m or top )

Hello @lebeda.jaromir,
In addition to what Abhinav said, please verify the value of wsrep_sync_wait on all nodes. If this session parameter is set higher on this one node, then the SELECT could cause flow control.
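For reference, you can check both scopes like this:

-- Global default and the value of the current session; a non-zero
-- setting makes reads wait for the node to catch up before executing.
SHOW GLOBAL VARIABLES LIKE 'wsrep_sync_wait';
SHOW SESSION VARIABLES LIKE 'wsrep_sync_wait';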

I would also use this tool, myq_gadgets (GitHub: jayjanssen/myq_gadgets; it has since been deprecated in favour of myq-tools), and have it running while you execute the SELECT to see what is happening.

That is a valid point, though we have not observed any sign of a HW bottleneck.

We use PMM to monitor our cluster. To give you a better picture, we deliberately triggered the flow control scenario by running the SELECT on the db2-prd node at 12:18 and captured the relevant PMM metrics from that time, so that you can compare it against normal operation.

The PXC Cluster Summary graph shows db2-prd sending flow control messages to the other nodes, triggering flow control across the cluster.

The HW-related graphs, in our opinion, don’t show signs of any resource limits being reached at any point.

Thank you for the tool, we will try it and come back with the results.

SELECT … FOR UPDATE? or SELECT … IN SHARE MODE?

It should be SELECT … IN SHARE MODE

please verify the value of wsrep_sync_wait on all nodes. If this session parameter is set higher on this one node, then the SELECT could cause flow control.

We have cross-checked it, and wsrep_sync_wait = 0 on all nodes.

Can you confirm that the same query without IN SHARE MODE does not cause flow control? Locking IN SHARE MODE locks rows against modification, which would block PXC's ability on this node to process new transactions. Thus, this node sends out flow control panics.
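To illustrate the difference (using the table name from your example; in 5.7 the locking variant is spelled LOCK IN SHARE MODE):

-- Non-locking read: uses a consistent snapshot, takes no row locks.
SELECT COUNT(*) FROM SomeTable;

-- Locking read: holds shared (S) locks on every scanned row until commit,
-- which can block the replication appliers on this node.
SELECT COUNT(*) FROM SomeTable LOCK IN SHARE MODE;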

Sorry, the query is executed without any other specification - neither “FOR UPDATE” nor “IN SHARE MODE”. It causes flow control panics even when it is running with isolation level READ UNCOMMITTED.
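For completeness, this is essentially what we run (a sketch with the placeholder table name from above):

-- No FOR UPDATE / LOCK IN SHARE MODE clause, and the loosest isolation level:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM SomeTable;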

This is the only SELECT query that causes the issue? Other SELECTs on this node are fine? If so, then there is something odd with the table. Is it MyISAM? Please confirm with SHOW CREATE TABLE. Anything in the local node error logs when you run the troublesome SELECT?

No, there are more SELECTs that can cause the issue. We have 2 queries which always do this (the first hits a partitioned table, though only one partition with 250 million rows; the second hits a non-partitioned table with about 800 million rows).
Other queries running at the same time against smaller tables can also cause the FC panic, so this behaviour is not a matter of a single table. All our tables are InnoDB.
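To confirm that the query really touches only one partition, we check the partitions column of the EXPLAIN output (table and column names below are placeholders, not our real schema):

-- The "partitions" column in 5.7 EXPLAIN output lists the partitions
-- that will actually be scanned for the query.
EXPLAIN SELECT COUNT(*)
FROM PartitionedTable
WHERE PartitionKeyColumn = 42;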
When running the queries, we see this type of warning in the error log (about 15 messages in 2 minutes):

[Warning] WSREP: Failed to report last committed 116370864992, -110 (Connection timed out)
Example of the table definition:
CREATE TABLE `TableXYState` (
  `TableXYStateId` bigint(20) NOT NULL AUTO_INCREMENT,
  `TableXYId` int(11) NOT NULL,
  `ZStateTypeId` int(11) NOT NULL,
  `AccountId` int(11) DEFAULT NULL COMMENT 'unused, can be dropped',
  `SessionId` bigint(20) DEFAULT NULL,
  `TimeChanged` datetime DEFAULT NULL,
  `Value` varchar(64) DEFAULT NULL,
  `ValueLong` longtext,
  PRIMARY KEY (`TableXYStateId`),
  UNIQUE KEY `UI_TableXYState_TableXYId` (`TableXYId`,`ZStateTypeId`),
  KEY `I_TableXYState_ZStateTypeId` (`ZStateTypeId`),
  CONSTRAINT `_FK_TableXYState_TableXYId` FOREIGN KEY (`TableXYId`) REFERENCES `TableXY` (`TableXYId`) ON DELETE CASCADE,
  CONSTRAINT `__FK_TableXYState_ZStateTypeId` FOREIGN KEY (`ZStateTypeId`) REFERENCES `ZStateType` (`ZStateTypeId`)
) ENGINE=InnoDB AUTO_INCREMENT=9093095802 DEFAULT CHARSET=utf8

Are these smaller tables referenced by the foreign keys of this larger table?

I’m wondering if the FKs are causing this issue. When you select from a table with FKs, there are inherent locks created on all the rows in the parent table along with all matching rows in every child table.
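To see the full FK fan-out around these tables, you can list everything that references them (this assumes the tables live in the schema you are currently connected to):

-- All foreign keys pointing at TableXY and TableXYState.
SELECT TABLE_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = DATABASE()
  AND REFERENCED_TABLE_NAME IN ('TableXY', 'TableXYState');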

While the query is causing FC, run some basic diagnostics: SHOW ENGINE INNODB STATUS\G, SHOW PROCESSLIST; and SELECT * FROM sys.innodb_lock_waits;
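A sketch of what I would capture on the read-only node while FC is active (sys.innodb_lock_waits assumes the 5.7 sys schema is installed):

-- Full InnoDB state; the TRANSACTIONS section shows lock waits and
-- long-running transactions, including the replication appliers.
SHOW ENGINE INNODB STATUS\G

-- Running statements; Galera applier threads show up as "system user".
SHOW FULL PROCESSLIST;

-- Blocking / blocked transaction pairs (sys schema view in 5.7).
SELECT * FROM sys.innodb_lock_waits\G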

Q: Are these smaller tables referenced by the foreign keys of this larger table?

Yes and no: the queries against smaller tables sometimes touch tables referenced by the large table, and sometimes the smaller tables in those queries are not related to the large table(s) at all. A single query against a smaller table usually does not cause FC. It seems these small queries cause FC only when several of them run at once.
We have the results from the diagnostic queries, but we consider them sensitive information. Can we send them to you directly?