Simple select causing flow control pause of whole cluster

Hello,
we are running Percona XtraDB Cluster 5.7.44 (MySQL 5.7) with 5 nodes (4 multi-master, 1 garbd arbitrator).

When we run a simple query, just SELECT COUNT(*) FROM SomeTable,
on a node that is dedicated to read requests only (there are usually no writeset operations on it apart from replication), it slows down the whole cluster. The node starts to send flow control messages, which lead to a flow control pause of the whole cluster.
However, if the same query is executed on a different node, where both writeset and SELECT operations run, it does not affect cluster performance. All nodes have the same HW and MySQL configuration. A few months ago these queries did not affect flow control. Since then we have migrated from MySQL 5.7.37 to 5.7.44 and moved the OS from CentOS to Ubuntu.

What could be the reason and how can we fix the situation?

Thank you in advance

Although the node is intended for read-only queries, the SELECT COUNT(*) query, especially when SomeTable is large, may cause significant disk I/O. This could result in the node lagging behind in replication, which triggers flow control.
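One way to check whether that is happening is to look at the Galera counters on the node itself; a minimal sketch (run it on each node for comparison):

-- A persistently non-zero receive queue means the node cannot apply
-- replicated writesets fast enough, which is what triggers flow control.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';

-- wsrep_flow_control_sent increasing on only one node identifies it as
-- the node asking the rest of the cluster to pause.
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';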

Since all nodes are running on identical hardware, it’s important to ensure that the read-only node isn’t encountering any resource bottlenecks such as CPU, memory, or disk I/O. To diagnose this, monitor the following:

  • CPU usage ( top or htop )
  • Disk I/O ( iostat , vmstat )
  • Memory usage ( free -m or top )

Hello @lebeda.jaromir,
In addition to what Abhinav said, please verify the value of wsrep_sync_wait on all nodes. If this session parameter is set higher on this one node, then the SELECT could cause flow control.
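For reference, you can check both scopes like this:

-- Global default and the value of the current session; a non-zero
-- setting makes reads wait for the node to catch up before executing.
SHOW GLOBAL VARIABLES LIKE 'wsrep_sync_wait';
SHOW SESSION VARIABLES LIKE 'wsrep_sync_wait';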

I would also use this tool, myq_gadgets (GitHub: jayjanssen/myq_gadgets; it has since been deprecated in favour of myq-tools), and have it running while you execute the SELECT to see what is happening.

That is a valid point, though we have not observed any sign of a HW bottleneck.

We use PMM to monitor our cluster. To give you a better picture, we deliberately triggered the flow control scenario by running the SELECT on the db2-prd node at 12:18 and captured the relevant PMM metrics from that time, so that you can compare it against normal operation.

The PXC Cluster Summary graph shows db2-prd sending flow control messages to the other nodes, triggering flow control across the cluster.

The HW-related graphs, in our opinion, don’t show signs of any resource limits being reached at any point.

Thank you for the tool, we will try it and come back with the results.

SELECT … FOR UPDATE? or SELECT … IN SHARE MODE?

It should be SELECT … IN SHARE MODE

please verify the value of wsrep_sync_wait on all nodes. If this session parameter is set higher on this one node, then the SELECT could cause flow control.

We have cross-checked it, and wsrep_sync_wait = 0 on all nodes.

Can you confirm that the same query without IN SHARE MODE does not cause flow control? Locking IN SHARE MODE locks rows against modification, which would block PXC's ability on this node to process new transactions. Thus, this node sends out flow control panics.
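To illustrate the difference (using the table name from your example; in 5.7 the locking variant is spelled LOCK IN SHARE MODE):

-- Non-locking read: uses a consistent snapshot, takes no row locks.
SELECT COUNT(*) FROM SomeTable;

-- Locking read: holds shared (S) locks on every scanned row until commit,
-- which can block the replication appliers on this node.
SELECT COUNT(*) FROM SomeTable LOCK IN SHARE MODE;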

Sorry, the query is executed without any other specification - neither “FOR UPDATE” nor “IN SHARE MODE”. It causes flow control panics even when it is running with isolation level READ UNCOMMITTED.
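For completeness, this is essentially what we run (a sketch with the placeholder table name from above):

-- No FOR UPDATE / LOCK IN SHARE MODE clause, and the loosest isolation level:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM SomeTable;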

This is the only SELECT query that causes the issue? Other SELECTs on this node are fine? If so, then there is something odd with the table. Is it MyISAM? Please confirm with SHOW CREATE TABLE. Anything in the local node error logs when you run the troublesome SELECT?

No, there are more SELECTs that can cause the issue. We have 2 queries which always do this (the first hits a partitioned table, though only one partition with 250 million rows; the second hits a non-partitioned table with about 800 million rows).
Other queries running at the same time against smaller tables can also cause the FC panic, so this behaviour is not a matter of a single table. All our tables are InnoDB.
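To confirm that the query really touches only one partition, we check the partitions column of the EXPLAIN output (table and column names below are placeholders, not our real schema):

-- The "partitions" column in 5.7 EXPLAIN output lists the partitions
-- that will actually be scanned for the query.
EXPLAIN SELECT COUNT(*)
FROM PartitionedTable
WHERE PartitionKeyColumn = 42;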
When running the queries, we see this type of warning in the error log (about 15 messages in 2 minutes):

[Warning] WSREP: Failed to report last committed 116370864992, -110 (Connection timed out)
Example of the table definition:
CREATE TABLE `TableXYState` (
  `TableXYStateId` bigint(20) NOT NULL AUTO_INCREMENT,
  `TableXYId` int(11) NOT NULL,
  `ZStateTypeId` int(11) NOT NULL,
  `AccountId` int(11) DEFAULT NULL COMMENT 'unused, can be dropped',
  `SessionId` bigint(20) DEFAULT NULL,
  `TimeChanged` datetime DEFAULT NULL,
  `Value` varchar(64) DEFAULT NULL,
  `ValueLong` longtext,
  PRIMARY KEY (`TableXYStateId`),
  UNIQUE KEY `UI_TableXYState_TableXYId` (`TableXYId`,`ZStateTypeId`),
  KEY `I_TableXYState_ZStateTypeId` (`ZStateTypeId`),
  CONSTRAINT `_FK_TableXYState_TableXYId` FOREIGN KEY (`TableXYId`) REFERENCES `TableXY` (`TableXYId`) ON DELETE CASCADE,
  CONSTRAINT `__FK_TableXYState_ZStateTypeId` FOREIGN KEY (`ZStateTypeId`) REFERENCES `ZStateType` (`ZStateTypeId`)
) ENGINE=InnoDB AUTO_INCREMENT=9093095802 DEFAULT CHARSET=utf8

Are these smaller tables referenced by the foreign keys of this larger table?

I’m wondering if the FKs are causing this issue. When you select from a table with FKs, there are inherent locks created on all the rows in the parent table along with all matching rows in every child table.
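To see the full FK fan-out around these tables, you can list everything that references them (this assumes the tables live in the schema you are currently connected to):

-- All foreign keys pointing at TableXY and TableXYState.
SELECT TABLE_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = DATABASE()
  AND REFERENCED_TABLE_NAME IN ('TableXY', 'TableXYState');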

While the query is causing FC, run some basic diagnostics: SHOW ENGINE INNODB STATUS\G, SHOW PROCESSLIST; and SELECT * FROM sys.innodb_lock_waits;
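A sketch of what I would capture on the read-only node while FC is active (sys.innodb_lock_waits assumes the 5.7 sys schema is installed):

-- Full InnoDB state; the TRANSACTIONS section shows lock waits and
-- long-running transactions, including the replication appliers.
SHOW ENGINE INNODB STATUS\G

-- Running statements; Galera applier threads show up as "system user".
SHOW FULL PROCESSLIST;

-- Blocking / blocked transaction pairs (sys schema view in 5.7).
SELECT * FROM sys.innodb_lock_waits\G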

Q: Are these smaller tables referenced by the foreign keys of this larger table?

Yes and no: the queries against smaller tables sometimes touch tables referenced by the large table, and sometimes the smaller tables in those queries are not related to the large table(s) at all. A single query against a smaller table usually does not cause FC. It seems these small queries cause FC only when several of them run at once.
We have the results from the diagnostic queries, but we consider them sensitive information. Can we send them to you directly?