[Galera] Failed to report last committed: -110 (Connection timed out)

serge · May 24, 2021, 4:12pm

I have a Percona XtraDB Cluster setup with 3 nodes on Ubuntu 20.04 LTS.

Server version: 8.0.22-13.1 Percona XtraDB Cluster (GPL), Release rel13, Revision a48e6d5, WSREP version 26.4.3

On node 1, a cron job launches an update procedure every hour. Sometimes I see that an SQL UPDATE query has frozen in the state “wsrep: replicating and certifying write set(-1)”.

Full SQL statement: UPDATE goods SET params=CAST('{\"warranty_months\":24,\"type_case\":60,\"polarity\":57,\"type_cleat\":54,\"type_akb\":45,\"length\":242,\"width\":175,\"height\":190,\"current\":600,\"capacity\":60,\"tech\":48,\"start_stop\":null,\"promo\":51}' AS JSON), updated_at='2021-05-21 12:24:49' WHERE id=1713

The update procedure runs a large number of such statements with similar data (several hundred) in one transaction. Each statement selects its row by primary key (WHERE id=1713). The goods table is not huge: it has only 1100 records.

When a statement freezes on node 1, the other nodes write this message to /var/log/mysql/error.log:

2021-05-21T20:40:00.892654Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed 796058d9-b8a2-11eb-9072-c25e8ba7694b:26442, -110 (Connection timed out)

If I force-restart node 2 or node 3, node 1 completes the hung statement and continues working normally. The problem may recur in the next cycle of the update procedure, or only after a day or more.

Do I need to reduce the number of statements in one transaction?
I would appreciate any recommendations for stabilizing the cluster.

Michael_Coburn · May 24, 2021, 6:18pm

Hi @serge - Welcome to the Percona Forums, thanks for posting!
To be clear, are you running this in a Stored Procedure? What happens when you run the UPDATE manually, do you see the same level of stalls? You may need to break up your Procedure / Transactions into smaller batches of record modifications so that you affect fewer rows each iteration. Look at a new feature of PXC 8 called Streaming Replication as it may be worth using in your situation:

Further, you can likely remove the CAST since PXC 8.0.22 supports the JSON data type. This will probably speed things up a little:
https://dev.mysql.com/doc/refman/8.0/en/json.html
Do you have PMM attached to this cluster? It would be helpful to see the volume of replicated bytes / Flow Control Sent / wsrep queue depth from the PXC Cluster Summary dashboard:
https://pmmdemo.percona.com/graph/d/pxc-cluster-summary/pxc-galera-cluster-summary
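
The suggestion above to break the procedure into smaller transactions could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `chunked` and `update_in_batches`, the batch size of 50, and the use of an in-memory SQLite database as a stand-in for the cluster are all hypothetical, not something taken from the original setup. The point is simply that each batch is committed on its own, so each replicated write set stays small.

```python
import sqlite3
from typing import Iterator, Sequence, Tuple

# One pending update: (params_json, updated_at, id). Hypothetical shape.
Row = Tuple[str, str, int]


def chunked(rows: Sequence[Row], size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def update_in_batches(conn, rows: Sequence[Row], batch_size: int = 50) -> int:
    """Apply the updates, committing after every batch.

    Returns the number of commits performed.
    Note: the `?` placeholder is SQLite's style; MySQL drivers
    typically expect `%s` instead.
    """
    commits = 0
    for batch in chunked(rows, batch_size):
        cur = conn.cursor()
        cur.executemany(
            "UPDATE goods SET params = ?, updated_at = ? WHERE id = ?",
            batch,
        )
        conn.commit()  # one small transaction per batch
        commits += 1
    return commits


if __name__ == "__main__":
    # Stand-in database so the sketch runs without a cluster.
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE goods (id INTEGER PRIMARY KEY, params TEXT, updated_at TEXT)"
    )
    demo.executemany("INSERT INTO goods (id) VALUES (?)", [(i,) for i in range(1, 301)])
    demo.commit()
    pending = [('{"capacity":60}', "2021-05-21 12:24:49", i) for i in range(1, 301)]
    print("commits:", update_in_batches(demo, pending, batch_size=50))
```

The trade-off is that the hourly update is no longer atomic as a whole: if a later batch fails, earlier batches stay committed. Streaming Replication is the alternative when the work must remain a single transaction; in Galera 4 it is controlled per session through the `wsrep_trx_fragment_unit` and `wsrep_trx_fragment_size` variables.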

vadimtk · May 24, 2021, 6:31pm

It might be related to bug [PXC-3580] Aggressive network outages on one node makes the whole cluster unusable (Percona JIRA).
It should be fixed in the upcoming Percona XtraDB Cluster 8.0.23 release.
