[Galera] Failed to report last committed: -110 (Connection timed out)

I have a Percona XtraDB Cluster setup with 3 nodes on Ubuntu 20.04 LTS.

Server version: 8.0.22-13.1 Percona XtraDB Cluster (GPL), Release rel13, Revision a48e6d5, WSREP version 26.4.3

On node 1, a cron job launches an update procedure every hour. Sometimes I see that an SQL UPDATE query is frozen in the state “wsrep: replicating and certifying write set(-1)”.

Full SQL statement: UPDATE goods SET params=CAST('{\"warranty_months\":24,\"type_case\":60,\"polarity\":57,\"type_cleat\":54,\"type_akb\":45,\"length\":242,\"width\":175,\"height\":190,\"current\":600,\"capacity\":60,\"tech\":48,\"start_stop\":null,\"promo\":51}' AS JSON), updated_at='2021-05-21 12:24:49' WHERE id=1713

The update procedure runs a large number of such statements with similar data (several hundred) in one transaction. Each statement filters on the primary key (WHERE id=1713). The goods table is not huge: it has only 1100 records.

When a statement freezes on node 1, the other nodes write this message to /var/log/mysql/error.log:

2021-05-21T20:40:00.892654Z 0 [Warning] [MY-000000] [Galera] Failed to report last committed 796058d9-b8a2-11eb-9072-c25e8ba7694b:26442, -110 (Connection timed out)

If I force a restart of node 2 or node 3, node 1 completes the hung statement and continues working normally. The problem may recur in the very next cycle of the update procedure, or only after a day or more.

Do I need to reduce the number of statements in one transaction?
I would appreciate any recommendations for stabilizing the cluster.


Hi @serge - Welcome to the Percona Forums, thanks for posting!!
To be clear, are you running this in a Stored Procedure? What happens when you run the UPDATE manually, do you see the same level of stalls? You may need to break up your Procedure / Transactions into smaller record modifications so that you affect fewer rows in each iteration. Look at a new feature of PXC 8 called Streaming Replication, as it may be worth using in your situation.
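
A minimal sketch of both suggestions, not a tested fix for this cluster. It assumes the update job can be driven from a Python script using any DB-API 2.0 driver. The helper names (`batches`, `update_goods`, `enable_streaming_replication`), the batch size of 50 and the fragment settings are made up for illustration. The session variables `wsrep_trx_fragment_unit` and `wsrep_trx_fragment_size` are the Galera 4 streaming replication settings.

```python
import json
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def enable_streaming_replication(cur, unit="statements", size=100):
    """Turn on Galera streaming replication for the current session.

    The defaults are illustrative, not tuned values. A size of 0 turns
    streaming replication off again.
    """
    if unit not in ("bytes", "rows", "statements"):
        raise ValueError("unsupported fragment unit: %r" % unit)
    cur.execute("SET SESSION wsrep_trx_fragment_unit = '%s'" % unit)
    cur.execute("SET SESSION wsrep_trx_fragment_size = %d" % int(size))


def update_goods(conn, rows, batch_size=50, placeholder="%s"):
    """Apply (id, params_dict, updated_at) rows, committing after each batch.

    `conn` is a DB-API 2.0 connection; `placeholder` must match the
    driver's paramstyle ("%s" for the common MySQL drivers).
    Assumes `params` is a JSON column, so the JSON text is passed as a
    plain string without CAST. Returns the number of commits issued.
    """
    sql = "UPDATE goods SET params = {p}, updated_at = {p} WHERE id = {p}".format(
        p=placeholder
    )
    commits = 0
    cur = conn.cursor()
    for chunk in batches(rows, batch_size):
        cur.executemany(
            sql, [(json.dumps(params), ts, gid) for gid, params, ts in chunk]
        )
        conn.commit()  # each batch becomes its own, smaller write set
        commits += 1
    cur.close()
    return commits
```

With several hundred rows and `batch_size=50`, each commit replicates a write set of at most 50 row changes instead of one large one. Streaming replication is the alternative when the work must stay in a single transaction.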

Further, you can likely remove the CAST, since PXC 8.0.22 supports the JSON data type; this will probably speed things up a little:
https://dev.mysql.com/doc/refman/8.0/en/json.html
Do you have PMM attached to this cluster? It would be helpful to see the volume of replicated bytes / Flow Control Sent / wsrep queue depth from the PXC Cluster Summary dashboard:
https://pmmdemo.percona.com/graph/d/pxc-cluster-summary/pxc-galera-cluster-summary
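
If PMM is not attached, the same kind of counters can be read directly from each node. The sketch below assumes a DB-API 2.0 cursor that returns rows as `(name, value)` tuples. The helper name `read_wsrep_status` and the choice of counters are illustrative and may not match the dashboard panels exactly.

```python
# Galera status counters roughly matching the metrics mentioned above.
WSREP_COUNTERS = (
    "wsrep_replicated_bytes",
    "wsrep_flow_control_sent",
    "wsrep_flow_control_paused",
    "wsrep_local_recv_queue",
    "wsrep_local_send_queue",
)


def read_wsrep_status(cur, names=WSREP_COUNTERS):
    """Return the selected wsrep status counters from one node as a dict.

    Values are returned as the driver delivers them (typically strings).
    """
    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
    wanted = set(names)
    return {name: value for name, value in cur.fetchall() if name in wanted}
```

Sampling this on all three nodes while an UPDATE is stalled would show whether flow control is being sent or a queue is growing on one particular node.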


It might be related to bug PXC-3580, “Aggressive network outages on one node makes the whole cluster unusable” (Percona JIRA).
It should be fixed in the upcoming Percona XtraDB Cluster 8.0.23 release.
