Transaction is accepted but lost due to cluster timeout

fernandomm · January 6, 2023, 2:45pm

Hi,

I’m running a 3 node Percona XtraDB Cluster, version 5.7.35-38-57-log.

It’s used for HA, meaning that our application only write/read from a single node and only switches to another node if the first one fails.

We had a brief network timeout in the main node and noticed that at least one transaction was lost. It was accepted by the main node ( no errors reported from Mysql ) but the 3 records created by it were lost.

Other transactions were also running but Percona returned an error ( as expected ) and we knew that the records were not created.

Fortunately we had an audit log with all SQL commands in our application and were able to recreate the records.

Besides that, all nodes are consistent and the tables have the same number of records/data. The main node recovered automatically as expected.

What could cause this issue? I thought that Percona Cluster would only accept a transaction when all nodes confirmed that it was received. But in this case, it looks like the transaction was lost.

This is log of the main node when the issue happened:

2023-01-06T13:16:52.486794Z 25722799 [Note] Got an error reading communication packets
2023-01-06T13:16:52.486794Z 25722800 [Note] Got an error reading communication packets
2023-01-06T13:16:53.254133Z 25722801 [Note] Got an error reading communication packets
2023-01-06T13:16:53.826187Z 0 [Warning] WSREP: last inactive check more than PT1.5S (3evs.inactive_check_period) ago (PT1.92549S), skipping check
2023-01-06T13:16:55.214330Z 0 [Warning] WSREP: last inactive check more than PT1.5S (3evs.inactive_check_period) ago (PT2.16205S), skipping check
2023-01-06T13:16:55.894537Z 25722804 [Note] Got an error reading communication packets
2023-01-06T13:16:56.011289Z 25722805 [Note] Got an error reading communication packets
2023-01-06T13:16:56.211994Z 25722803 [Note] Got an error reading communication packets
2023-01-06T13:16:56.557882Z 25722806 [Note] Got an error reading communication packets
2023-01-06T13:16:56.668552Z 25722808 [Note] Got an error reading communication packets
2023-01-06T13:16:57.199672Z 25722811 [Note] Got an error reading communication packets
2023-01-06T13:16:58.006007Z 25722816 [Note] Got an error reading communication packets
2023-01-06T13:17:25.913475Z 0 [Warning] WSREP: Failed to report last committed 9483525, -110 (Connection timed out)

Node 2 log:

2023-01-06T13:16:57.411487Z 0 [Warning] WSREP: Failed to report last committed 9483680, -110 (Connection timed out)
2023-01-06T13:17:33.619112Z 0 [Warning] WSREP: Failed to report last committed 9483787, -110 (Connection timed out)

Node 3 log:

2023-01-06T13:16:57.403986Z 0 [Warning] WSREP: Failed to report last committed 9483680, -110 (Connection timed out)
2023-01-06T13:17:34.621377Z 0 [Warning] WSREP: Failed to report last committed 9483787, -110 (Connection timed out)

matthewb · January 8, 2023, 2:55pm

It was accepted by the main node

What exactly does this mean?

Did the application receive a COMMIT ACK from MySQL? (Do you have this logged in the application as evidence?) If not, then this is not a guarantee that the transaction succeeded.

Can you find this trx in node1s binary log?

When the application sends the SQL ‘COMMIT’, the trx is synchronously replicated to the other nodes. All nodes must acknowledge this replication back to node1 before node1 can actually commit the trx. If node1 crashes anywhere during these steps, the trx is considered lost and never happened.

Topic		Replies	Views
Node failure during transaction -- what happens? Percona XtraDB Cluster 5.x	0	414	November 6, 2014
Percona Xtradb Cluster Master node down Percona XtraDB Cluster 5.x	1	112	January 22, 2025
my percona xtraDB cluster suddently dead and how to fix it Percona XtraDB Cluster 8.x community , mysql , percona	14	2300	August 27, 2020
Database cluster hang problem Percona XtraDB Cluster 8.x mysql , percona	5	1220	March 14, 2023
All nodes in cluster failed. Percona XtraDB Cluster 5.x	2	811	February 17, 2014

Transaction is accepted but lost due to cluster timeout

Related topics