Transaction is accepted but lost due to cluster timeout

Hi,

I’m running a 3 node Percona XtraDB Cluster, version 5.7.35-38-57-log.

It’s used for HA, meaning that our application only write/read from a single node and only switches to another node if the first one fails.

We had a brief network timeout in the main node and noticed that at least one transaction was lost. It was accepted by the main node ( no errors reported from Mysql ) but the 3 records created by it were lost.

Other transactions were also running but Percona returned an error ( as expected ) and we knew that the records were not created.

Fortunately we had an audit log with all SQL commands in our application and were able to recreate the records.

Besides that, all nodes are consistent and the tables have the same number of records/data. The main node recovered automatically as expected.

What could cause this issue? I thought that Percona Cluster would only accept a transaction when all nodes confirmed that it was received. But in this case, it looks like the transaction was lost.

This is log of the main node when the issue happened:

2023-01-06T13:16:52.486794Z 25722799 [Note] Got an error reading communication packets
2023-01-06T13:16:52.486794Z 25722800 [Note] Got an error reading communication packets
2023-01-06T13:16:53.254133Z 25722801 [Note] Got an error reading communication packets
2023-01-06T13:16:53.826187Z 0 [Warning] WSREP: last inactive check more than PT1.5S (3evs.inactive_check_period) ago (PT1.92549S), skipping check
2023-01-06T13:16:55.214330Z 0 [Warning] WSREP: last inactive check more than PT1.5S (3
evs.inactive_check_period) ago (PT2.16205S), skipping check
2023-01-06T13:16:55.894537Z 25722804 [Note] Got an error reading communication packets
2023-01-06T13:16:56.011289Z 25722805 [Note] Got an error reading communication packets
2023-01-06T13:16:56.211994Z 25722803 [Note] Got an error reading communication packets
2023-01-06T13:16:56.557882Z 25722806 [Note] Got an error reading communication packets
2023-01-06T13:16:56.668552Z 25722808 [Note] Got an error reading communication packets
2023-01-06T13:16:57.199672Z 25722811 [Note] Got an error reading communication packets
2023-01-06T13:16:58.006007Z 25722816 [Note] Got an error reading communication packets
2023-01-06T13:17:25.913475Z 0 [Warning] WSREP: Failed to report last committed 9483525, -110 (Connection timed out)

Node 2 log:

2023-01-06T13:16:57.411487Z 0 [Warning] WSREP: Failed to report last committed 9483680, -110 (Connection timed out)
2023-01-06T13:17:33.619112Z 0 [Warning] WSREP: Failed to report last committed 9483787, -110 (Connection timed out)

Node 3 log:

2023-01-06T13:16:57.403986Z 0 [Warning] WSREP: Failed to report last committed 9483680, -110 (Connection timed out)
2023-01-06T13:17:34.621377Z 0 [Warning] WSREP: Failed to report last committed 9483787, -110 (Connection timed out)

1 Like

It was accepted by the main node

What exactly does this mean?

Did the application receive a COMMIT ACK from MySQL? (Do you have this logged in the application as evidence?) If not, then this is not a guarantee that the transaction succeeded.

Can you find this trx in node1s binary log?

When the application sends the SQL ‘COMMIT’, the trx is synchronously replicated to the other nodes. All nodes must acknowledge this replication back to node1 before node1 can actually commit the trx. If node1 crashes anywhere during these steps, the trx is considered lost and never happened.

1 Like