Cluster failure

egel · December 21, 2023, 11:59pm

Hello,
I experienced a strange outage on my cluster and am trying to understand how it might have happened. All nodes but one failed with:


2023-12-16T13:32:19.215529Z 69 [ERROR] Slave SQL: Could not execute Delete_rows event on table db1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 189, Error_code: 1032

On the node that remain stable (and I did SST from this node to all others) in bin log I found:


### DELETE FROM `db1`.`table1`
### WHERE
###   @2='unique value' # @2 is PK column, my PK is one column and this is varchar(28)
# at 1063084213
#231216 13:32:19 server id XXX  end_log_pos 1063084244 CRC32 0x7bbada40 Xid = 23984160517

I am really curious how it possible that data was exists only on one node. What else I can check? (unfortunately I have no backup of all bin logs and can’t dump it on the other nodes).

P.S. yes I know the version is really old and I am in progress to move to percona 8.1 + group replication + ProxySQL

Corrado_Pandiani · December 22, 2023, 2:54pm

Hi egel,

excluding the bug option for such old version a possible explanation is the following.

It is possible on PXC to intentionally disable replication for a session setting wsrep_on session variable.

SET wsrep_on = OFF

This parameter defines whether or not updates made in the current session replicate to the cluster. It does not cause the node to leave the cluster and the node continues to communicate with other nodes. Additionally, it is a session variable. Defining it through the SET GLOBAL syntax also affects future sessions.

So, if you deleted rows on a session having replication disabled you caused data inconsistencies. You got an error on that DELETE statement but it is possible you had around other missing or altered data.

The only way to recover inconsistencies is doing an SST, and you did correctly.

This is the best explanation I can provide.

egel · December 23, 2023, 7:26pm

Thanks.
Theoretically, you are absolutely right; someone could execute SET wsrep_on = OFF on the session level and insert a value into the table that leads to failure. However, this is highly unlikely since this specific table is only used by the application, and no one has any, even theoretical, reason to insert data there. It’s worth noting that access to the database server is limited to very few trusted individuals, and no one else has the right to even select, let alone change data.

Moreover, I am certain that the suspicious row was inserted back in November (part of the PK represents date and time). I happen to have error logs from each cluster node from November, and no errors occurred at this specific time. Is there any bug (I didn’t find one) that may be related to the unsuccessful replication of a varchar-based PK? It’s a very intriguing situation, I must admit.

I appreciate your time very much

Topic		Replies	Views
Ha_err_key_not_found Percona XtraDB Cluster 5.x	1	1708	October 2, 2018
Percona xtradb cluster crash: HA_ERR_ROW_IS_REFERENCED Percona XtraDB Cluster 5.x	3	1666	February 15, 2018
One node in the cluster gets stopped Percona XtraDB Cluster 5.x	7	5325	July 31, 2014
PXC 5.6 : second and third node shutdown Percona XtraDB Cluster 5.x	1	672	October 24, 2017
Problem with replication Percona XtraDB Cluster 5.x	4	1380	January 4, 2021

Cluster failure

Related topics