Cluster failure

Hello,
I experienced a strange outage on my cluster and am trying to understand how it might have happened. All nodes but one failed with:


2023-12-16T13:32:19.215529Z 69 [ERROR] Slave SQL: Could not execute Delete_rows event on table db1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 189, Error_code: 1032

On the node that remain stable (and I did SST from this node to all others) in bin log I found:


### DELETE FROM `db1`.`table1`
### WHERE
###   @2='unique value' # @2 is PK column, my PK is one column and this is varchar(28)
# at 1063084213
#231216 13:32:19 server id XXX  end_log_pos 1063084244 CRC32 0x7bbada40 Xid = 23984160517

I am really curious how it possible that data was exists only on one node. What else I can check? (unfortunately I have no backup of all bin logs and can’t dump it on the other nodes).

P.S. yes I know the version is really old and I am in progress to move to percona 8.1 + group replication + ProxySQL

Hi egel,

excluding the bug option for such old version a possible explanation is the following.

It is possible on PXC to intentionally disable replication for a session setting wsrep_on session variable.

SET wsrep_on = OFF

This parameter defines whether or not updates made in the current session replicate to the cluster. It does not cause the node to leave the cluster and the node continues to communicate with other nodes. Additionally, it is a session variable. Defining it through the SET GLOBAL syntax also affects future sessions.

So, if you deleted rows on a session having replication disabled you caused data inconsistencies. You got an error on that DELETE statement but it is possible you had around other missing or altered data.

The only way to recover inconsistencies is doing an SST, and you did correctly.

This is the best explanation I can provide.

Thanks.
Theoretically, you are absolutely right; someone could execute SET wsrep_on = OFF on the session level and insert a value into the table that leads to failure. However, this is highly unlikely since this specific table is only used by the application, and no one has any, even theoretical, reason to insert data there. It’s worth noting that access to the database server is limited to very few trusted individuals, and no one else has the right to even select, let alone change data.

Moreover, I am certain that the suspicious row was inserted back in November (part of the PK represents date and time). I happen to have error logs from each cluster node from November, and no errors occurred at this specific time. Is there any bug (I didn’t find one) that may be related to the unsuccessful replication of a varchar-based PK? It’s a very intriguing situation, I must admit.

I appreciate your time very much