Percona xtradb cluster crash: HA_ERR_ROW_IS_REFERENCED

Hi,

We migrated a series of 100G Databases into different Galera Instances.Its running fine so far. I do have 5-6 years experience with Galera.
The cluster has 5 Nodes and the Loadbalancers always targets 1 Node for Read/Write and only takes the second(third…) when the primary node is down (or reports wsrep_ready off).

Version is the same on all servers:
Server version: 5.6.37-82.2-56-log Percona XtraDB Cluster (GPL), Release rel82.2, Revision 114f2f2, WSREP version 26.21, wsrep_26.21

In the night, always roughly at the same time, all 4 non-active nodes drop-out of the cluster at the same time with this error:
2018-01-29 01:26:56 33304 [ERROR] Slave SQL: Could not execute Delete_rows event on table storedb.object; Cannot delete or update a parent row: a foreign key constraint fails (storedb.object, CONSTRAINT fk_object_object FOREIGN KEY (parentid) REFERENCES object (objectid)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED; the event’s master log FIRST, end_log_pos 250, Error_code: 1451
2018-01-29 01:26:56 33304 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 152, 3097067
2018-01-29 01:26:56 33304 [Warning] WSREP: Failed to apply app buffer: seqno: 3097067, status: 1

The table is looking like this:
| object | CREATE TABLE object (
objectid int(11) NOT NULL AUTO_INCREMENT,
classid int(11) NOT NULL,
parentid int(11) DEFAULT NULL,
alias varchar(255) COLLATE utf8_bin NOT NULL,
siteid int(11) NOT NULL,
PRIMARY KEY (objectid),
UNIQUE KEY u_object_guid (guid),
UNIQUE KEY u_object_alias (parentid,alias),
KEY i_object_class (classid),
KEY i_object_siteid_classid_alias (siteid,classid,alias),
CONSTRAINT fk_object_object FOREIGN KEY (parentid) REFERENCES object (objectid),
CONSTRAINT fk_object_object_1 FOREIGN KEY (siteid) REFERENCES object (objectid)
) ENGINE=InnoDB AUTO_INCREMENT= DEFAULT CHARSET=utf8 COLLATE=utf8_bin |

The Query that is generating the error is really easy:
DELETE FROM storedb.object WHERE objectid=‘highnumber’;

I saved the datadir from the moment before the cluster crashed (just kept one node out of the cluster after the crash and started it without wsrep provider to have a state to play with).
I can issue the query above with no problem. It deletes the row.

A corresponding GRA.log file is generated but mysqbinlog complains that it cant read the event:
ERROR: Error in Log_event::read_log_event(): ‘Found invalid event in binary log’, data_len: 105, event_type: 32

I am running weekly consistency checks (percona pt-table-checksum) and they report no inconsistency.

Any idea what is happening here?

Greetings
julian

Hey,

as an update: It is NOT happening when wsrep_slave_threads are set to 1.
But this is not a solution as it slows down the Cluster.

Hi,

Do you have reproducible test-case? I tried looking at the output but don’t foresee any immediate clue.

Hi,

I am still working on that. It seems quite hard to trigger (with my limited skills in programming). With bash, mysql binary and mysqlimport, even in parrallel i can’t trigger it. I do have code inside a big application i can run and it crashes in a few seconds but that is nothing i can provide so far.

Greetings