Occasional node failure (FK related)

Hi all,

We have a 3 node cluster set up, with one of the three nodes acting as a replication slave to a standard Percona Server. There is no other traffic on the server at this time.

Strangely, we're finding that every now and then one node (sometimes two nodes) will die due to node consistency being compromised. The transactions in question are committed without problem on the replication master and on the cluster node acting as replication slave.

We’re currently running the latest version of Percona Cluster as installed from the Percona repository (Percona-XtraDB-Cluster-server-5.5.28-23.7.369.rhel6.x86_64 and Percona-XtraDB-Cluster-galera-2.0-1.117.rhel6.x86_64) and have the following configuration applied on all three nodes (with appropriate address changes for various wsrep settings):

[mysqld_safe]
wsrep_urls=gcomm://,gcomm://

[mysqld]
datadir=/var/lib/mysql
user=mysql
max_connections=4000
server-id=40
binlog_format=ROW
default_storage_engine=InnoDB
long_query_time=10
log_error=/var/log/mysql/error_log
log_slow_queries=/var/log/mysql/slow_query_log
log_slow_verbosity=full
sync_binlog=0
log_bin=bin-log
log_slave_updates=1
wsrep_node_address=

When the error does occur, the following error occurs (always involving the same tables):

121123 15:03:32 [ERROR] Slave SQL: Could not execute Write_rows event on table nablive.t_schedule_payment; Cannot add or update a child row: a foreign key constraint fails (ntproddb.t_schedule_payment, CONSTRAINT t_schedule_payment_ibfk_1 FOREIGN KEY (sched_id) REFERENCES t_schedule (id) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 232, Error_code: 1452
121123 15:03:32 [Warning] WSREP: RBR event 2 Write_rows apply warning: 151, 1663670
121123 15:03:32 [ERROR] WSREP: Failed to apply trx: source: eed70aa6-346b-11e2-0800-24241b746e7d version: 2 local: 0 state: APPLYING flags: 1 conn_id: 4 trx_id: 676177914 seqnos (l: 59090, g: 1663670, s: 1663669, d: 1663365, ts: 1353643412747819106)
121123 15:03:32 [ERROR] WSREP: Failed to apply app buffer: ���P�, seqno: 1663670, status: WSREP_FATAL
	 at galera/src/replicator_smm.cpp:apply_wscoll():49
	 at galera/src/replicator_smm.cpp:apply_trx_ws():120
121123 15:03:32 [ERROR] WSREP: Node consistency compromized, aborting...
121123 15:03:32 [Note] WSREP: Closing send monitor...
121123 15:03:32 [Note] WSREP: Closed send monitor.
121123 15:03:32 [Note] WSREP: gcomm: terminating thread
121123 15:03:32 [Note] WSREP: gcomm: joining thread
121123 15:03:32 [Note] WSREP: gcomm: closing backend
121123 15:03:32 [Note] WSREP: view(view_id(NON_PRIM,05aece68-3511-11e2-0800-5e30f373f04f,9) memb { 05aece68-3511-11e2-0800-5e30f373f04f,} joined {} left {} partitioned { eed70aa6-346b-11e2-0800-24241b746e7d, fbc2a20c-3503-11e2-0800-2365c6b05d93,})
121123 15:03:32 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
121123 15:03:32 [Note] WSREP: view((empty))
121123 15:03:32 [Note] WSREP: gcomm: closed
121123 15:03:32 [Note] WSREP: Flow-control interval: [16, 16]
121123 15:03:32 [Note] WSREP: Received NON-PRIMARY.
121123 15:03:32 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 1663674)
121123 15:03:32 [Note] WSREP: Received self-leave message.
121123 15:03:32 [Note] WSREP: Flow-control interval: [0, 0]
121123 15:03:32 [Note] WSREP: Received SELF-LEAVE. Closing connection.
121123 15:03:32 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 1663674)
121123 15:03:32 [Note] WSREP: RECV thread exiting 0: Success
121123 15:03:32 [Note] WSREP: recv_thread() joined.
121123 15:03:32 [Note] WSREP: Closing slave action queue.
121123 15:03:32 [Note] WSREP: /usr/sbin/mysqld: Terminated.
121123 15:03:33 mysqld_safe Number of processes running now: 0
121123 15:03:33 mysqld_safe WSREP: not restarting wsrep node automatically

The CREATE TABLE statements for the two tables in the FK relationship at the centre of this error are:

CREATE TABLE t_schedule (
  id int(11) NOT NULL AUTO_INCREMENT,
  name varchar(50) NOT NULL,
  version int(11) NOT NULL,
  client_ref varchar(50) NOT NULL,
  freq int(11) NOT NULL,
  start_dt datetime NOT NULL,
  to_dt datetime DEFAULT NULL,
  exec_cnt int(11) DEFAULT NULL,
  created_by int(11) NOT NULL,
  created_on datetime NOT NULL,
  payment_type int(11) NOT NULL,
  lastmod_by int(11) DEFAULT NULL,
  lastmod_on datetime DEFAULT NULL,
  merchant_id varchar(6) NOT NULL,
  type_id smallint(6) NOT NULL DEFAULT '0',
  KEY merchant_id_fk (merchant_id),
  KEY id (id),
  KEY fk_schedule_isof_type (type_id),
  CONSTRAINT merchant_id_fk FOREIGN KEY (merchant_id) REFERENCES t_merchants (Merchant_ID),
  CONSTRAINT t_schedule_ibfk_1 FOREIGN KEY (type_id) REFERENCES t_schedule_type (id)
) ENGINE=InnoDB AUTO_INCREMENT=71507 DEFAULT CHARSET=latin1

CREATE TABLE t_schedule_payment (
  id int(11) NOT NULL AUTO_INCREMENT,
  version int(11) NOT NULL,
  processed bit(1) NOT NULL,
  cc_holder varchar(50) DEFAULT NULL,
  cc_exp_yy varchar(2) DEFAULT NULL,
  cc_exp_mm varchar(2) DEFAULT NULL,
  created_by int(11) NOT NULL,
  created_on datetime NOT NULL,
  cc_no varchar(60) DEFAULT NULL,
  transfer_dt datetime NOT NULL,
  lastmod_by int(11) DEFAULT NULL,
  lastmod_on datetime DEFAULT NULL,
  amt int(11) DEFAULT NULL,
  canceled bit(1) NOT NULL,
  pan varchar(12) DEFAULT NULL,
  sched_id int(11) NOT NULL,
  currency char(3) NOT NULL DEFAULT 'AUD',
  card_seq_no char(3) DEFAULT NULL,
  crn varchar(20) DEFAULT NULL,
  dd_bsb varchar(20) DEFAULT NULL,
  dd_acc_no varchar(50) DEFAULT NULL,
  dd_acc_name varchar(255) DEFAULT NULL,
  batch_id int(11) DEFAULT NULL,
  batch_trxid int(11) DEFAULT NULL,
  KEY id (id),
  KEY sched_id_fk (sched_id),
  CONSTRAINT t_schedule_payment_ibfk_1 FOREIGN KEY (sched_id) REFERENCES t_schedule (id) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3964335 DEFAULT CHARSET=latin1

To rule out data corruption / referential integrity issues, we’ve run the following query on all the available nodes and gotten the same results:

mysql> select distinct sched_id from t_schedule_payment where sched_id not in (select id from t_schedule);
Empty set (2.47 sec)
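For completeness, the other two FK constraints visible in the t_schedule DDL above (merchant_id_fk and t_schedule_ibfk_1) can be checked the same way — a sketch, assuming the referenced tables t_merchants and t_schedule_type live in the same schema:

-- orphaned merchant references (should return Empty set)
select distinct merchant_id from t_schedule
where merchant_id not in (select Merchant_ID from t_merchants);

-- orphaned schedule-type references (should return Empty set)
select distinct type_id from t_schedule
where type_id not in (select id from t_schedule_type);

If any of these return rows on one node but not the others, that node's data has diverged.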

I’ve seen a few mentions of FK strangeness in this forum recently but it appeared that various patches had solved this issue. Is there something obvious (or not obvious, or undocumented etc) that we’re missing here?

If there’s any further information we can provide, we definitely will.

I have the same issue, but with primary keys: duplicate-key errors that in theory should not even be possible.

Anyone figure this out?

Foreign keys seem to be pretty tricky in this kind of clustering solution, and parallel applying makes them even more complicated. That said, many FK-related bugs have been fixed so far.
This thread is pretty old, but Shockwavecs - do you experience anything similar?
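One way to narrow this down (a sketch, not a confirmed fix): force the applier back to a single thread so parallel applying can be ruled out, by setting this in my.cnf on each node and restarting:

[mysqld]
wsrep_slave_threads=1

If the FK failures stop with a single applier thread, that points at a parallel-apply ordering bug rather than genuinely inconsistent data on the node.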