2 of 3 nodes crash

Hi,

we have cluster in LAN, 3 nodes.We run group of test(create DB, create tables, run coherently a lot of transactions, Drop DB).All write/read requests are going to first node. Sometimes, other two nodes simultaneously crash with such errors on both :

2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948796
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948798, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948796, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948796
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948798, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798

But it doesn’t reproduce 100%, we can run similar test two or three time(each time new DB) and only after fourth time 2 or 1 nodes crashed.
Also,there is no errors on first node(where we run test) and all test finished succesfully.
I tried to look into audit logs of global logs, but there is nothing. What else I can do? and what this errors means?

OS:centos 6.6
Percona 5.6 latest.

Seems to be the same with BUG report:
[URL]https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1264992[/URL]

Can you provide all my.cnf files from 3 nodes? as well as detailed information about “secondary” or “donor” nodes?
Was there any crash recovery with your nodes?

All three nodes have similar HW and SW configurations. After crash I just restart failed node with full SST.
Configurations below:

Node 1



[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.55
#Cluser name
wsrep_node_name=ru-db-01
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
#wsrep_sst_auth=replication_user:jGFcv97JnkMJRw
wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout = 3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 3500

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"

long_query_time = 4
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

Node 2

[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.56
#Cluser name
wsrep_node_name=ru-db-02
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
wsrep_sst_method=rsync
wsrep_sst_auth=replication_user:jGFcv97JnkMJRw

#wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout=3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 2000

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"

long_query_time = 0
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

Node 3

[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW
log-bin = mysql-bin

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.57
#Cluser name
wsrep_node_name=ru-db-03
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
wsrep_sst_method=rsync
wsrep_sst_auth=replication_user:jGFcv97JnkMJRw

#wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout=3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 2000

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"
long_query_time = 0
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

server-id=1

Error indicates Foreign key issue :


Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id `) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452 2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 294879

GRA file on failed node look like this:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#140909 16:18:49 server id 2 end_log_pos 120 CRC32 0x186105da Start: binlog v 4, server v 5.6.20-68.0-56-log created 140909 16:18:49 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
KfAOVA8CAAAAdAAAAHgAAAABAAQANS42LjIwLTY4LjAtNTYtbG9nAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAp8A5UEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAdoF
YRg=
'/*!*/;
# at 120
#140909 16:18:49 server id 2 end_log_pos 231 CRC32 0x746c0eae Previous-GTIDs
# 7b2c47a7-34d4-11e4-8087-0800272e951c:1,
# 88de2b68-cb2b-ee1b-4ac5-b9ce06787689:1-25
# at 231
#150624 14:59:51 server id 0 end_log_pos 83 CRC32 0x4e494745 Query thread_id=658 exec_time=0 error_code=0
SET TIMESTAMP=1435147191/*!*/;
SET @@session.pseudo_thread_id=658/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
B
/*!*/;
# at 314
#150624 14:59:51 server id 0 end_log_pos 151 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 382
#150624 14:59:51 server id 0 end_log_pos 207 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAAJcAAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAM8AAAAAALkJAAAAAAEAAgAE//AcAAAAAAAAAIsAAAAAAAAAAQAAAAEAAAA=
'/*!*/;
# at 438
#150624 14:59:51 server id 0 end_log_pos 275 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 506
#150624 14:59:51 server id 0 end_log_pos 331 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAABMBAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAEsBAAAAALkJAAAAAAEAAgAE//AfAAAAAAAAAIsAAAAAAAAAAgAAAAEAAAA=
'/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Here is more detailed:

ysqlbinlog -vvv GRA_bin.log 
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#140909 16:18:49 server id 2 end_log_pos 120 CRC32 0x186105da Start: binlog v 4, server v 5.6.20-68.0-56-log created 140909 16:18:49 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
KfAOVA8CAAAAdAAAAHgAAAABAAQANS42LjIwLTY4LjAtNTYtbG9nAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAp8A5UEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAdoF
YRg=
'/*!*/;
# at 120
#140909 16:18:49 server id 2 end_log_pos 231 CRC32 0x746c0eae Previous-GTIDs
# 7b2c47a7-34d4-11e4-8087-0800272e951c:1,
# 88de2b68-cb2b-ee1b-4ac5-b9ce06787689:1-25
# at 231
#150624 14:59:51 server id 0 end_log_pos 83 CRC32 0x4e494745 Query thread_id=658 exec_time=0 error_code=0
SET TIMESTAMP=1435147191/*!*/;
SET @@session.pseudo_thread_id=658/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
B
/*!*/;
# at 314
#150624 14:59:51 server id 0 end_log_pos 151 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 382
#150624 14:59:51 server id 0 end_log_pos 207 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAAJcAAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAM8AAAAAALkJAAAAAAEAAgAE//AcAAAAAAAAAIsAAAAAAAAAAQAAAAEAAAA=
'/*!*/;
### INSERT INTO `test-rain-1002`.`machine_services`
### SET
### @1=28 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2=139 /* LONGINT meta=0 nullable=0 is_null=0 */
### @3=1 /* INT meta=0 nullable=0 is_null=0 */
### @4=***Corrupted replication event was detected. Not printing the value***
# at 438
#150624 14:59:51 server id 0 end_log_pos 275 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 506
#150624 14:59:51 server id 0 end_log_pos 331 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAABMBAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAEsBAAAAALkJAAAAAAEAAgAE//AfAAAAAAAAAIsAAAAAAAAAAgAAAAEAAAA=
'/*!*/;
### INSERT INTO `test-rain-1002`.`machine_services`
### SET
### @1=31 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2=139 /* LONGINT meta=0 nullable=0 is_null=0 */
### @3=2 /* INT meta=0 nullable=0 is_null=0 */
### @4=***Corrupted replication event was detected. Not printing the value***
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Hi All
was it somehow resolved?
we have the same issue from time to time

the source code at client side looks like

open a connection
start a transction
call a stored procedure

the store procedure has 2 inserts into a master-details table
first it inserts into table1 and gets the ID inserted
and then inserts into table2. table 2 has a foreign key to table1.

commit the transaction
close the connection

the second node can crash from time time to time with
Cannot add or update a child row: a foreign key constraint fails…
Error_code: 1452; handler error HA_ERR_
NO_REFERENCED_ROW; the event’s master log FIRST…

so it looks like the replication is trying to apply the SECOND insert before the FIRST one and crashes with the FK error

any advise how to handle it to avoid the above?