2 of 3 nodes crash

Hi,

we have cluster in LAN, 3 nodes.We run group of test(create DB, create tables, run coherently a lot of transactions, Drop DB).All write/read requests are going to first node. Sometimes, other two nodes simultaneously crash with such errors on both :

2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948796
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948798, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948796, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948796
2015-06-23 21:36:30 30464 [Warning] WSREP: Failed to apply app buffer: seqno: 2948798, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-06-23 21:36:30 30464 [ERROR] Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id
`) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452
2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 2948798

But it doesn’t reproduce 100%, we can run similar test two or three time(each time new DB) and only after fourth time 2 or 1 nodes crashed.
Also,there is no errors on first node(where we run test) and all test finished succesfully.
I tried to look into audit logs of global logs, but there is nothing. What else I can do? and what this errors means?

OS:centos 6.6
Percona 5.6 latest.

Seems to be the same with BUG report:
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1264992

Can you provide all my.cnf files from 3 nodes? as well as detailed information about “secondary” or “donor” nodes?
Was there any crash recovery with your nodes?

All three nodes have similar HW and SW configurations. After crash I just restart failed node with full SST.
Configurations below:

Node 1



[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.55
#Cluser name
wsrep_node_name=ru-db-01
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
#wsrep_sst_auth=replication_user:jGFcv97JnkMJRw
wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout = 3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 3500

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"

long_query_time = 4
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

Node 2

[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.56
#Cluser name
wsrep_node_name=ru-db-02
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
wsrep_sst_method=rsync
wsrep_sst_auth=replication_user:jGFcv97JnkMJRw

#wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout=3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 2000

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"

long_query_time = 0
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

Node 3

[mysqld_safe]

[mysqld]
datadir = /percona_data
user = mysql

collation-server=utf8_unicode_ci
binlog_format=ROW
log-bin = mysql-bin

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://10.251.0.55,10.251.0.56,10.251.0.57

# Node #1 address
wsrep_node_address = 10.251.0.57
#Cluser name
wsrep_node_name=ru-db-03
wsrep_slave_threads=32
wsrep_cluster_name=001

wsrep_retry_autocommit = 3

# SST method
#wsrep_sst_method=xtrabackup
wsrep_sst_method=rsync
wsrep_sst_auth=replication_user:jGFcv97JnkMJRw

#wsrep_sst_method=rsync

# MyISAM storage engine has only experimental support

default_storage_engine=InnoDB

# This is a recommended tuning variable for performance

innodb_locks_unsafe_for_binlog=1

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

character-set-server=utf8

log-error=/var/log/mysql-errors.log

# innodb engine settings
innodb-file-per-table = 1
innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 128M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 128M
innodb_lock_wait_timeout=3600

# performance settings
skip-name-resolve

query_cache_size = 32M
query_cache_type = 1
query_cache_min_res_unit = 1024

join_buffer_size = 1M
key_buffer = 160M
table_open_cache = 4000
table_definition_cache = 4000
sort_buffer = 1M
thread_cache = 128
thread_stack = 262144
tmp_table_size = 128M
max_heap_table_size = 128M
wait-timeout = 120
read_buffer_size = 32M
max_connections = 2000

table_open_cache = 3000
read-buffer-size = 524288

# network
connect_timeout = 600000
wait_timeout = 28800
max_allowed_packet = 512M
max_connect_errors = 1000
net_read_timeout=600000
connect_timeout=600000
net_write_timeout=600000

# safety settings
skip-show-database = 1
skip-symbolic-links = 1
local-infile = 0

event_scheduler=on
#optimizer_switch='derived_merge=off,derived_with_keys=off'
#wsrep_provider_options="gcache.size=5G; repl.commit_order=1;" #gmcast.segment=2"
wsrep_provider_options="gcache.size=5G;"
long_query_time = 0
slow_query_log=1
slow_query_log_file=/var/log/mysql-slow-query.log
#log-slow-queries = /var/log/mysqld-slow-query.log

server-id=1

Error indicates Foreign key issue :


Slave SQL: Could not execute Write_rows event on table test-rain-17165.machine_services; Cannot add or update a child row: a foreign key constraint fails (`test-rain-17165`.`machine_services`, CONSTRAINT `machine_registration_fk` FOREIGN KEY (`machine_id`) REFERENCES `machines` (`id `) ON DELETE NO ACTION ON UPDATE NO ACTION), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 209, Error_code: 1452 2015-06-23 21:36:30 30464 [Warning] WSREP: RBR event 3 Write_rows apply warning: 151, 294879

GRA file on failed node look like this:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#140909 16:18:49 server id 2 end_log_pos 120 CRC32 0x186105da Start: binlog v 4, server v 5.6.20-68.0-56-log created 140909 16:18:49 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
KfAOVA8CAAAAdAAAAHgAAAABAAQANS42LjIwLTY4LjAtNTYtbG9nAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAp8A5UEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAdoF
YRg=
'/*!*/;
# at 120
#140909 16:18:49 server id 2 end_log_pos 231 CRC32 0x746c0eae Previous-GTIDs
# 7b2c47a7-34d4-11e4-8087-0800272e951c:1,
# 88de2b68-cb2b-ee1b-4ac5-b9ce06787689:1-25
# at 231
#150624 14:59:51 server id 0 end_log_pos 83 CRC32 0x4e494745 Query thread_id=658 exec_time=0 error_code=0
SET TIMESTAMP=1435147191/*!*/;
SET @@session.pseudo_thread_id=658/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
B
/*!*/;
# at 314
#150624 14:59:51 server id 0 end_log_pos 151 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 382
#150624 14:59:51 server id 0 end_log_pos 207 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAAJcAAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAM8AAAAAALkJAAAAAAEAAgAE//AcAAAAAAAAAIsAAAAAAAAAAQAAAAEAAAA=
'/*!*/;
# at 438
#150624 14:59:51 server id 0 end_log_pos 275 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 506
#150624 14:59:51 server id 0 end_log_pos 331 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAABMBAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAEsBAAAAALkJAAAAAAEAAgAE//AfAAAAAAAAAIsAAAAAAAAAAgAAAAEAAAA=
'/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Here is more detailed:

ysqlbinlog -vvv GRA_bin.log 
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#140909 16:18:49 server id 2 end_log_pos 120 CRC32 0x186105da Start: binlog v 4, server v 5.6.20-68.0-56-log created 140909 16:18:49 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
KfAOVA8CAAAAdAAAAHgAAAABAAQANS42LjIwLTY4LjAtNTYtbG9nAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAp8A5UEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAdoF
YRg=
'/*!*/;
# at 120
#140909 16:18:49 server id 2 end_log_pos 231 CRC32 0x746c0eae Previous-GTIDs
# 7b2c47a7-34d4-11e4-8087-0800272e951c:1,
# 88de2b68-cb2b-ee1b-4ac5-b9ce06787689:1-25
# at 231
#150624 14:59:51 server id 0 end_log_pos 83 CRC32 0x4e494745 Query thread_id=658 exec_time=0 error_code=0
SET TIMESTAMP=1435147191/*!*/;
SET @@session.pseudo_thread_id=658/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=3, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=192/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
B
/*!*/;
# at 314
#150624 14:59:51 server id 0 end_log_pos 151 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 382
#150624 14:59:51 server id 0 end_log_pos 207 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAAJcAAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAM8AAAAAALkJAAAAAAEAAgAE//AcAAAAAAAAAIsAAAAAAAAAAQAAAAEAAAA=
'/*!*/;
### INSERT INTO `test-rain-1002`.`machine_services`
### SET
### @1=28 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2=139 /* LONGINT meta=0 nullable=0 is_null=0 */
### @3=1 /* INT meta=0 nullable=0 is_null=0 */
### @4=***Corrupted replication event was detected. Not printing the value***
# at 438
#150624 14:59:51 server id 0 end_log_pos 275 CRC32 0x04000303 Table_map: `test-rain-1002`.`machine_services` mapped to number 2489
# at 506
#150624 14:59:51 server id 0 end_log_pos 331 CRC32 0x00000001 Write_rows: table id 2489 flags: STMT_END_F

BINLOG '
t5uKVRMAAAAARAAAABMBAAAAALkJAAAAAAEADnRlc3QtcmFpbi0xMDAyABBtYWNoaW5lX3NlcnZp
Y2VzAAQICAMDAAQ=
t5uKVR4AAAAAOAAAAEsBAAAAALkJAAAAAAEAAgAE//AfAAAAAAAAAIsAAAAAAAAAAgAAAAEAAAA=
'/*!*/;
### INSERT INTO `test-rain-1002`.`machine_services`
### SET
### @1=31 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2=139 /* LONGINT meta=0 nullable=0 is_null=0 */
### @3=2 /* INT meta=0 nullable=0 is_null=0 */
### @4=***Corrupted replication event was detected. Not printing the value***
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;