PXC 5.6.24 - BF applier failed to open_and_lock_tables

I am running a 3-node PXC cluster and I keep getting random crashes on all 3 nodes.

Setup:
  • Ubuntu 14.04.2 LTS
  • 60 GB RAM
  • SSD RAID 10 (130 GB)
  • PXC 5.6.24-72.2-56-log - Percona XtraDB Cluster (GPL), Release rel72.2, Revision 43abf03, WSREP version 25.11, wsrep_25.11

[mysqld]

# GENERAL #

bind-address = 0.0.0.0
character-set-server = utf8
collation-server = utf8_general_ci
default_storage_engine = InnoDB
event-scheduler = ON
pid-file = /var/run/mysqld/mysqld.pid
port = 3306
server-id = 1
socket = /var/run/mysqld/mysqld.sock
user = mysql

# MyISAM #

key-buffer-size = 32M
myisam-recover-options = FORCE,BACKUP

# SAFETY #

innodb = FORCE
innodb-strict-mode = 1
max-allowed-packet = 64M
max-connect-errors = 1000000
skip-external-locking
skip-host-cache
skip-name-resolve
sql-mode = STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION
sysdate-is-now = 1

# DATA STORAGE #

datadir = /var/lib/mysql

# BINARY LOGGING #

expire-logs-days = 14
log-bin = /var/lib/mysql/mysql-bin
log-slave-updates
sync-binlog = 1

# CACHES AND LIMITS #

back-log = 1000
connect-timeout = 20
interactive-timeout = 30
join-buffer-size = 8M
max-binlog-size = 100M
max-connections = 2000
max-heap-table-size = 32M
open-files-limit = 65535
preload-buffer-size = 65536
query-cache-size = 0
query-cache-type = 0
sort-buffer-size = 2M
read-buffer-size = 4M
read-rnd-buffer-size = 4M
table-definition-cache = 4096
table-open-cache = 5000
thread-cache-size = 100
thread-stack = 256K
tmp-table-size = 32M
wait-timeout = 30

# INNODB #

innodb-buffer-pool-instances = 8
innodb-buffer-pool-size = 40G
innodb-file-per-table = 1
innodb-flush-log-at-trx-commit = 1
innodb-flush-method = O_DIRECT
innodb-lock-wait-timeout = 15
innodb-log-files-in-group = 2
innodb-log-file-size = 512M

# LOGGING #

log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 0
slow-query-log = 0

# WSREP #

wsrep_provider = /usr/lib/galera3/libgalera_smm.so
wsrep_cluster_address = gcomm://,,
binlog_format = ROW
innodb_autoinc_lock_mode = 2
wsrep_node_address =
wsrep_node_name = "db01"
wsrep_sst_method = xtrabackup-v2
wsrep_cluster_name =
wsrep_sst_auth = ""
wsrep_slave_threads = 8
wsrep_notify_cmd = /etc/mysql/wsrep_notify

I was getting crashes every 1-3 days on all 3 nodes until PXC 5.6.24 was released; since upgrading I get them about once a week. At first I suspected a specific cron job because the crash timestamps had similar minutes, but I couldn't reproduce the error by running the job manually. Since then the crash timestamps have varied, so I haven't been able to find any pattern. When a node crashes, I get the same error across the board:

2015-07-09 00:44:36 19638 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 46677454)
2015-07-09 00:44:36 19638 [Warning] WSREP: RBR event 3 Write_rows apply warning: 1615, 46677454
2015-07-09 00:44:36 19638 [Warning] WSREP: Failed to apply app buffer: seqno: 46677454, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2015-07-09 00:44:36 19638 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 46677454)
2015-07-09 00:44:36 19638 [Warning] WSREP: RBR event 3 Write_rows apply warning: 1615, 46677454
2015-07-09 00:44:36 19638 [Warning] WSREP: Failed to apply app buffer: seqno: 46677454, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-07-09 00:44:36 19638 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 46677454)
2015-07-09 00:44:36 19638 [Warning] WSREP: RBR event 3 Write_rows apply warning: 1615, 46677454
2015-07-09 00:44:36 19638 [Warning] WSREP: Failed to apply app buffer: seqno: 46677454, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-07-09 00:44:36 19638 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 46677454)
2015-07-09 00:44:36 19638 [Warning] WSREP: RBR event 3 Write_rows apply warning: 1615, 46677454
2015-07-09 00:44:36 19638 [Warning] WSREP: failed to replay trx: source: f3b46697-1ff6-11e5-af61-0b245f7246eb version: 3 local: 1 state: REPLAYING flags: 129 conn_id: 6286358 trx_id: 366113962 seqnos (l: 6786080, g: 46677454, s: 46677452, d: 46677453, ts: 2721954221674858)
2015-07-09 00:44:36 19638 [Warning] WSREP: Failed to apply trx 46677454 4 times
2015-07-09 00:44:36 19638 [ERROR] WSREP: trx_replay failed for: 6, query: void
2015-07-09 00:44:36 19638 [ERROR] Aborting
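
For what it's worth, when I was trying to match the crashes against the cron schedule, I pulled the crash times out of the error log with something like this (rough sketch; the log path comes from the config above):

# List the timestamp of every forced abort so it can be compared
# against the cron schedule on each node.
grep '\[ERROR\] Aborting' /var/log/mysql/mysql-error.log | awk '{print $1, $2}'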

After the node fails, it ALWAYS has to do an SST (instead of an IST). I have Googled around and found several people reporting the same issue, but no resolutions. Is this a configuration problem, or a bug that needs to be reported?
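
In case it is relevant, this is roughly how I check the node state after a crash before restarting it (just a sketch; the datadir path comes from the config above):

# After an unclean shutdown grastate.dat is usually left with seqno: -1,
# which on its own is enough to force a full SST instead of an IST.
cat /var/lib/mysql/grastate.dat

# Recover the last committed position from the InnoDB logs; it shows up
# in the error log as "WSREP: Recovered position: <uuid>:<seqno>".
mysqld_safe --wsrep-recover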

Any help would be appreciated. Thanks in advance!