Hi guys, I have an issue with a cluster. Two nodes keep disconnecting with the same error every day and need manual recovery.
Below are the details from the log at the time of the crash.
5:17:00 31496 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep=(exec_mode: 1 conflict_state: 5 seqno: 6071457)
5:17:00 31496 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 6071457
5:17:00 31496 [Warning] WSREP: Failed to apply app buffer: seqno: 6071457, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
…
Retrying 4th time
5:17:00 31496 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep=(exec_mode: 1 conflict_state: 5 seqno: 6071457)
5:17:00 31496 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 6071457
5:17:00 31496 [Warning] WSREP: failed to replay trx: source: 17232961-f35d-11e4-87a3-ab9ab4b84de5 version: 3 local: 1 state: REPLAYING flags: 1 conn_id: 62020 trx_id: 7769303 seqnos (l: 1216, g: 6071457, s: 6071455, d: 6071456, ts: 318245279979880)
5:17:00 31496 [Warning] WSREP: Failed to apply trx 6071457 4 times
5:17:00 31496 [ERROR] WSREP: trx_replay failed for: 6, query: void
5:17:00 31496 [ERROR] Aborting
5:17:02 31496 [Note] WSREP: killing local connection: 62028
5:17:02 31496 [Note] WSREP: killing local connection: 62031
5:17:02 31496 [Note] WSREP: killing local connection: 62019
5:17:02 31496 [Note] WSREP: Closing send monitor…
5:17:02 31496 [Note] WSREP: Closed send monitor.
5:17:02 31496 [Note] WSREP: gcomm: terminating thread
5:17:02 31496 [Note] WSREP: gcomm: joining thread
5:17:02 31496 [Note] WSREP: gcomm: closing backend
5:17:02 31496 [Note] WSREP: view(view_id(NON_PRIM,12f32c3b,97) memb {
17232961,2
} joined {
} left {
} partitioned {
12f32c3b,1
39f50f96,2
cf874de5,2
})
5:17:02 31496 [Note] WSREP: view((empty))
5:17:02 31496 [Note] WSREP: New COMPONENT: primary=no, bootstrap=no, my_idx=0, memb_num=1
5:17:02 31496 [Note] WSREP: gcomm: closed
5:17:02 31496 [Note] WSREP: Flow-control interval: [16, 16]
5:17:02 31496 [Note] WSREP: Received NON-PRIMARY.
5:17:02 31496 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6071459)
5:17:02 31496 [Note] WSREP: Received self-leave message.
5:17:02 31496 [Note] WSREP: Flow-control interval: [0, 0]
5:17:02 31496 [Note] WSREP: Received SELF-LEAVE. Closing connection.
5:17:02 31496 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 6071459)
5:17:02 31496 [Note] WSREP: RECV thread exiting 0: Success
5:17:02 31496 [Note] WSREP: recv_thread() joined.
5:17:02 31496 [Note] WSREP: Closing replication queue.
5:17:02 31496 [Note] WSREP: Closing slave action queue.
5:17:02 31496 [Note] WSREP: Service disconnected.
5:17:02 31496 [Note] WSREP: rollbacker thread exiting
5:17:03 31496 [Note] WSREP: Some threads may fail to exit.
5:17:03 31496 [Note] Binlog end
5:17:03 31496 [Note] Shutting down plugin 'partition'
5:17:03 31496 [Note] Shutting down plugin 'ARCHIVE'
5:17:03 31496 [Note] Shutting down plugin 'InnoDB'
5:17:03 31496 [Note] InnoDB: FTS optimize thread exiting.
5:17:03 31496 [Note] InnoDB: Starting shutdown…
04:17:03 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=15
max_threads=502
thread_count=12
connection_count=4
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads=208588 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.
Thread pointer: 0xc6c9260
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom=7f4c785a9d38 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8fa965]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x665644]
/lib64/libpthread.so.0[0x385940f710]
/usr/sbin/mysqld(_Z13gtid_rollbackP3THD+0x4a)[0x8855ea]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG8rollbackEP3THDb+0x129)[0x8b2719]
/usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0x74)[0x5a4e24]
/usr/sbin/mysqld(_Z14trans_rollbackP3THD+0x47)[0x78ec57]
/usr/sbin/mysqld(_ZN3THD7cleanupEv+0x25)[0x6b58e5]
/usr/sbin/mysqld(_ZN3THD17release_resourcesEv+0x288)[0x6b6558]
/usr/sbin/mysqld(_Z29one_thread_per_connection_endP3THDb+0x2e)[0x588f5e]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x101)[0x6bc4b1]
/usr/sbin/mysqld(handle_one_connection+0x47)[0x6bc717]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xaf611a]
/lib64/libpthread.so.0[0x38594079d1]
/lib64/libc.so.6(clone+0x6d)[0x38590e88fd]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 62033
Status: KILL_CONNECTION
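In case it helps anyone triaging the same pattern, here is a minimal sketch for checking whether it is always the same write-set and error code that fails to apply. It assumes the warning lines look exactly like the ones pasted above; the function name `summarize_apply_failures` and the `sample` list are made up for illustration and are not part of any Percona or Galera tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... WSREP: RBR event 3 Update_rows apply warning: 1615, 6071457"
# Captures the event type, the error code and the seqno.
APPLY_WARNING = re.compile(
    r"WSREP: RBR event \d+ (\w+) apply warning: (\d+), (\d+)"
)


def summarize_apply_failures(lines):
    """Count (event type, error code, seqno) triples found in apply warnings."""
    counts = Counter()
    for line in lines:
        match = APPLY_WARNING.search(line)
        if match:
            event = match.group(1)
            code = int(match.group(2))
            seqno = int(match.group(3))
            counts[(event, code, seqno)] += 1
    return counts


# Sample lines copied from the log above (hypothetical input for the sketch).
sample = [
    "5:17:00 31496 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 6071457",
    "5:17:00 31496 [Warning] WSREP: Failed to apply app buffer: seqno: 6071457, status: 1",
    "5:17:00 31496 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 6071457",
]

for (event, code, seqno), seen in summarize_apply_failures(sample).items():
    print(f"{event} error={code} seqno={seqno} seen {seen} time(s)")
```

To run it over a real error log, pass an open file handle instead of `sample`, since the function only needs an iterable of lines.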
Then, on rejoin, I get the following before the server fails to join:
8:09:29 12986 [Note] WSREP: Flow-control interval: [28, 28]
8:09:29 12986 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 6087090)
8:09:29 12986 [Note] WSREP: State transfer required:
Group state: 4f062c03-ee8e-11e4-b8b6-53722f757ac2:6087090
Local state: 00000000-0000-0000-0000-000000000000:-1
8:09:29 12986 [Note] WSREP: New cluster view: global state: 4f062c03-ee8e-11e4-b8b6-53722f757ac2:6087090, view# 17: Primary, number of nodes: 3, my index: 2, protocol version 3
8:09:29 12986 [Warning] WSREP: Gap in state sequence. Need state transfer.
8:09:29 12986 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.10.24' --auth 'xy' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '12986' '' '
8:09:29 12986 [Note] WSREP: Prepared SST request: rsync|192.168.10.24:4444/rsync_sst
8:09:29 12986 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
8:09:29 12986 [Note] WSREP: REPL Protocols: 7 (3, 2)
8:09:29 12986 [Note] WSREP: Service thread queue flushed.
8:09:29 12986 [Note] WSREP: Assign initial position for certification: 6087090, protocol version: 3
8:09:29 12986 [Note] WSREP: Service thread queue flushed.
8:09:29 12986 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (4f062c03-ee8e-11e4-b8b6-53722f757ac2): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():456. IST will be unavailable.
8:09:29 12986 [Note] WSREP: Member 2.2 (server-DB4) requested state transfer from 'any'. Selected 1.2 (server-DB2)(SYNCED) as donor.
8:09:29 12986 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 6087090)
8:09:29 12986 [Note] WSREP: Requesting state transfer: success, donor: 1
8:09:31 12986 [Note] WSREP: (d5b3a4dc, 'tcp://0.0.0.0:4567') turning message relay requesting off
8:16:11 12986 [Warning] WSREP: 1.2 (server-DB3): State transfer to 2.2 (server-DB4) failed: -255 (Unknown error 255)
8:16:11 12986 [ERROR] WSREP: gcs/src/gcs_group.cpp:int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():731: Will never receive state. Need to abort.
8:16:11 12986 [Note] WSREP: gcomm: terminating thread
8:16:11 12986 [Note] WSREP: gcomm: joining thread
8:16:11 12986 [Note] WSREP: gcomm: closing backend
8:16:11 12986 [Note] WSREP: view(view_id(NON_PRIM,12f32c3b,104) memb {
d5b3a4dc,2
} joined {
} left {
} partitioned {
12f32c3b,1
cf874de5,2
})
8:16:11 12986 [Note] WSREP: view((empty))
8:16:11 12986 [Note] WSREP: gcomm: closed
8:16:11 12986 [Note] WSREP: /usr/sbin/mysqld: Terminated.
1505068:16:11 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
WSREP_SST: [INFO] Joiner cleanup. (201505068:16:12.797)
WSREP_SST: [INFO] Joiner cleanup done. (201505068:16:13.329)
my.cnf configuration
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
binlog_format=ROW
bind-address=0.0.0.0
default-storage-engine=innodb
sysdate-is-now=1
expire-logs-days=14
innodb-flush-method=O_DIRECT
innodb-log-files-in-group =2
innodb-log-file-size=256M
innodb_log_buffer_size=96M
innodb-file-per-table=1
innodb-buffer-pool-size=6G
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
innodb_read_io_threads=4
innodb_write_io_threads=4
innodb_io_capacity=200
innodb_doublewrite=1
innodb_sched_priority_cleaner=5
innodb_sched_prio=39
max_sp_recursion_depth=255
group_concat_max_len=4294967295
lower_case_table_names=1
max_allowed_packet=1073741824
server_id=4
tmp-table-size=32M
max-heap-table-size=32M
query-cache-type=0
query-cache-size=0
max-connections =500
thread-cache-size=50
open-files-limit=65535
table-definition-cache=4096
table-open-cache=2048
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_provider_options="evs.version=1; gmcast.segment=2; gcache.size=5G; gcache.page_size=1G; evs.join_retrans_period=PT1S; evs.keepalive_period=PT2S; evs.inactive_check_period=PT10S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M; evs.send_window=1024; evs.user_send_window=512;"
wsrep_cluster_name="Galera_Cluster"
wsrep_cluster_address="gcomm://192.168.10.23,192.168.10.22,192.168.10.21,192.168.12.10"
wsrep_sst_method=rsync_wan
wsrep_node_name=servers-DB4
wsrep_node_address=192.168.10.24
wsrep_sst_auth=x:y
wsrep_slave_threads=8
wsrep_retry_autocommit=5
wsrep_max_ws_rows=131072
wsrep_max_ws_size=1073741824
wsrep_causal_reads=0
wsrep_certify_nonPK=1
Please can you advise whether this is caused by my config, or whether there is a bug I need to report?