PXC Node Crashed

One of the nodes in our 3-node PXC cluster went down with the error trace below. From the error, I suspect the node was not able to apply the replicated trx. Is this a bug?

PXC: 5.6.35-80.0-56-log, running on CentOS 7 (3.10.0-514.10.2.el7.x86_64)

2018-06-20 08:43:28 26088 [Warning] WSREP: failed to replay trx: source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: EXECUTING flags: 0 conn_id: -1 trx_id: 2214966172 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529477008413707629)
2018-06-20 08:43:28 26088 [Warning] WSREP: Invalid state in replay for trx source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: EXECUTING flags: 0 conn_id: -1 trx_id: 2214966172 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529477008413707629) (FATAL)
at galera/src/replicator_smm.cpp:replay_trx():936
2018-06-20 08:43:28 26088 [ERROR] WSREP: trx_replay failed for: 6, schema: abcabc, query: CALL spc_DA_AggregateData(6, 1529407320, 1529407380)
2018-06-20 08:43:28 26088 [ERROR] Aborting

2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 21074
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 16693
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 36
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 16692
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 16796
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 13900
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 20635
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 16627
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 17
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 21085
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 23586
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 286454
2018-06-20 08:43:30 26088 [Note] WSREP: killing local connection: 286406
2018-06-20 08:43:30 26088 [Note] WSREP: Closing send monitor…
2018-06-20 08:43:30 26088 [Note] WSREP: Closed send monitor.
2018-06-20 08:43:30 26088 [Note] WSREP: gcomm: terminating thread
2018-06-20 08:43:30 26088 [Note] WSREP: gcomm: joining thread
2018-06-20 08:43:30 26088 [Note] WSREP: gcomm: closing backend
2018-06-20 08:43:30 26088 [Warning] WSREP: failed to replay trx: source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: ABORTING flags: 1 conn_id: 286250 trx_id: 2214966262 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529477010472257755)
2018-06-20 08:43:30 26088 [Warning] WSREP: Invalid state in replay for trx source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: ABORTING flags: 1 conn_id: 286250 trx_id: 2214966262 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529477010472257755) (FATAL)
at galera/src/replicator_smm.cpp:replay_trx():936
2018-06-20 08:43:30 26088 [ERROR] WSREP: trx_replay failed for: 6, schema: abcabc, query: CALL spc_DA_UpdateAggregationJobStatus(71,18484,1)
2018-06-20 08:43:30 26088 [ERROR] Aborting

2018-06-20 08:43:31 26088 [Warning] Aborted connection 286454 to db: 'abcabc' user: 'abcabc' host: '10.1.3.101' (Deadlock found when trying to get lock; try restarting transacti)
2018-06-20 08:43:32 26088 [Note] WSREP: killing local connection: 286507
2018-06-20 08:43:32 26088 [Note] WSREP: killing local connection: 286564
2018-06-20 08:43:32 26088 [Note] WSREP: killing local connection: 286568
2018-06-20 08:43:32 26088 [Note] WSREP: killing local connection: 286580
2018-06-20 08:43:32 26088 [Note] WSREP: Service disconnected.
2018-06-20 08:43:32 26088 [Note] WSREP: rollbacker thread exiting
2018-06-20 08:43:33 26088 [Note] WSREP: (7ee807fc, 'tcp://0.0.0.0:4567') connection to peer 69ddcedb with addr tcp://10.1.3.107:4567 timed out, no messages seen in PT3S
2018-06-20 08:43:33 26088 [Note] WSREP: (7ee807fc, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.3.107:4567
2018-06-20 08:43:33 26088 [Note] WSREP: Some threads may fail to exit.
2018-06-20 08:43:33 26088 [Note] Binlog end
06:43:33 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=212
max_threads=302
thread_count=221
connection_count=212
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 136709 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0x2ce0c20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 7fcfad2f6cf0 thread_stack 0x40000

Do you have a reproducible test case?

Generally, there shouldn't be any sane reason for a trx replay to fail.

One of the nodes crashed again; this time I have the full stack trace.

2018-06-20 13:35:07 3029 [Warning] WSREP: failed to replay trx: source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: EXECUTING flags: 0 conn_id: -1 trx_id: 2218293683 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529494507636648148)
2018-06-20 13:35:07 3029 [Warning] WSREP: Invalid state in replay for trx source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: EXECUTING flags: 0 conn_id: -1 trx_id: 2218293683 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529494507636648148) (FATAL)
at galera/src/replicator_smm.cpp:replay_trx():936
2018-06-20 13:35:07 3029 [ERROR] WSREP: trx_replay failed for: 6, schema: abcabc, query: CALL spc_DA_AggregateRawLdsData(6, 1529426160, 1529426220)
2018-06-20 13:35:07 3029 [ERROR] Aborting

2018-06-20 13:35:09 3029 [Note] WSREP: Closing send monitor…
2018-06-20 13:35:09 3029 [Note] WSREP: Closed send monitor.
2018-06-20 13:35:09 3029 [Note] WSREP: gcomm: terminating thread
2018-06-20 13:35:09 3029 [Note] WSREP: gcomm: joining thread
2018-06-20 13:35:09 3029 [Note] WSREP: gcomm: closing backend
2018-06-20 13:35:09 3029 [Warning] Aborted connection 1672081 to db: 'abcabc' user: 'abcabc' host: '10.1.3.101' (Deadlock found when trying to get lock; try restarting transacti)
2018-06-20 13:35:10 3029 [Warning] Aborted connection 1672089 to db: 'abcabc' user: 'abcabc' host: '10.1.3.101' (Deadlock found when trying to get lock; try restarting transacti)
2018-06-20 13:35:13 3029 [Note] WSREP: (7ee807fc, 'tcp://0.0.0.0:4567') connection to peer 69ddcedb with addr tcp://10.1.3.107:4567 timed out, no messages seen in PT3S
2018-06-20 13:35:13 3029 [Note] WSREP: (7ee807fc, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.1.3.107:4567
2018-06-20 13:35:14 3029 [Note] WSREP: (7ee807fc, 'tcp://0.0.0.0:4567') reconnecting to 69ddcedb (tcp://10.1.3.107:4567), attempt 0
2018-06-20 13:35:15 3029 [Note] WSREP: evs::proto(7ee807fc, LEAVING, view_id(REG,69ddcedb,6)) suspecting node: 69ddcedb
2018-06-20 13:35:15 3029 [Note] WSREP: evs::proto(7ee807fc, LEAVING, view_id(REG,69ddcedb,6)) suspected node without join message, declaring inactive
2018-06-20 13:35:15 3029 [Note] WSREP: view(view_id(NON_PRIM,69ddcedb,6) memb {
7ee807fc,0
} joined {
} left {
} partitioned {
69ddcedb,0
})
2018-06-20 13:35:15 3029 [Note] WSREP: view((empty))
2018-06-20 13:35:15 3029 [Note] WSREP: gcomm: closed
2018-06-20 13:35:15 3029 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-06-20 13:35:15 3029 [Note] WSREP: Flow-control interval: [16, 16]
2018-06-20 13:35:15 3029 [Note] WSREP: Received NON-PRIMARY.
2018-06-20 13:35:15 3029 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 255865376)
2018-06-20 13:35:15 3029 [Note] WSREP: Received self-leave message.
2018-06-20 13:35:15 3029 [Note] WSREP: Flow-control interval: [0, 0]
2018-06-20 13:35:15 3029 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2018-06-20 13:35:15 3029 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 255865376)
2018-06-20 13:35:15 3029 [Note] WSREP: RECV thread exiting 0: Success
2018-06-20 13:35:15 3029 [Note] WSREP: recv_thread() joined.
2018-06-20 13:35:15 3029 [Note] WSREP: Closing replication queue.
2018-06-20 13:35:15 3029 [Note] WSREP: Closing slave action queue.
2018-06-20 13:35:15 3029 [Note] WSREP: Service disconnected.
2018-06-20 13:35:15 3029 [Note] WSREP: rollbacker thread exiting
2018-06-20 13:35:16 3029 [Note] WSREP: Some threads may fail to exit.
2018-06-20 13:35:16 3029 [Note] Binlog end

2018-06-20 13:35:16 3029 [Note] InnoDB: FTS optimize thread exiting.
2018-06-20 13:35:16 3029 [Note] InnoDB: Starting shutdown…
11:35:16 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=99
max_threads=302
thread_count=13
connection_count=5
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 136709 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0xaa05730
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 7fcbaaf93cf0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0x911c0b]
/usr/sbin/mysqld(handle_fatal_signal+0x471)[0x67de01]
/usr/lib64/libpthread.so.0(+0xf370)[0x7fd87966a370]
/usr/sbin/mysqld[0x6a32ca]
/usr/sbin/mysqld[0x6a40f3]
/usr/sbin/mysqld[0x6a436e]
/usr/sbin/mysqld[0x690829]
/usr/sbin/mysqld[0x68fd1c]
/usr/sbin/mysqld(_Z16acl_authenticateP3THDj+0x1e8)[0x6a4568]
/usr/sbin/mysqld[0x6d4d73]
/usr/sbin/mysqld(_Z16login_connectionP3THD+0x54)[0x6d6924]
/usr/sbin/mysqld(_Z22thd_prepare_connectionP3THD+0x19)[0x6d73c9]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x153)[0x6d7573]
/usr/sbin/mysqld(handle_one_connection+0x40)[0x6d77d0]
/usr/sbin/mysqld(pfs_spawn_thread+0x146)[0x94a616]
/usr/lib64/libpthread.so.0(+0x7dc5)[0x7fd879662dc5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7fd8778b173d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1672210
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Hello Krunal,

The node is crashing while calling the same SP. This SP creates a couple of temp tables and finally does a bulk insert of records (the bulk size can be anywhere between 1500 and 5000 rows). What does this error ('failed to replay trx') actually mean? This SP gets fired every minute. Suspecting the transaction size, I have asked the dev team to limit the bulk size to 1000 rows and loop over the inserts.
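Roughly, the change I have asked them to make looks like the sketch below (the table and column names here are invented for illustration; the real SP is more involved):

-- Illustrative sketch only: move rows from the temp table in chunks of 1000
-- so each statement (and writeset) stays small. Object names are invented.
DELIMITER //
CREATE PROCEDURE spc_example_chunked_insert()
BEGIN
    DECLARE v_rows INT DEFAULT 1;
    WHILE v_rows > 0 DO
        INSERT INTO agg_data (device_id, slot_start, metric_value)
        SELECT device_id, slot_start, metric_value
        FROM tmp_agg
        ORDER BY device_id, slot_start
        LIMIT 1000;
        SET v_rows = ROW_COUNT();
        -- Drop the rows just copied so the next pass picks up the rest.
        DELETE FROM tmp_agg
        ORDER BY device_id, slot_start
        LIMIT 1000;
    END WHILE;
END //
DELIMITER ;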

Regards,
Raghu

What does the error 'failed to replay trx' mean? Since we reduced the above-mentioned batch size to 1000, it is somewhat stable. I am still not sure what this error means. Can anyone help me understand it?

This error indicates that the statement (in this case the SP) is being retried because PXC detected a conflict (updating the same rows, inserting rows with the same key, etc.).

The default value of wsrep_retry_autocommit is 1, so if an autocommit statement fails due to a Galera conflict, the statement is automatically retried (and this error is produced when the retry also fails).
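If it helps, you can check how many retries are configured on a node like this (and raise the value if you want more automatic retries before the error is surfaced; the 4 below is just an example, not a recommendation):

-- Check how many automatic retries of autocommit statements are allowed (default 1):
SHOW GLOBAL VARIABLES LIKE 'wsrep_retry_autocommit';
-- To allow more retries, set e.g. wsrep_retry_autocommit = 4 under [mysqld]
-- in my.cnf and restart the node.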

Here's some information from the galera docs:
http://galeracluster.com/documentation-webpages/dealingwithmultimasterconflicts.html

Kenn, thank you for the info.

So, does this error make the cluster node go down? The SP in question deletes any existing records for the KEY before performing the insert (i.e. it deletes the records first and only then inserts), so normally this KEY conflict should not happen.

In our setup we write to all the nodes, so I suspect this SP got called milliseconds apart on two nodes: the delete ran on node 1, then the delete on node 2, then the node 1 insert and the node 2 insert respectively. In that case the last insert (the node 2 insert) will fail on the KEY because the data is already present. But in standalone MySQL such an insert would simply fail with a key-violation error; it would not bring the database service down.
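To make the scenario concrete, what the SP does per aggregation slot is roughly the following (table and column names are invented; the real procedure is more complex):

-- Simplified illustration of the delete-then-insert pattern for one slot:
START TRANSACTION;
DELETE FROM agg_data WHERE device_id = 6 AND slot_start = 1529407320;
INSERT INTO agg_data (device_id, slot_start, metric_value)
VALUES (6, 1529407320, 42);
COMMIT;
-- If two nodes run this for the same key a few milliseconds apart, one of the
-- two transactions is aborted by the cluster and that client sees
-- "Deadlock found when trying to get lock" (error 1213) rather than a
-- duplicate-key error.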

Is there any difference in PXC's behaviour in such cases?

Actually, looking at the error, PXC is in a state it shouldn't be in:

WSREP: Invalid state in replay for trx source: 7ee807fc-73dc-11e8-8169-fa8168816846 version: 3 local: 1 state: EXECUTING flags: 0 conn_id: -1 trx_id: 2214966172 seqnos (l: -1, g: -1, s: -1, d: -1, ts: 1529477008413707629) (FATAL)

This then generates an error; however, because the error is unexpected, the node aborts to avoid possible data inconsistencies.

There was a bug with conflict handling in stored procedures that was fixed recently. I'm not sure whether it fixes this specific issue, but the fix is included in PXC 5.6.40.
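You can confirm the exact version each node is currently running with:

-- Check the running server version on each node:
SELECT @@version, @@version_comment;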

Does your stored procedure have a DECLARE…HANDLER statement?
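For reference, I mean a construct along these lines inside the procedure (purely illustrative; the names are made up):

DELIMITER //
CREATE PROCEDURE spc_example_with_handler()
BEGIN
    DECLARE v_deadlocked INT DEFAULT 0;
    -- A handler that catches "Deadlock found when trying to get lock" (error 1213)
    -- and lets the procedure continue instead of propagating the error:
    DECLARE CONTINUE HANDLER FOR 1213 SET v_deadlocked = 1;

    -- ... the statements that may be bf-aborted would go here ...
    SELECT v_deadlocked AS saw_deadlock;
END //
DELIMITER ;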

The stored procedure in question does not have a DECLARE…HANDLER statement. Would having one be of any benefit in these kinds of cluster-related scenarios?