Analysing the cause of a replication-related crash

Hello

Our 3-node Percona XtraDB 5.7.35 cluster has recently crashed with the following log messages.

First on both backup nodes:

2022-08-27T03:02:22.497450Z 1 
[ERROR] Slave SQL: Could not execute Write_rows event on table xxxx; 
Corrupted replication event was detected, Error_code: 1610; 
handler error No Error!; the event's master log FIRST, end_log_pos 37597, 
Error_code: 1610

2022-08-27T03:02:22.497508Z 1 
[Warning] WSREP: RBR event 8 Write_rows apply warning: 1610, 189869288

2022-08-27T03:02:22.777730Z 1 
[Warning] WSREP: Failed to apply app buffer: seqno: 189869288, status: 1

and right after that on the primary:

2022-08-27T03:02:23.004267Z 5563960 
[ERROR] WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
03:02:24 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://jira.percona.com/projects/PXC/issues

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=110
max_threads=152
thread_count=12
connection_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68421 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
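As a rough sanity check, the server's worst-case memory estimate can be recomputed from the variables it printed. Note that sort_buffer_size does not appear in the report, so the sketch below assumes the MySQL default of 256 KB; the actual setting was evidently somewhat larger, since the server reported 68421 K.

```python
# Recompute mysqld's worst-case memory estimate from the crash report.
key_buffer_size = 8388608    # from the report
read_buffer_size = 131072    # from the report
max_threads = 152            # from the report
sort_buffer_size = 262144    # NOT in the report; assumed MySQL default (256 KB)

estimate_bytes = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
print(estimate_bytes // 1024, "K")  # 66560 K, close to the reported 68421 K
```

The small gap between 66560 K and the reported 68421 K suggests the cluster's sort_buffer_size was set slightly above the default; either way, the estimate is far too small for memory pressure to explain the crash.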


Build ID: d071f84230e9b4d1b0abc1cc3b1ccb4c8ee252ab

Server Version: 5.7.35-38-57-log Percona XtraDB Cluster (GPL), Release rel38,
Revision 5c9f299, WSREP version 31.53, wsrep_31.53

Thread pointer: 0x7f631dcb4000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f66864fecf0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf7b95b]
/usr/sbin/mysqld(handle_fatal_signal+0x505)[0xd8d245]
/lib64/libpthread.so.0(+0xf630)[0x7f676240e630]
/lib64/libc.so.6(gsignal+0x37)[0x7f676050b387]
/lib64/libc.so.6(abort+0x148)[0x7f676050ca78]
/usr/lib64/galera3/libgalera_smm.so(+0x1be001)[0x7f675d994001]
/usr/lib64/galera3/libgalera_smm.so(+0x1b1791)[0x7f675d987791]
/usr/lib64/galera3/libgalera_smm.so(+0x1c78c6)[0x7f675d99d8c6]
/usr/sbin/mysqld[0xd97889]
/usr/sbin/mysqld(_Z15ha_rollback_lowP3THDb+0xed)[0x7c034d]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG8rollbackEP3THDb+0x101)[0xf168b1]
/usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0x66)[0x7c0546]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDbb+0x2fa)[0x7c0bba]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x3c)[0xd58dcc]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THDb+0x962)[0xca0632]
/usr/sbin/mysqld(_Z11mysql_parseP3THDP12Parser_stateb+0x505)[0xca8145]
/usr/sbin/mysqld[0xca82cd]
/usr/sbin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0x2784)[0xcab3b4]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x172)[0xcabad2]
/usr/sbin/mysqld(handle_connection+0x2c8)[0xd70db8]
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xf943d4]
/lib64/libpthread.so.0(+0x7ea5)[0x7f6762406ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f67605d3b0d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f631dced030): is an invalid pointer
Connection ID (thread ID): 5563960
Status: NOT_KILLED

We are having a hard time analysing the cause of this crash, because it is unclear what exactly went wrong.

  • What’s “error code 1610”?
  • What does “no such a transition COMMITTING → ROLLED_BACK” mean?
  • What could have led to these two conditions?
  • What could we investigate further to determine the cause?

From the MySQL manual:

  • Error number: 1610; Symbol: ER_SLAVE_CORRUPT_EVENT; SQLSTATE: HY000; Message: Corrupted replication event was detected

Are you using stored routines/triggers/prepared statements? You have a rather generic error message, so it is difficult to answer your remaining questions. Can you provide the exact steps that led to the crash? Is it repeatable?


Thanks for your reaction, Matthew!

This is the first time we have encountered these two particular conditions. The table mentioned in the first error message and the timestamp point to a process that runs every day at that time, including this morning after the cluster was restored. So it doesn't seem easily repeatable; that one query alone isn't enough to trigger the problem.

Could we have used the information in the error message (the event's master log FIRST, end_log_pos 37597) to investigate the exact events that led to the crash? I realize that it is probably too late now, but is there anything we could have done prior to restoring the cluster in order to investigate the crash later?
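On the end_log_pos question: in principle, yes. Had the binary/relay logs been preserved before the rebuild, mysqlbinlog --verbose --base64-output=decode-rows could have dumped the events around position 37597 in a readable form. As an illustration of what to look for in such a dump (the excerpt below is fabricated; the table name, server id, and positions are invented), a short script can pick out the event that ends at the reported position:

```python
import re

# Fabricated excerpt of `mysqlbinlog --verbose` output, for illustration only.
dump = """\
# at 37203
#220827  3:02:22 server id 1  end_log_pos 37264  Table_map: `mydb`.`xxxx` mapped to number 108
# at 37264
#220827  3:02:22 server id 1  end_log_pos 37597  Write_rows: table id 108 flags: STMT_END_F
"""

def events_ending_at(text, pos):
    """Return the event descriptions whose end_log_pos matches pos."""
    pattern = re.compile(r"end_log_pos (\d+)\s+(.*)")
    hits = []
    for line in text.splitlines():
        m = pattern.search(line)
        if m and int(m.group(1)) == pos:
            hits.append(m.group(2))
    return hits

print(events_ending_at(dump, 37597))  # ['Write_rows: table id 108 flags: STMT_END_F']
```

In a real investigation you would match the end_log_pos from the error against the dump of the actual log file and inspect the decoded row images of the offending Write_rows event, so copying the binary logs aside before restoring the cluster is worthwhile.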


Yes to prepared statements: our products use a database framework that uses prepared statements for all queries.


Without knowing more specifics, it sounds like you may have been the victim of bad timing, where several things happened at the same time and caused this issue: the trigger and the prepared statement within the procedure.

Unfortunately without a reproducible test case, our engineers won’t be able to do much in terms of isolating the cause of the crash.
