Percona XtraDB sometimes crashes during shutdown

Over the last couple of weeks, we sometimes experience crashes during node shutdown (ie reboot of the server). This is our setup:

  • Debian 12
  • percona-xtradb-cluster-server package version 1:8.0.37-29-1.bookworm
  • 2 Data Nodes
  • 1 garbd Node

This is in mysql.log from yesterdays shutdown:

2025-01-27T11:41:19.781740Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.37-29.1).
2025-01-27T11:41:19.781795Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN
2025-01-27T11:41:29.785339Z 0 [Note] [MY-000000] [WSREP] Shutdown replication
2025-01-27T11:41:29.785857Z 0 [Note] [MY-000000] [WSREP] Server status change synced -> disconnecting
2025-01-27T11:41:29.785885Z 0 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-01-27T11:41:29.786258Z 638198 [ERROR] [MY-000000] [Galera] FSM: no such a transition APPLYING -> COMMITTED
2025-01-27T11:41:29.786274Z 638213 [ERROR] [MY-000000] [Galera] FSM: no such a transition APPLYING -> COMMITTED
2025-01-27T11:41:29.786313Z 638198 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2025-01-27T11:41:29Z UTC - mysqld got signal 6 ;
2025-01-27T11:41:29.786362Z 0 [Note] [MY-000000] [Galera] Closing send monitor...

Neither the other data node nor the garbd node log anything in that time.

After the reboot, mysql is not started automatically. This is the journal:

Jan 27 12:41:41 <hostname> systemd[1]: Starting mysql.service - Percona XtraDB Cluster...
Jan 27 12:41:41 <hostname> mysql-systemd[808]:  WARNING: Node has been rebooted, /data/mysql/data//grastate.dat: seqno = -1, mysql service has not>
Jan 27 12:41:41 <hostname> systemd[1]: mysql.service: Control process exited, code=exited, status=1/FAILURE
Jan 27 12:41:41 <hostname> mysql-systemd[853]:  WARNING: mysql pid file /var/run/mysqld/mysqld.pid empty or not readable
Jan 27 12:41:41 <hostname> mysql-systemd[853]:  WARNING: mysql may be already dead
Jan 27 12:41:41 <hostname> systemd[1]: mysql.service: Failed with result 'exit-code'.
Jan 27 12:41:41 <hostname> systemd[1]: Failed to start mysql.service - Percona XtraDB Cluster.

As far as I understand, the -1 in the grastate file is a result of the crash during shutdown and that it is correct to not just start mysql in that state. However, after 5 minutes, the service can just be started via systemctl start mysql. It starts up just fine, joins the cluster and everything becomes healthy.

Can someone shed some light on what might be going on here?

We already made sure that mysql is given enough time to stop (we increased the stop timeout in the systemd unit), but we do not see the shutdown take as long as the shutdown timeout. So this does not seem to be the issue.

We also increased redo log size, since, during earlier failures of this kind, we noticed log lines like

[MY-013865] [InnoDB] Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity.

The increase made the redo log messages go away, but the crashes during shutdown keep happening.

Let me know if you need any more information, I’ll supply them as soon as possible.

Hi Dominik_Klein,

With the information provided is not possible to know what might be happening.

Considering that you are on 8.0.37 can you consider upgrading to latest pxc 8.0.40 and check if that fixes the problem?

You might also need to check dmesg in case there are any other system errors.

Regards

Hi CTutte

I upgraded the cluster to 8.0.40 and will report back if I experience the problem again. Please let me know what kind of information you would need to debug this further.

Regards

Hi again

the update did not change anything for me. The error happened again.

I am able to reproduce the problem somewhat reliably now (doesn’t occur 100% of the time, but about every other time). It seems to happen if the server is rebooted under load.

My setup to reproduce is to run sysbench towards data node 1 and then, while sysbench is still running, reboot the machine.

sysbench --tables=100 --table_size=100000 --time=600 --mysql-user=root --mysql-db=sysbench --mysql-ignore-errors=all oltp_write_only run

This reproduction gives more log. While the production system log ends at

Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

the test system does manage to produce the backtrace. Here it is:

2025-02-05T09:38:32.499845Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.40-31.1).
2025-02-05T09:38:32.499883Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN
2025-02-05T09:38:42.507635Z 0 [Note] [MY-000000] [WSREP] Shutdown replication
2025-02-05T09:38:42.508228Z 0 [Note] [MY-000000] [WSREP] Server status change synced -> disconnecting
2025-02-05T09:38:42.508388Z 0 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-02-05T09:38:42.508922Z 0 [Note] [MY-000000] [Galera] Closing send monitor...
2025-02-05T09:38:42.509037Z 0 [Note] [MY-000000] [Galera] Closed send monitor.
2025-02-05T09:38:42.509226Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
2025-02-05T09:38:42.509248Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread
2025-02-05T09:38:42.509873Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend
2025-02-05T09:38:42.511036Z 553 [ERROR] [MY-000000] [Galera] FSM: no such a transition APPLYING -> COMMITTED
2025-02-05T09:38:42.511080Z 553 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2025-02-05T09:38:42Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=924bf8212e79f806f39a524ce1be6bbbbee2b634
Server Version: 8.0.40-31.1 Percona XtraDB Cluster (GPL), Release rel31, Revision 4b32153, WSREP version 26.1.4.3, wsrep_26.1.4.3

Thread pointer: 0x7f2a441c85d0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2025-02-05T09:38:42.511987Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,989bf1a3-8ada,139)
memb {
        989bf1a3-8ada,0
        }
joined {
        }
left {
        }
partitioned {
        9a535999-ab28,0
        d99e331e-be18,0
        }
)
2025-02-05T09:38:42.512211Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2025-02-05T09:38:42.512224Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2025-02-05T09:38:42.512553Z 0 [Note] [MY-000000] [Galera] gcomm: closed
2025-02-05T09:38:42.512624Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2025-02-05T09:38:42.512836Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [100, 100]
2025-02-05T09:38:42.512847Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2025-02-05T09:38:42.512854Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 7867188)
2025-02-05T09:38:42.512864Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2025-02-05T09:38:42.512876Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2025-02-05T09:38:42.512881Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2025-02-05T09:38:42.512887Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 7867188)
2025-02-05T09:38:42.512897Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2025-02-05T09:38:42.513248Z 0 [Note] [MY-000000] [Galera] recv_thread() joined.
2025-02-05T09:38:42.513271Z 0 [Note] [MY-000000] [Galera] Closing send queue.
2025-02-05T09:38:42.513277Z 0 [Note] [MY-000000] [Galera] Closing receive queue.
2025-02-05T09:38:42.513400Z 11 [Note] [MY-000000] [Galera] Maybe drain monitors from 7867187 upto current CC event 7867188 upto:7867188
2025-02-05T09:38:42.513412Z 11 [Note] [MY-000000] [Galera] Drain monitors from 7867187 up to 7867188
stack_bottom = 7f2a31ffabc0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x56213043fa9e]
/usr/sbin/mysqld(print_fatal_signal(int)+0x29a) [0x56212f526a5a]
/usr/sbin/mysqld(handle_fatal_signal+0xd7) [0x56212f526c27]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x7f2b34a5b050]
/lib/x86_64-linux-gnu/libc.so.6(+0x8aebc) [0x7f2b34aa9ebc]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12) [0x7f2b34a5afb2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f2b34a45472]
/usr/lib/galera4/libgalera_smm.so(+0x54fc2) [0x7f2b28054fc2]
/usr/lib/galera4/libgalera_smm.so(+0x71ffd) [0x7f2b28071ffd]
/usr/lib/galera4/libgalera_smm.so(+0x57215) [0x7f2b28057215]
/usr/sbin/mysqld(wsrep::wsrep_provider_v26::commit_order_leave(wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::mutable_buffer const&)+0x6d) [0x562130d760ad]
/usr/sbin/mysqld(wsrep::transaction::ordered_commit()+0xb9) [0x562130d6e439]
/usr/sbin/mysqld(trx_commit_complete_for_mysql(trx_t*)+0x4e) [0x56213066fcfe]
/usr/sbin/mysqld(+0x233a583) [0x5621304a6583]
/usr/sbin/mysqld(ha_commit_low(THD*, bool, bool)+0x496) [0x56212f0a0306]
/usr/sbin/mysqld(trx_coordinator::commit_in_engines(THD*, bool, bool)+0x2d) [0x56212f4d304d]
/usr/sbin/mysqld(wsrep_thd_binlog_commit(THD*, bool)+0x4c) [0x56213009695c]
/usr/sbin/mysqld(ha_commit_trans(THD*, bool, bool)+0x5b9) [0x56212f0a1d39]
/usr/sbin/mysqld(trans_commit(THD*, bool)+0x4b) [0x56212f4d622b]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0x2903) [0x56212f3a7423]
/usr/sbin/mysqld(Prepared_statement::execute(THD*, String*, bool)+0x64d) [0x56212f3d9ead]
/usr/sbin/mysqld(Prepared_statement::execute_loop(THD*, String*, bool)+0x125) [0x56212f3de775]
/usr/sbin/mysqld(mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x1c4) [0x56212f3dee64]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x22a6) [0x56212f3ae046]
/usr/sbin/mysqld(do_command(THD*)+0x297) [0x56212f3afc57]
/usr/sbin/mysqld(+0x13ab080) [0x56212f517080]
/usr/sbin/mysqld(+0x2735f40) [0x5621308a1f40]
/lib/x86_64-linux-gnu/libc.so.6(+0x891c4) [0x7f2b34aa81c4]
/lib/x86_64-linux-gnu/libc.so.6(+0x10985c) [0x7f2b34b2885c]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f2a443e56e8): COMMIT
Connection ID (thread ID): 553
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Any idea on what to do next?

Thanks
Dominik