Percona cluster node won't join after upgrade

I made an upgrade from 8.0.36 to 8.0.40. I did a rolling upgrade, upgraded one node, restarted mysql, rinse and repeat.

On the last node, I shut it down with systemctl stop mysqld. Shutdown was graceful:

2025-02-18T11:56:40.356460Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.36-28.1).
2025-02-18T11:56:40.356499Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN
2025-02-18T11:56:50.363267Z 0 [Note] [MY-000000] [WSREP] Shutdown replication
2025-02-18T11:56:54.215865Z 0 [Note] [MY-013072] [InnoDB] Starting shutdown...
2025-02-18T11:56:57.756889Z 0 [Note] [MY-012980] [InnoDB] Shutdown completed; log sequence number 11051296464304
2025-02-18T11:56:57.759992Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.36-28.1)  Percona XtraDB Cluster (GPL), Release rel28, Revision bfb687f, WSREP version 26.1.4.3.

I then upgraded all percona packages and started the node. Because it was offline a little longer, it needed to do SST. All went smoothly, but after node join and start of queue processing it crashed.

2025-02-18T12:06:11.553625Z 3 [System] [MY-000000] [WSREP] SST completed
....
2025-02-18T12:06:11.555880Z 0 [Note] [MY-000000] [Galera] 1.0 (mysql-db1a): State transfer from 0.0 (mysql-db1c) complete.
2025-02-18T12:06:11.555894Z 0 [Note] [MY-000000] [Galera] SST leaving flow control
2025-02-18T12:06:11.555900Z 0 [Note] [MY-000000] [Galera] Shifting JOINER -> JOINED (TO: 810374888)
2025-02-18T12:06:11.555926Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 0.0% (0/30965 events) complete.
...
025-02-18T12:06:11.556853Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344469,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.557589Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344597,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.558310Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344725,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.559092Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344853,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:13.066326Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 100.0% (31096/31096 events) complete.
2025-02-18T12:06:13.066336Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 810375016)
...
2025-02-18T12:06:13.142489Z 0 [Note] [MY-000000] [Galera] Deleted page /var/lib/mysql/gcache.page.000005
2025-02-18T12:06:13.152119Z 16 [Note] [MY-000000] [Galera] Server mysql-db1a synced with group
2025-02-18T12:06:13.152141Z 16 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-02-18T12:06:13.152147Z 16 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-02-18T12:06:13.152167Z 16 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
...
2025-02-18T12:06:19.485990Z 37 [ERROR] [MY-000000] [Galera] Unknown writeset version: -1
2025-02-18T12:06:19.486026Z 37 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2025-02-18T12:06:19Z UTC - mysqld got signal 6 ;
Thread pointer: 0x7fa25c0bf370
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb4a67f4bd0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x16dce41]
/usr/sbin/mysqld(print_fatal_signal(int)+0x3cc) [0xe3e0bc]
/usr/sbin/mysqld(handle_fatal_signal+0xe7) [0xe3e2e7]
/lib64/libc.so.6(+0x3e6f0) [0x7fb4be43e6f0]
/lib64/libc.so.6(+0x8b94c) [0x7fb4be48b94c]
/lib64/libc.so.6(raise+0x16) [0x7fb4be43e646]
/lib64/libc.so.6(abort+0xd3) [0x7fb4be4287f3]
/usr/lib64/galera4/libgalera_smm.so(+0x195ec) [0x7fb4b1c195ec]
/usr/lib64/galera4/libgalera_smm.so(+0x1981cd) [0x7fb4b1d981cd]
/usr/lib64/galera4/libgalera_smm.so(+0x54721) [0x7fb4b1c54721]
/usr/sbin/mysqld(wsrep::wsrep_provider_v26::append_key(wsrep::ws_handle&, wsrep::key const&)+0xaf) [0x1ec606f]
/usr/sbin/mysqld(wsrep::transaction::append_key(wsrep::key const&)+0x73) [0x1e6f7d3]
/usr/sbin/mysqld(wsrep_thd_append_key+0x105) [0xe87465]
/usr/sbin/mysqld() [0x234ac62]
/usr/sbin/mysqld(ha_innobase::wsrep_append_keys(THD*, Wsrep_service_key_type, unsigned char const*, unsigned char const*)+0x3dd) [0x1726ffd]
/usr/sbin/mysqld(ha_innobase::update_row(unsigned char const*, unsigned char*)+0x2b8) [0x1728518]
/usr/sbin/mysqld(handler::ha_update_row(unsigned char const*, unsigned char*)+0x1a6) [0x9f02e6]
/usr/sbin/mysqld(Sql_cmd_update::update_single_table(THD*)+0x1d3c) [0xdc09dc]
/usr/sbin/mysqld(Sql_cmd_dml::execute(THD*)+0x672) [0xd2b162]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0xf60) [0xcfc170]
/usr/sbin/mysqld(Prepared_statement::execute(THD*, String*, bool)+0x597) [0xd1daa7]
/usr/sbin/mysqld(Prepared_statement::execute_loop(THD*, String*, bool)+0x149) [0xd2e6f9]
/usr/sbin/mysqld(mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x145) [0xd2ee25]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xd87) [0xd02467]
/usr/sbin/mysqld(do_command(THD*)+0x32e) [0xcd0bbe]
/usr/sbin/mysqld() [0xe3f338]
/usr/sbin/mysqld() [0x1a7fe96]
/lib64/libc.so.6(+0x89c02) [0x7fb4be489c02]
/lib64/libc.so.6(+0x10ec40) [0x7fb4be50ec40]

Logs on the donor:

2025-02-18T12:05:55.771760Z 0 [Note] [MY-000000] [Galera] 0.0 (mysql--db1c): State transfer to 1.0 (mysql--db1a) complete.
2025-02-18T12:05:55.771786Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 810371905)
2025-02-18T12:05:55.771822Z 0 [Note] [MY-000000] [Galera] Processing event queue:... -nan% (0/0 events) complete.
2025-02-18T12:05:55.772000Z 0 [Note] [MY-000000] [Galera] Member 0.0 (mysql--db1c) synced with group.
2025-02-18T12:05:55.772017Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 100.0% (1/1 events) complete.
2025-02-18T12:05:55.772022Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 810371905)
2025-02-18T12:05:55.772050Z 2 [Note] [MY-000000] [Galera] Server mysql--db1c synced with group
2025-02-18T12:05:55.772063Z 2 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-02-18T12:05:55.772069Z 2 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-02-18T12:05:55.772073Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-02-18T12:06:06.597235Z 0 [Note] [MY-000000] [Galera] Created page /var/lib/mysql/gcache.page.000005 of size 134217728 bytes
2025-02-18T12:06:11.554435Z 0 [ERROR] [MY-000000] [Galera] async IST sender failed to serve ssl://xxx.yyy.14.100:4568: ist send failed: ', asio error 'Unrecognized writeset version: -1
	 at ../../../../percona-xtradb-cluster-galera/galera/src/write_set_ng.hpp:version():98'
	 at ../../../../percona-xtradb-cluster-galera/galera/src/ist.cpp:send():1003
2025-02-18T12:06:11.554520Z 0 [Note] [MY-000000] [Galera] IST sender finished. Unlocking GCache.
2025-02-18T12:06:11.554669Z 0 [Note] [MY-000000] [Galera] async IST sender served
2025-02-18T12:06:11.555549Z 0 [Note] [MY-000000] [Galera] 1.0 (mysql--db1a): State transfer from 0.0 (mysql--db1c) complete.
...
Lots of LAST message 00000000-0000-0000-0000-000000000000:810374457,0
...
2025-02-18T12:06:19.754224Z 0 [Note] [MY-000000] [Galera] (4c4fa339-84ee, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://xxx.yyy.14.100:4567
2025-02-18T12:06:21.052376Z 0 [Note] [MY-000000] [Galera] (4c4fa339-84ee, 'ssl://0.0.0.0:4567') reconnecting to 54a63c1d-aa0d (ssl://xxx.yyy.14.100:4567), attempt 0
2025-02-18T12:06:21.052728Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:22.552719Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:24.052851Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:24.498389Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-02-18T12:06:24.498500Z 0 [Note] [MY-000000] [Galera] evs::proto(4c4fa339-84ee, OPERATIONAL, view_id(REG,4c4fa339-84ee,52)) suspecting node: 54a63c1d-aa0d
2025-02-18T12:06:24.498513Z 0 [Note] [MY-000000] [Galera] evs::proto(4c4fa339-84ee, OPERATIONAL, view_id(REG,4c4fa339-84ee,52)) suspected node without join message, declaring inactive
2025-02-18T12:06:24.998587Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
2025-02-18T12:06:25.498746Z 0 [Note] [MY-000000] [Galera] declaring e1becf57-87b9 at ssl://xxx.yyy.14.101:4567 stable
2025-02-18T12:06:25.498891Z 0 [Note] [MY-000000] [Galera] Node 4c4fa339-84ee state primary

and after some time

2025-02-18T12:18:07.110204Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number
2025-02-18T12:21:01.641345Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number
2025-02-18T12:21:14.846499Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number

I search google and this forum for any idea, but can’t find anything. I did the same upgrade in our testing environment without a problem.

I also search for error messages on google but came up empty: Unrecognized writeset version: -1. What I did notice is that seq number on the joining node in grastate.dat is -1.

Any idea how to get my cluster back without full restart?

I deleted grastate.dat and gvwstate.dat. After that, I restarted mysqld on db1a, SST finished and node booted up successfully.

The next time you need to do this, go to another node and set freeze_purge_at_seqno to ‘now’. This will temporarily prevent gcache from being purged while the node is down allowing for IST. Once the node has recovered, be sure to set back to -1 to resume normal gcache purging.

Thank for the hint. This will come in handy on the next upgrade:)