Percona cluster node won't join after upgrade

I upgraded from 8.0.36 to 8.0.40 with a rolling upgrade: upgrade one node, restart mysqld, rinse and repeat.

I shut down the last node with systemctl stop mysqld. The shutdown was graceful:

2025-02-18T11:56:40.356460Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.36-28.1).
2025-02-18T11:56:40.356499Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN
2025-02-18T11:56:50.363267Z 0 [Note] [MY-000000] [WSREP] Shutdown replication
2025-02-18T11:56:54.215865Z 0 [Note] [MY-013072] [InnoDB] Starting shutdown...
2025-02-18T11:56:57.756889Z 0 [Note] [MY-012980] [InnoDB] Shutdown completed; log sequence number 11051296464304
2025-02-18T11:56:57.759992Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.36-28.1)  Percona XtraDB Cluster (GPL), Release rel28, Revision bfb687f, WSREP version 26.1.4.3.

I then upgraded all Percona packages and started the node. Because it had been offline a little longer, it needed a full SST. That went smoothly, but the node crashed shortly after joining and starting to process the event queue.

2025-02-18T12:06:11.553625Z 3 [System] [MY-000000] [WSREP] SST completed
....
2025-02-18T12:06:11.555880Z 0 [Note] [MY-000000] [Galera] 1.0 (mysql-db1a): State transfer from 0.0 (mysql-db1c) complete.
2025-02-18T12:06:11.555894Z 0 [Note] [MY-000000] [Galera] SST leaving flow control
2025-02-18T12:06:11.555900Z 0 [Note] [MY-000000] [Galera] Shifting JOINER -> JOINED (TO: 810374888)
2025-02-18T12:06:11.555926Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 0.0% (0/30965 events) complete.
...
2025-02-18T12:06:11.556853Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344469,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.557589Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344597,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.558310Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344725,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:11.559092Z 0 [Note] [MY-000000] [Galera] LAST message 00000000-0000-0000-0000-000000000000:810344853,0 from another group (00000000-0000-0000-0000-000000000000). Dropping message.
2025-02-18T12:06:13.066326Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 100.0% (31096/31096 events) complete.
2025-02-18T12:06:13.066336Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 810375016)
...
2025-02-18T12:06:13.142489Z 0 [Note] [MY-000000] [Galera] Deleted page /var/lib/mysql/gcache.page.000005
2025-02-18T12:06:13.152119Z 16 [Note] [MY-000000] [Galera] Server mysql-db1a synced with group
2025-02-18T12:06:13.152141Z 16 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-02-18T12:06:13.152147Z 16 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-02-18T12:06:13.152167Z 16 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
...
2025-02-18T12:06:19.485990Z 37 [ERROR] [MY-000000] [Galera] Unknown writeset version: -1
2025-02-18T12:06:19.486026Z 37 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2025-02-18T12:06:19Z UTC - mysqld got signal 6 ;
Thread pointer: 0x7fa25c0bf370
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb4a67f4bd0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x16dce41]
/usr/sbin/mysqld(print_fatal_signal(int)+0x3cc) [0xe3e0bc]
/usr/sbin/mysqld(handle_fatal_signal+0xe7) [0xe3e2e7]
/lib64/libc.so.6(+0x3e6f0) [0x7fb4be43e6f0]
/lib64/libc.so.6(+0x8b94c) [0x7fb4be48b94c]
/lib64/libc.so.6(raise+0x16) [0x7fb4be43e646]
/lib64/libc.so.6(abort+0xd3) [0x7fb4be4287f3]
/usr/lib64/galera4/libgalera_smm.so(+0x195ec) [0x7fb4b1c195ec]
/usr/lib64/galera4/libgalera_smm.so(+0x1981cd) [0x7fb4b1d981cd]
/usr/lib64/galera4/libgalera_smm.so(+0x54721) [0x7fb4b1c54721]
/usr/sbin/mysqld(wsrep::wsrep_provider_v26::append_key(wsrep::ws_handle&, wsrep::key const&)+0xaf) [0x1ec606f]
/usr/sbin/mysqld(wsrep::transaction::append_key(wsrep::key const&)+0x73) [0x1e6f7d3]
/usr/sbin/mysqld(wsrep_thd_append_key+0x105) [0xe87465]
/usr/sbin/mysqld() [0x234ac62]
/usr/sbin/mysqld(ha_innobase::wsrep_append_keys(THD*, Wsrep_service_key_type, unsigned char const*, unsigned char const*)+0x3dd) [0x1726ffd]
/usr/sbin/mysqld(ha_innobase::update_row(unsigned char const*, unsigned char*)+0x2b8) [0x1728518]
/usr/sbin/mysqld(handler::ha_update_row(unsigned char const*, unsigned char*)+0x1a6) [0x9f02e6]
/usr/sbin/mysqld(Sql_cmd_update::update_single_table(THD*)+0x1d3c) [0xdc09dc]
/usr/sbin/mysqld(Sql_cmd_dml::execute(THD*)+0x672) [0xd2b162]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0xf60) [0xcfc170]
/usr/sbin/mysqld(Prepared_statement::execute(THD*, String*, bool)+0x597) [0xd1daa7]
/usr/sbin/mysqld(Prepared_statement::execute_loop(THD*, String*, bool)+0x149) [0xd2e6f9]
/usr/sbin/mysqld(mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x145) [0xd2ee25]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xd87) [0xd02467]
/usr/sbin/mysqld(do_command(THD*)+0x32e) [0xcd0bbe]
/usr/sbin/mysqld() [0xe3f338]
/usr/sbin/mysqld() [0x1a7fe96]
/lib64/libc.so.6(+0x89c02) [0x7fb4be489c02]
/lib64/libc.so.6(+0x10ec40) [0x7fb4be50ec40]

Logs on the donor:

2025-02-18T12:05:55.771760Z 0 [Note] [MY-000000] [Galera] 0.0 (mysql-db1c): State transfer to 1.0 (mysql-db1a) complete.
2025-02-18T12:05:55.771786Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 810371905)
2025-02-18T12:05:55.771822Z 0 [Note] [MY-000000] [Galera] Processing event queue:... -nan% (0/0 events) complete.
2025-02-18T12:05:55.772000Z 0 [Note] [MY-000000] [Galera] Member 0.0 (mysql-db1c) synced with group.
2025-02-18T12:05:55.772017Z 0 [Note] [MY-000000] [Galera] Processing event queue:... 100.0% (1/1 events) complete.
2025-02-18T12:05:55.772022Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 810371905)
2025-02-18T12:05:55.772050Z 2 [Note] [MY-000000] [Galera] Server mysql-db1c synced with group
2025-02-18T12:05:55.772063Z 2 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2025-02-18T12:05:55.772069Z 2 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2025-02-18T12:05:55.772073Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2025-02-18T12:06:06.597235Z 0 [Note] [MY-000000] [Galera] Created page /var/lib/mysql/gcache.page.000005 of size 134217728 bytes
2025-02-18T12:06:11.554435Z 0 [ERROR] [MY-000000] [Galera] async IST sender failed to serve ssl://xxx.yyy.14.100:4568: ist send failed: ', asio error 'Unrecognized writeset version: -1
	 at ../../../../percona-xtradb-cluster-galera/galera/src/write_set_ng.hpp:version():98'
	 at ../../../../percona-xtradb-cluster-galera/galera/src/ist.cpp:send():1003
2025-02-18T12:06:11.554520Z 0 [Note] [MY-000000] [Galera] IST sender finished. Unlocking GCache.
2025-02-18T12:06:11.554669Z 0 [Note] [MY-000000] [Galera] async IST sender served
2025-02-18T12:06:11.555549Z 0 [Note] [MY-000000] [Galera] 1.0 (mysql-db1a): State transfer from 0.0 (mysql-db1c) complete.
...
Lots of LAST message 00000000-0000-0000-0000-000000000000:810374457,0
...
2025-02-18T12:06:19.754224Z 0 [Note] [MY-000000] [Galera] (4c4fa339-84ee, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://xxx.yyy.14.100:4567
2025-02-18T12:06:21.052376Z 0 [Note] [MY-000000] [Galera] (4c4fa339-84ee, 'ssl://0.0.0.0:4567') reconnecting to 54a63c1d-aa0d (ssl://xxx.yyy.14.100:4567), attempt 0
2025-02-18T12:06:21.052728Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:22.552719Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:24.052851Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: Connection refused
2025-02-18T12:06:24.498389Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-02-18T12:06:24.498500Z 0 [Note] [MY-000000] [Galera] evs::proto(4c4fa339-84ee, OPERATIONAL, view_id(REG,4c4fa339-84ee,52)) suspecting node: 54a63c1d-aa0d
2025-02-18T12:06:24.498513Z 0 [Note] [MY-000000] [Galera] evs::proto(4c4fa339-84ee, OPERATIONAL, view_id(REG,4c4fa339-84ee,52)) suspected node without join message, declaring inactive
2025-02-18T12:06:24.998587Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
2025-02-18T12:06:25.498746Z 0 [Note] [MY-000000] [Galera] declaring e1becf57-87b9 at ssl://xxx.yyy.14.101:4567 stable
2025-02-18T12:06:25.498891Z 0 [Note] [MY-000000] [Galera] Node 4c4fa339-84ee state primary

and after some time

2025-02-18T12:18:07.110204Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number
2025-02-18T12:21:01.641345Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number
2025-02-18T12:21:14.846499Z 0 [Note] [MY-000000] [Galera] Failed to establish connection: wrong version number

I searched Google and this forum for ideas but couldn't find anything. I did the same upgrade in our testing environment without a problem.

I also searched Google for the error message "Unrecognized writeset version: -1" but came up empty. What I did notice is that the seqno of the joining node in grastate.dat is -1.

Any idea how to get my cluster back without a full restart?

I deleted grastate.dat and gvwstate.dat. After that I restarted mysqld on db1a, the SST finished, and the node came up successfully.
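For anyone hitting the same crash, the recovery steps were roughly as follows (paths assume the default /var/lib/mysql datadir; deleting grastate.dat discards the node's saved position and forces a full SST on the next start):

```shell
# On the broken joiner (db1a) only -- do NOT run this on healthy nodes
systemctl stop mysqld
rm /var/lib/mysql/grastate.dat /var/lib/mysql/gvwstate.dat
systemctl start mysqld   # node rejoins the cluster via a fresh SST
```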

The next time you need to do this, go to another node and set freeze_purge_at_seqno to 'now' before taking the node down. This temporarily prevents the gcache from being purged while the node is offline, allowing it to rejoin via IST. Once the node has recovered, be sure to set it back to -1 to resume normal gcache purging.
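A sketch of the commands involved, assuming this refers to the Galera provider option gcache.freeze_purge_at_seqno set through wsrep_provider_options (verify the option name against your Galera version before relying on it):

```sql
-- On a node that stays up, BEFORE stopping the node you want to restart:
-- pin the gcache at the current seqno so IST remains possible
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=now';

-- Optional: note the cluster position at the time of the freeze
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';

-- AFTER the restarted node has rejoined via IST, resume normal purging:
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=-1';
```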

Thanks for the hint. This will come in handy on the next upgrade :)

For completeness here is a blogpost showing how IST works: Understanding how an IST donor is selected

As previously mentioned, freeze_purge_at_seqno is the way to ensure that IST remains possible, but keep in mind that while it is enabled the gcache won't be purged. If enough time passes (or you forget to disable it afterwards), the disk may fill up.

All in all, you should avoid restarting nodes while there is high write activity, adjust gcache.size, or even use freeze_purge_at_seqno.
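For example, a larger gcache can be configured in my.cnf so more writesets stay available for IST (2G is only an illustrative value; size it to cover the write volume expected during your maintenance window):

```ini
[mysqld]
wsrep_provider_options="gcache.size=2G"
```

gcache.size cannot be changed at runtime, so this requires a (rolling) restart to take effect.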

Regards