Galera node fails to join after SST: redo log was created with an older mariadb-backup

I have a three-node Galera cluster. Yesterday I upgraded one node from MariaDB 10.3 with Galera 3 (Debian Buster) to MariaDB 10.5 with Galera 4 (Debian Bullseye). Things went fine and everything worked.

Today I had to re-create the cluster and re-do SST for the nodes, with a 10.3 node as the main donor for the others. The second node, running 10.3, joined and completed SST just fine, but the one I upgraded to 10.5 is unable to join: once it finishes its SST it complains:

InnoDB: Upgrade after a crash is not supported. The redo log was created with Backup 10.3.27-MariaDB.

and then aborts.
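For what it's worth, the version in that message is the one stamped into the redo log by the tool that wrote it, which matches the 10.3 donor's mariadb-backup. It can be pulled straight out of the error line; a small sketch run against a canned copy of the line:

```shell
# Extract the "created with" version from the InnoDB error message.
# The line below is a canned copy of the error, so this runs anywhere.
line='2021-02-23 12:14:12 0 [ERROR] InnoDB: Upgrade after a crash is not supported. The redo log was created with Backup 10.3.27-MariaDB.'
echo "$line" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+-MariaDB'   # prints 10.3.27-MariaDB
```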

I’ve tried removing the data directory and all logs entirely and re-doing the state transfer from scratch, but the same problem happens.
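For concreteness, the wipe-and-retry amounted to roughly the following. The datadir path would be the Debian default (/var/lib/mysql), and the wipe is sketched as a function here so it can be exercised against a scratch directory instead of a live node:

```shell
# Sketch of "wipe the datadir and force a fresh SST", demonstrated on a
# scratch directory rather than the real /var/lib/mysql. On a real node
# you would stop the service first (systemctl stop mariadb).
wipe_datadir() {
  local datadir="$1"
  # Removing everything, including grastate.dat, makes the node request
  # a full state transfer on its next start. (Dotfiles, if any, would
  # need removing too.)
  rm -rf "${datadir:?}"/*
}

tmp=$(mktemp -d)                        # stand-in for /var/lib/mysql
touch "$tmp/grastate.dat" "$tmp/ib_logfile0"
wipe_datadir "$tmp"
ls -A "$tmp"                            # prints nothing: directory is empty
```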

Here is a complete log:

Feb 23 12:13:30 pochard mariadbd[528012]: 2021-02-23 12:13:30 1 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> c962afef-9de9-11ea-a9b2-2fa479531940:65036604
Feb 23 12:13:32 pochard mariadbd[528012]: 2021-02-23 12:13:32 0 [Note] WSREP: (980beb0c-b112, 'tcp://0.0.0.0:4567') turning message relay requesting off
Feb 23 12:14:11 pochard mariadbd[528012]: 2021-02-23 12:14:11 0 [Note] WSREP: 0.0 (scaup): State transfer to 2.0 (pochard) complete.
Feb 23 12:14:11 pochard mariadbd[528012]: 2021-02-23 12:14:11 0 [Note] WSREP: Member 0.0 (scaup) synced with group.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 3 [Note] WSREP: SST received
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 3 [Note] WSREP: Server status change joiner -> initializing
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Using Linux native AIO
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Uses event mutexes
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Number of pools: 1
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Initializing buffer pool, total size = 5368709120, chunk size = 134217728
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Completed initialization of buffer pool
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] InnoDB: Upgrade after a crash is not supported. The redo log was created with Backup 10.3.27-MariaDB.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] InnoDB: Starting shutdown...
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] Plugin 'InnoDB' init function returned error.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Note] Plugin 'FEEDBACK' is disabled.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [Warning] 'thread-concurrency' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] Unknown/unsupported storage engine: InnoDB
Feb 23 12:14:12 pochard mariadbd[528012]: 2021-02-23 12:14:12 0 [ERROR] Aborting
Feb 23 12:14:12 pochard mariadbd[528012]: terminate called after throwing an instance of 'wsrep::runtime_error'
Feb 23 12:14:12 pochard mariadbd[528012]:   what():  State wait was interrupted
Feb 23 12:14:12 pochard mariadbd[528012]: 210223 12:14:12 [ERROR] mysqld got signal 6 ;
Feb 23 12:14:12 pochard mariadbd[528012]: This could be because you hit a bug. It is also possible that this binary
Feb 23 12:14:12 pochard mariadbd[528012]: or one of the libraries it was linked against is corrupt, improperly built,
Feb 23 12:14:12 pochard mariadbd[528012]: or misconfigured. This error can also be caused by malfunctioning hardware.
Feb 23 12:14:12 pochard mariadbd[528012]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Feb 23 12:14:12 pochard mariadbd[528012]: We will try our best to scrape up some info that will hopefully help
Feb 23 12:14:12 pochard mariadbd[528012]: diagnose the problem, but since we have already crashed,
Feb 23 12:14:12 pochard mariadbd[528012]: something is definitely wrong and this may fail.
Feb 23 12:14:12 pochard mariadbd[528012]: Server version: 10.5.8-MariaDB-3-log
Feb 23 12:14:12 pochard mariadbd[528012]: key_buffer_size=536870912
Feb 23 12:14:12 pochard mariadbd[528012]: read_buffer_size=786432
Feb 23 12:14:12 pochard mariadbd[528012]: max_used_connections=0
Feb 23 12:14:12 pochard mariadbd[528012]: max_threads=2002
Feb 23 12:14:12 pochard mariadbd[528012]: thread_count=4
Feb 23 12:14:12 pochard mariadbd[528012]: It is possible that mysqld could use up to
Feb 23 12:14:12 pochard mariadbd[528012]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3650182 K  bytes of memory
Feb 23 12:14:12 pochard mariadbd[528012]: Hope that's ok; if not, decrease some variables in the equation.
Feb 23 12:14:12 pochard mariadbd[528012]: Thread pointer: 0x7feb08002128
Feb 23 12:14:12 pochard mariadbd[528012]: Attempting backtrace. You can use the following information to find out
Feb 23 12:14:12 pochard mariadbd[528012]: where mysqld died. If you see no messages after this, something went
Feb 23 12:14:12 pochard mariadbd[528012]: terribly wrong...
Feb 23 12:14:12 pochard mariadbd[528012]: stack_bottom = 0x7feb0fffeb48 thread_stack 0x30000
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(my_print_stacktrace)[0x55811ec0647e]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(handle_fatal_signal)[0x55811e7172d5]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(__restore_rt)[0x7feb34a4d140]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(gsignal)[0x7feb34596ce1]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(abort)[0x7feb34580537]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(__cxa_throw_bad_array_new_length)[0x7feb349007ec]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(std::rethrow_exception(std::__exception_ptr::exception_ptr))[0x7feb3490b966]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(std::terminate())[0x7feb3490b9d1]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(__cxa_throw)[0x7feb3490bc65]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55811e421112]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(wsrep::server_state::sst_received(wsrep::client_service&, int))[0x55811ec7a63b]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(void std::vector<char, std::allocator<char> >::_M_realloc_insert<char const&>(__gnu_cxx::__normal_iterator<char*, std::vector<char, std::allocator<char> > >, char const&))[0x55811e9bb00a]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(void std::vector<char, std::allocator<char> >::_M_realloc_insert<char const&>(__gnu_cxx::__normal_iterator<char*, std::vector<char, std::allocator<char> > >, char const&))[0x55811e9bbc64]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(MyCTX_nop::finish(unsigned char*, unsigned int*))[0x55811e94eee2]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(start_thread)[0x7feb34a41ea7]
Feb 23 12:14:12 pochard mariadbd[528012]: ??:0(clone)[0x7feb34658def]
Feb 23 12:14:12 pochard mariadbd[528012]: Trying to get some variables.
Feb 23 12:14:12 pochard mariadbd[528012]: Some pointers may be invalid and cause the dump to abort.
Feb 23 12:14:12 pochard mariadbd[528012]: Query (0x0): (null)
Feb 23 12:14:12 pochard mariadbd[528012]: Connection ID (thread ID): 3
Feb 23 12:14:12 pochard mariadbd[528012]: Status: NOT_KILLED
Feb 23 12:14:12 pochard mariadbd[528012]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
Feb 23 12:14:12 pochard mariadbd[528012]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
Feb 23 12:14:12 pochard mariadbd[528012]: information that should help you find out what is causing the crash.
Feb 23 12:14:12 pochard mariadbd[528012]: We think the query pointer is invalid, but we will try to print it anyway.
Feb 23 12:14:12 pochard mariadbd[528012]: Query:
Feb 23 12:14:12 pochard mariadbd[528012]: Writing a core file...
Feb 23 12:14:12 pochard mariadbd[528012]: Working directory at /var/lib/mysql
Feb 23 12:14:12 pochard mariadbd[528012]: Resource Limits:
Feb 23 12:14:12 pochard mariadbd[528012]: Limit                     Soft Limit           Hard Limit           Units
Feb 23 12:14:12 pochard mariadbd[528012]: Max cpu time              unlimited            unlimited            seconds
Feb 23 12:14:12 pochard mariadbd[528012]: Max file size             unlimited            unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max data size             unlimited            unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max stack size            8388608              unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max core file size        0                    unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max resident set          unlimited            unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max processes             47790                47790                processes
Feb 23 12:14:12 pochard mariadbd[528012]: Max open files            16384                16384                files
Feb 23 12:14:12 pochard mariadbd[528012]: Max locked memory         65536                65536                bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max address space         unlimited            unlimited            bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max file locks            unlimited            unlimited            locks
Feb 23 12:14:12 pochard mariadbd[528012]: Max pending signals       47790                47790                signals
Feb 23 12:14:12 pochard mariadbd[528012]: Max msgqueue size         819200               819200               bytes
Feb 23 12:14:12 pochard mariadbd[528012]: Max nice priority         0                    0
Feb 23 12:14:12 pochard mariadbd[528012]: Max realtime priority     0                    0
Feb 23 12:14:12 pochard mariadbd[528012]: Max realtime timeout      unlimited            unlimited            us
Feb 23 12:14:12 pochard mariadbd[528012]: Core pattern: core

How can I get this one to re-join?

I do plan on upgrading the other nodes, but now I’m a bit worried that I did something wrong: if I can’t get this third node back into the cluster before I do the upgrade, I may run into more issues.


If I had to guess, you’ve found a bug in MariaDB Cluster and the mariadb-backup SST method.
I suggest you report the bug to MariaDB.


Also, it’s recommended that all nodes in the cluster run the same version and configuration.
Differences among the nodes can trigger bugs or unexpected behavior and performance issues that will impact the other nodes.
Try downgrading the package from 10.5 back to 10.3. If you wish to upgrade, upgrade to the latest minor version first, and when doing a major upgrade, do one major version at a time (don’t skip 10.4).
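As an aside, when scripting checks like "is the donor's version older than the joiner's", plain string comparison of version numbers is a trap; GNU `sort -V` understands version strings. A generic sketch, nothing MariaDB-specific:

```shell
# Which of two MariaDB versions is older? sort -V compares version
# strings component by component, so 10.3.27 sorts before 10.5.8.
older=$(printf '%s\n' 10.3.27 10.5.8 | sort -V | head -n1)
echo "$older"   # prints 10.3.27
```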

Regards


How is it possible for all nodes to have the same version while I’m upgrading them?

Unfortunately, the upgrade path on Debian goes straight from 10.3 to 10.5; there is no 10.4 available.


Hi again rantoie.

Here are the instructions for upgrading from 10.3 to 10.4: Upgrading from MariaDB 10.3 to MariaDB 10.4 with Galera Cluster - MariaDB Knowledge Base

Which Debian version are you using?
Are you upgrading the OS version at the same time as MariaDB?
Do you have Galera version 26.4.2 or later on all nodes?
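One way to answer those version questions on each node (assuming stock Debian package names and root access over the local socket) is `mysql -N -e "SHOW STATUS LIKE 'wsrep_provider_version'"` together with a `dpkg -l` listing. The filter below is exercised on a canned `dpkg -l` line so it can be run anywhere:

```shell
# Reduce `dpkg -l 'galera-*' 'mariadb-*'` output to "package version"
# pairs; lines starting with "ii" are installed packages. Demonstrated
# on a canned line so it runs without MariaDB installed:
echo 'ii  galera-4  26.4.6-1  amd64  Replication framework' \
  | awk '/^ii/ {print $2, $3}'   # prints: galera-4 26.4.6-1
```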

As explained in the official docs:
“Note that when upgrading the Galera wsrep provider, sometimes the Galera protocol version can change. The Galera wsrep provider should not start using the new protocol version until all cluster nodes have been upgraded to the new version, so this is not generally an issue during a rolling upgrade. However, this can cause issues if you restart a non-upgraded node in a cluster where the rest of the nodes have been upgraded”

Regards


I upgraded from Debian Buster to Debian Bullseye; it was an OS upgrade.

The version of MariaDB in Buster is 10.3.27, and there are both a galera-3 package (version 25.3.25) and a galera-4 package (26.4.5) available in Buster.

It appears that Bullseye has MariaDB 10.5.8, with galera-3 version 25.3.31 and galera-4 version 26.4.6.

The version of mariadb-backup is 10.3.27 in Buster and 10.5.8 in Bullseye. It’s not possible to install a different version without also changing the MariaDB server version at the same time.

To recap: right now I’ve got two machines on Debian Buster, and the third one is on Bullseye. I cannot get the third one to re-join the cluster, even though it was fine right after the upgrade. It only stopped working when I had to rebuild the cluster from scratch.

I have a few options before me, and I’m unsure of the right way to go.


I ended up reinstalling the broken node with the previous release and having it join the cluster, so I had a healthy three nodes again. Then I proceeded to upgrade each one, knowing that if I had to rebuild the entire cluster from scratch I’d be in this situation again… so I hurried that upgrade, and now they all run the same version.
