SST failing during xtrabackup recovery

Hello,

I am attempting to migrate our single-node Percona Server instance to a new server, which will then be used to build a cluster.

SST to the new node is failing at what appears to be the final stage. The data copy completes, the innobackupex log on the joiner shows all tables being expanded, and InnoDB data recovery starts. However, at 100% the recovery fails with the following message:


[FATAL] InnoDB: is_short 0, info_and_status_bits 0, offset 13920, o_offset 6, mismatch index 18446744073709551572, end_seg_len 56 parsed len 3
2016-11-26 04:41:23 0x7fdaf68b7740 InnoDB: Assertion failure in thread 140578415933248 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
04:41:23 UTC - xtrabackup got signal 6 ;
This could be because you hit a bug or data is corrupted.
This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x10000
xtrabackup(my_print_stacktrace+0x3b)[0xc7183b]
xtrabackup(handle_fatal_signal+0x281)[0xa8da01]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x113e0)[0x7fdaf64963e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fdaf4b0f428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fdaf4b1102a]
xtrabackup[0x6f1b41]
xtrabackup(_ZN2ib5fatalD1Ev+0x145)[0x9d62c5]
xtrabackup(_Z25page_cur_parse_insert_recmPKhS0_P11buf_block_tP12dict_index_tP5mtr_t+0x990)[0xcd2cb0]
xtrabackup[0x94263e]
xtrabackup(_Z22recv_recover_page_funcmP11buf_block_t+0xa9e)[0x9442ee]
xtrabackup(_Z20buf_page_io_completeP10buf_page_tb+0x339)[0x7fdab9]
xtrabackup(_Z13buf_read_pageRK9page_id_tRK11page_size_t+0x4e2)[0x958e22]
xtrabackup(_Z16buf_page_get_genRK9page_id_tRK11page_size_tmP11buf_block_tmPKcmP5mtr_tb+0x37c)[0x7fb89c]
xtrabackup(_Z21ibuf_init_at_db_startv+0x571)[0x91ff91]
xtrabackup(_Z9dict_bootv+0x99b)[0x8136bb]
xtrabackup(_Z34innobase_start_or_create_for_mysqlv+0x3c2c)[0x8b5a2c]
xtrabackup[0x6ea187]
xtrabackup[0x6ec322]
xtrabackup(main+0xcbf)[0x6f9fdf]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fdaf4afa830]
xtrabackup(_start+0x29)[0x70e619]

Please report a bug at https://bugs.launchpad.net/percona-xtrabackup

The donor considers the transfer successful for a short moment, but when the joiner crashes it drops back to being a single-node cluster.

I’ve retried the SST multiple times, and the failure occurs at the same point (sometimes at 99%) every time. The message differs slightly between attempts, but it generally indicates an assertion failure / corruption.

I’ve attached the configurations of both the donor and joiner nodes.

Any help would be much appreciated. It seems like the process is mostly working but failing at the very last moment. I can post more logs if it helps.

Thanks.

donor-my-cnf.txt (1.97 KB)

joiner-my-cnf.txt (1.54 KB)


xtrabackup version 2.4.4 based on MySQL server 5.7.13 Linux (x86_64) (revision id: df58cf2)

Looks like some incompatibility between the donated data and the xtrabackup processing it.

From the description it appears that you had a Percona Server instance that you moved to another server to form the first PXC node?
(Wondering how you did this, or whether anything got missed in this part.)

The original instance wasn’t moved; I replaced the percona-server packages with the percona-xtradb-cluster packages, added the wsrep configuration, and started up the single-node cluster using the data from the old standalone instance. This was following a Percona guide.
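For reference, the wsrep settings we added were roughly along these lines (the cluster name, node address and provider path are from our environment, so treat them as illustrative rather than exact):

[mysqld]
# Galera provider shipped with PXC 5.7; the path can differ per distro
wsrep_provider = /usr/lib/galera3/libgalera_smm.so
# Empty gcomm:// so this first node bootstraps a new cluster
wsrep_cluster_address = gcomm://
wsrep_cluster_name = example_cluster    # illustrative
wsrep_node_address = 10.0.0.1           # illustrative
# SST via xtrabackup, as discussed in this thread
wsrep_sst_method = xtrabackup-v2
# Settings Galera requires
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2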

Our current findings indicate the InnoDB recovery process isn’t reading the correct InnoDB configuration values from my.cnf. Originally we thought this was because the configuration was externalised in the conf.d directory, as we found a bug report suggesting that !includedir isn’t followed by the lightweight InnoDB instance xtrabackup starts for recovery.
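For context, our layout was the stock Debian-style one, roughly (paths illustrative):

# /etc/mysql/my.cnf
!includedir /etc/mysql/conf.d/

with all the innodb_* settings living in a file under /etc/mysql/conf.d/, i.e. exactly the part that bug report suggested xtrabackup’s recovery instance would skip.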

However, even after putting everything into my.cnf directly, the innobackupex.prepare.log still indicates that it is using configuration values different from what is configured in my.cnf. The property of most concern is innodb_log_file_size. This is set to 128M in my.cnf, carried over from the standalone Percona instance, but in the prepare log this value is set to 29570629632, see below.


xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = .
xtrabackup: innodb_log_files_in_group = 1
[B]xtrabackup: innodb_log_file_size = 29570629632[/B]
xtrabackup: Starting InnoDB instance for recovery.
xtrabackup: Using 104857600 bytes for buffer pool (set by --use-memory parameter)

innodb_log_files_in_group is also set to 2 in my.cnf, but the xtrabackup InnoDB recovery is setting it to 1.
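To make the mismatch explicit, my.cnf contains:

[mysqld]
innodb_log_file_size = 128M      # prepare log reports 29570629632
innodb_log_files_in_group = 2    # prepare log reports 1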

We’ve set up two local VMs with minimal data and tried the same process with a similar configuration, and we observed the same behaviour as in our live environment.

The donor logs show the correct InnoDB recovery configuration, but the joiner’s are all wrong. The joiner only creates a single ib_logfile, which indicates the log_files_in_group value of 1 is definitely being [incorrectly] used, even though xtrabackup would normally complain about this and set it to 2 when run from the command line.
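This is easy to confirm on the joiner’s datadir (path is from our setup, so adjust as needed); after the failed SST only one redo log file exists:

$ ls /var/lib/mysql/ib_logfile*
/var/lib/mysql/ib_logfile0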

We may have found the problem, as we’ve now successfully completed the SST on our local VMs.

We had the compact option enabled in xtrabackup.
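In other words, our my.cnf had carried this over, roughly:

[xtrabackup]
compact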

Looks like we’ve hit this bug: [url]https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834[/url]

This appears to be a day-one xtrabackup bug.

Why has the compact option not simply been removed or disabled, or huge warnings been put in place? There is a warning on the SST configuration page about the bug, but nothing in the xtrabackup configuration.

I’ve re-initiated the SST on our live environment without compact, and will report back if this is the resolution.

[B]RESOLVED[/B]

[B]compact[/B] was the problem. SST is now complete and the nodes are in sync.

So for anyone stumbling across this: if you have compact in your [xtrabackup] config… that’s your problem.
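The fix is simply to remove it (or comment it out) before re-running the SST:

[xtrabackup]
# compact    <- remove or comment out; the SST prepare stage cannot handle compact backups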

Hopefully Percona will sort this out by either ignoring the option automatically or making it work.