Help - MySQL Crash - need help debugging

So this evening, my MySQL (8.0.31) instance restarted - it looks like the OOM killer took it down, which I still need to investigate.

However, upon restart… it’s not yet come back up. I’m looking for some help / assistance / guidance here on what I can do to bring it back up (even if that means dropping broken tables / dealing with data recovery).

When it starts back up, it gets so far, then crashes with an error about possible corruption of an index page - but no clue which one? I’ve been through the various innodb_force_recovery modes with no success, and I’ve also run innochecksum against all the .ibd files - it found absolutely nothing wrong.
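
(For reference, this is roughly what I ran, with the server stopped - assuming the default datadir of /var/lib/mysql:)

# innochecksum needs mysqld stopped; it exits non-zero on a bad page
for f in /var/lib/mysql/*/*.ibd; do
  innochecksum "$f" || echo "FAILED: $f"
done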

2023-05-18T19:38:14.995693Z 0 [Warning] [MY-010139] [Server] Changed limits: max_open_files: 10000 (requested 12500)
2023-05-18T19:38:14.995699Z 0 [Warning] [MY-010142] [Server] Changed limits: table_open_cache: 3745 (requested 4000)
2023-05-18T19:38:15.177912Z 0 [Warning] [MY-011069] [Server] The syntax '--master-info-repository' is deprecated and will be removed in a future release.
2023-05-18T19:38:15.177920Z 0 [Warning] [MY-011069] [Server] The syntax '--relay-log-info-repository' is deprecated and will be removed in a future release.
2023-05-18T19:38:15.177932Z 0 [Warning] [MY-011068] [Server] The syntax 'log_slave_updates' is deprecated and will be removed in a future release. Please use log_replica_updates instead.
2023-05-18T19:38:15.177942Z 0 [Warning] [MY-011069] [Server] The syntax '--transaction-write-set-extraction' is deprecated and will be removed in a future release.
2023-05-18T19:38:15.177997Z 0 [Warning] [MY-010915] [Server] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.
2023-05-18T19:38:15.179186Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.31) starting as process 3953821
2023-05-18T19:38:15.182025Z 0 [Warning] [MY-013242] [Server] --character-set-server: 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.
2023-05-18T19:38:15.182038Z 0 [Warning] [MY-013244] [Server] --collation-server: 'utf8mb3_unicode_ci' is a collation of the deprecated character set UTF8MB3. Please consider using UTF8MB4 with an appropriate collation instead.
2023-05-18T19:38:15.198423Z 0 [Warning] [MY-013907] [InnoDB] Deprecated configuration parameters innodb_log_file_size and/or innodb_log_files_in_group have been used to compute innodb_redo_log_capacity=4294967296. Please use innodb_redo_log_capacity instead.
2023-05-18T19:38:15.201987Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2023-05-18T19:38:18.850561Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2023-05-18T19:38:19.559295Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2023-05-18T19:38:19.559335Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2023-05-18T19:38:19.578247Z 0 [Warning] [MY-010604] [Repl] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=db1-r$
2023-05-18T19:38:19.593617Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.31' socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - GPL.
2023-05-18T19:38:19.593640Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2023-05-18T19:42:20.406260Z 892 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2023-05-18T19:42:45.684507Z 3169 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-05-18T19:42:45.685661Z 892 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2023-05-18T19:42:45.685693Z 892 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2023-05-18T19:42:45.687532Z 3328 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''$
2023-05-18T19:42:46.792217Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2023-05-18T19:42:46.792607Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.10.0.100:3306 on view 16844389657918695:1.'
2023-05-18T19:42:46.792999Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2023-05-18T19:42:46.793353Z 3527 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2023-05-18T19:46:18.236118Z 0 [ERROR] [MY-012721] [InnoDB] N heap is wrong 179, 155
 len 16384; hex 6ca889eb000000de0000461400002d6000001d80696ad87045bf00000000000000000000021300223f6380b3162e0d6f02a100050000008b0000000000000000000000000000000002fe0000000000000000000000000000000000000000010002001c696e66696d756d0001000b$
InnoDB: End of page dump
InnoDB: Page may be an index page where index id is 766
2023-05-18T19:46:18.304963Z 0 [ERROR] [MY-011937] [InnoDB] [FATAL] Apparent corruption of an index page [page id: space=531, page number=222] to be written to data file. We intentionally crash the server to prevent corrupt data from ending up in data files.
2023-05-18T19:46:18.304985Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: buf0dblwr.cc:1537:ib::fatal triggered thread 139846900352768
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

I’m at a bit of a loss here as to what I can do to bring this server back up.

Any help appreciated.

If it helps any…

/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x2128b81]
/usr/sbin/mysqld(print_fatal_signal(int)+0x387) [0xfd62d7]
/usr/sbin/mysqld(my_server_abort()+0x7e) [0xfd642e]
/usr/sbin/mysqld(my_abort()+0xe) [0x2122a9e]
/usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x33a) [0x242277a]
/usr/sbin/mysqld(ib::fatal::~fatal()+0xc8) [0x2425038]
/usr/sbin/mysqld(Double_write::croak(buf_block_t const*)+0x26c) [0x24912cc]
/usr/sbin/mysqld(Double_write::prepare(buf_page_t const*, void**, unsigned int*)+0x14d) [0x24914fd]
/usr/sbin/mysqld(Double_write::enqueue(buf_flush_t, buf_page_t*, file::Block const*)+0x8f) [0x249f69f]
/usr/sbin/mysqld(dblwr::write(buf_flush_t, buf_page_t*, bool)+0x1d2) [0x249fcd2]
/usr/sbin/mysqld(buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+0x32f) [0x24a5eaf]
/usr/sbin/mysqld() [0x24a88c1]
/usr/sbin/mysqld(buf_flush_do_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, unsigned long*)+0x223) [0x24a8eb3]
/usr/sbin/mysqld() [0x24aa1cf]
/usr/sbin/mysqld() [0x24ac4fd]
/usr/sbin/mysqld(void Detached_thread::operator()<void (*)()>(void (*&&)())+0xca) [0x22e78aa]
/lib64/libstdc++.so.6(+0xc2b13) [0x7f35d30edb13]
/lib64/libpthread.so.0(+0x81ca) [0x7f35d43601ca]
/lib64/libc.so.6(clone+0x43) [0x7f35d2704e73] 

Whatever it is, it appears to be very consistent on the page id / space / page number etc.

2023-05-19T00:33:32.233681Z 1 [ERROR] [MY-012711] [InnoDB] Wrong owned count 4, 8, rec 12776
len 16384; hex 87b6a76300002b4a000025f7000053ae00001d80fb212c6145bf00000000000000000000021300163f6380b3151d131e3e590002000100760000000000000000000000000000000002fe000000000000000000000000000

[FATAL] Apparent corruption of an index page [page id: space=531, page number=11082] to be written to data file. We intentionally crash the server to prevent corrupt data from ending up in data files.

Hi @TimIgoe

Was your server upgraded recently to this version?
Also, from the logs it appears this server is part of a cluster, since it’s trying to do IST. If that’s the case, and you have other nodes in your cluster, maybe you could consider trying with SST and see if that helps.

Best,
Mauricio.

The server hasn’t been upgraded recently. I’m currently restoring the other part of the cluster - it apparently dropped out of sync a few days ago, so I’m replaying binary logs to bring it back up to date.

If I could work out what was causing the crash / corruption, it should be possible to move some of the data (tables) out of the way to get MySQL running again.

Both error logs indicate this is an issue with the doublewrite buffer. You could remove the doublewrite files from MySQL’s datadir and then start mysql; InnoDB will create new ones.
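
A minimal sketch of that, assuming the default datadir of /var/lib/mysql (adjust the paths to your layout):

systemctl stop mysqld
mkdir -p /root/dblwr-backup
# move rather than delete, in case the files are needed later
mv /var/lib/mysql/#ib_*.dblwr /root/dblwr-backup/
systemctl start mysqld   # InnoDB recreates the .dblwr files on startup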

You can opt to set innodb_force_recovery = 1 and see if that allows MySQL to start. Then you should run OPTIMIZE TABLE on every table, which rebuilds each one.

https://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
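
A rough sketch of that sequence (config path assumed - put the setting under your existing [mysqld] section):

systemctl stop mysqld
# in /etc/my.cnf, under [mysqld]:
#   innodb_force_recovery = 1    # values above 3 risk permanent data loss
systemctl start mysqld
# if it comes up cleanly: remove the setting, restart, then rebuild every table
mysqlcheck --optimize --all-databases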

As Mauricio says, since this is part of a GR cluster, the best option is to wipe the datadir completely, restore a fresh backup, and then rejoin the GR, rather than attempting to recover using the parameter above.

Which files are you suggesting I remove from the data directory? Happy to try it.

Force recovery 1 through 6 made no difference - it still wouldn’t even start.

Removing #ib_16384_*.dblwr and #ib_archive made no difference.

I think the thing that’s getting me the most here is that I’ve been through all the .ibd files with innochecksum and they’ve all “passed”.

Question - Could/Would upgrading MySQL to 8.0.33 help?

If you tried force recovery 1 through 6 and it wouldn’t start, then you have an unrecoverable issue that not even InnoDB can get past.

Did you erase all the *.dblwr files as well? Did MySQL recreate them before crashing again?

Attempting to upgrade while your system is unstable isn’t generally a good idea, but at this point it can’t hurt to try. The only way you will fix this is to wipe it and rebuild from one of the other GR members.
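
If you do go the rebuild route, one option is the clone plugin (8.0.17+). A sketch only - the donor address and user are placeholders, and both nodes need the plugin plus the BACKUP_ADMIN/CLONE_ADMIN privileges:

mysql <<'SQL'
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = '10.10.0.101:3306';
-- replaces this instance's data with a full copy from the donor
CLONE INSTANCE FROM 'clone_user'@'10.10.0.101':3306 IDENTIFIED BY '...';
SQL
# mysqld restarts itself after the clone; then START GROUP_REPLICATION;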

Yup, suspected as much.

Wiped all the dblwr files, #ib_archive etc. - all were recreated.

The crash is the same - same point, same IDs, same numbers.

Yet, oddly, innochecksum passes all the .ibd files with flying colours?

/usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x33a) [0x242277a]
/usr/sbin/mysqld(ib::fatal::~fatal()+0xc8) [0x2425038]
/usr/sbin/mysqld(Double_write::croak(buf_block_t const*)+0x26c) [0x24912cc]
/usr/sbin/mysqld(Double_write::prepare(buf_page_t const*, void**, unsigned int*)+0x14d) [0x24914fd]
/usr/sbin/mysqld(Double_write::enqueue(buf_flush_t, buf_page_t*, file::Block const*)+0x8f) [0x249f69f]
/usr/sbin/mysqld(dblwr::write(buf_flush_t, buf_page_t*, bool)+0x1d2) [0x249fcd2]
/usr/sbin/mysqld(buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+0x32f) [0x24a5eaf]

Everything points to the doublewrite (dblwr) buffer as the problem. But you removed those files and they were recreated. What about deleting the InnoDB buffer pool dump file?
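
That’s the ib_buffer_pool file in the datadir (the default name - yours may differ if innodb_buffer_pool_filename is set). It’s only a warm-up list of page IDs, so removing it is safe:

systemctl stop mysqld
rm /var/lib/mysql/ib_buffer_pool   # recreated on the next buffer pool dump
systemctl start mysqld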

innochecksum only checks the table files, not what’s in the doublewrite buffer.
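
For what it’s worth, the fatal error does identify the tablespace by number (space=531), and since your server runs for a few minutes before the flush crashes it, you could try mapping that number to a file name while it’s up - e.g.:

mysql -e "SELECT SPACE, NAME FROM information_schema.INNODB_TABLESPACES WHERE SPACE = 531;"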

Yep, everything is pointing towards that doublewrite buffer during InnoDB initialisation - always the same error, at the same referenced point.

Upgrading MySQL (Long shot) didn’t solve anything.

Going to have to restore from the replica, but at several TB it’s going to take a while :slight_smile:

Follow on question.

Replication was stopped on the secondary copy of the database (another error) - but if I start it now, it tries to recover against the failed host. I’ve applied several binlogs to the secondary; how would I now go about resetting the old secondary to become the new master (i.e. make it forget everything it was doing before the original problem and assume that it is now “in charge”)?

Is it as simple as removing the -relay-bin-group_replication_applier/recovery files and indexes?

Anyone?

RESET REPLICA ALL
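
For example, a sketch (run on the old secondary once you’re sure you want it standalone; the channel names are the usual GR ones from your relay log file names):

mysql <<'SQL'
STOP GROUP_REPLICATION;
-- forgets each channel's position metadata and deletes its relay logs
RESET REPLICA ALL FOR CHANNEL 'group_replication_applier';
RESET REPLICA ALL FOR CHANNEL 'group_replication_recovery';
SET GLOBAL super_read_only = OFF;  -- allow writes so it can act as the new source
SQL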

Is that what will drop the current status?