Percona MySQL 8.0.22-13 crashes on JSON queries

We have a 1 master / 6 slave setup, and the master server crashes about 5 times a day. The timing appears quite random: it does not always crash at the highest load, and sometimes it crashes at off-peak hours. The slaves never crash.

Looking at the MySQL log, the statement running at the time of the crash appears to be a very ordinary SELECT query:
SELECT user_auth.*, DATE(user_auth.premium_expire_date) AS premium_expire_date FROM user_auth WHERE user_auth.fub_id_people = '31763'

I am clueless here; where should I look to find the cause of the crashes?


The red arrows in the attached monitoring graph mark when the crashes happened.

  • Host: 8 cores, 16 GB RAM, CentOS 7.6
  • Percona MySQL 8.0.22-13 (I upgraded to the latest version)
  • Size of database: 12 GB
  • The load on the master is light; CPU never exceeds 40%.
  • The number of queries ranges from 200 to 790 at the times the crashes happen. There is no clear pattern linking the crashes to load.

mysqld.cnf

[mysqld]
lower_case_table_names=1
event_scheduler=ON
innodb_file_per_table=ON
innodb_ft_min_token_size = 1
innodb_ft_enable_stopword =OFF
innodb_ft_server_stopword_table=realagent_prd/_index_stopword
innodb_buffer_pool_size = 10240M
innodb_buffer_pool_instances = 5
log_bin                 = mysql-bin
server-id               = 1
binlog_do_db            = realagent_prd
binlog_expire_logs_seconds = 259200
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
symbolic-links=0
log-error=/var/log/mysql/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

The crash logs are very similar each time:

23:22:42 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Build ID: e8de225be5945e9ad0279fcd08607fdf7a4b6c92
Server Version: 8.0.22-13 Percona Server (GPL), Release 13, Revision 6f7822f

Thread pointer: 0x7fd96806c450
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fdc80052c70 thread_stack 0x46000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x210c1ed]
/usr/sbin/mysqld(handle_fatal_signal+0x3c3) [0x1260d53]
/lib64/libpthread.so.0(+0xf630) [0x7fdcc1e11630]
/usr/sbin/mysqld() [0x23de97d]
/usr/sbin/mysqld(trx_undo_update_rec_get_update(unsigned char const*, dict_index_t const*, unsigned long, unsigned long, unsigned long, unsigned long, trx_t*, mem_block_info_t*, upd_t**, lob::undo_vers_t*, type_cmpl_t&)+0x80b) [0x23e08ab]
/usr/sbin/mysqld(trx_undo_prev_version_build(unsigned char const*, mtr_t*, unsigned char const*, dict_index_t const*, unsigned long*, mem_block_info_t*, unsigned char**, mem_block_info_t*, dtuple_t const**, unsigned long, lob::undo_vers_t*)+0x47b) [0x23e242b]
/usr/sbin/mysqld(row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x25c) [0x2392e2c]
/usr/sbin/mysqld(row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x38f1) [0x237fa91]
/usr/sbin/mysqld(ha_innobase::general_fetch(unsigned char*, unsigned int, unsigned int)+0x298) [0x21f6448]
/usr/sbin/mysqld(handler::ha_rnd_next(unsigned char*)+0x66) [0xe26b26]
/usr/sbin/mysqld(TableScanIterator::Read()+0x1d) [0x104417d]
/usr/sbin/mysqld(FilterIterator::Read()+0x14) [0x131f214]
/usr/sbin/mysqld(SELECT_LEX_UNIT::ExecuteIteratorQuery(THD*)+0x3e3) [0x11e0793]
/usr/sbin/mysqld(SELECT_LEX_UNIT::execute(THD*)+0x2c) [0x11e097c]
/usr/sbin/mysqld(Sql_cmd_dml::execute_inner(THD*)+0x3cb) [0x1164b6b]
/usr/sbin/mysqld(Sql_cmd_dml::execute(THD*)+0x6c0) [0x116f5b0]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0xaf8) [0x110e5a8]
/usr/sbin/mysqld(Prepared_statement::execute(String*, bool)+0x8e8) [0x1140ef8]
/usr/sbin/mysqld(Prepared_statement::execute_loop(String*, bool)+0xff) [0x114538f]
/usr/sbin/mysqld(mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x191) [0x1145951]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x19a5) [0x1115105]
/usr/sbin/mysqld(do_command(THD*)+0x204) [0x1116574]
/usr/sbin/mysqld() [0x1251c40]
/usr/sbin/mysqld() [0x2620ea4]
/lib64/libpthread.so.0(+0x7ea5) [0x7fdcc1e09ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fdcbff5896d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fd968028938): SELECT user_auth.*             ,DATE(user_auth.premium_expire_date) AS premium_expire_date         FROM user_auth         WHERE user_auth.fub_id_people = '31763'
Connection ID (thread ID): 40788
Status: NOT_KILLED

The table in question:

CREATE TABLE `user_auth` (
	`id` INT NOT NULL AUTO_INCREMENT,
	`full_name` VARCHAR(50) NOT NULL DEFAULT '',
	`password_change` VARCHAR(50) NOT NULL DEFAULT '' ,
	`password_hash` VARCHAR(65) NOT NULL DEFAULT '',
	`password_change_date` TIMESTAMP NULL DEFAULT NULL,
	`type` INT NOT NULL DEFAULT '1',
	`phone` VARCHAR(50) NULL DEFAULT '',
	`extra` JSON NULL DEFAULT NULL,
	`business` JSON NULL DEFAULT NULL,
	`airtable_base` TEXT NULL,
	`fub_id` INT NULL DEFAULT NULL,
	`fub_id_people` INT NULL DEFAULT NULL,
	`fub_assigned_user_id` INT NULL DEFAULT NULL,
	`premium_expire_date` TIMESTAMP NULL DEFAULT NULL,
	`verified` INT NOT NULL DEFAULT '0',
	`send_sms_count` INT NULL DEFAULT '0',
	`send_sms_date` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
	`createdOn` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
	`updatedOn` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
	PRIMARY KEY (`id`),
	UNIQUE INDEX `phone` (`phone`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=576835
;

Hi Bill,

Thanks for providing enough information to start troubleshooting. Sadly, it's not enough to reproduce the crash on my side, nor can I find similar bug reports.

In cases like this, we need either a fully reproducible test case or a core dump to analyze (which is not viable through the forum).

Reading the stack trace, we can see that a prepared statement is involved, and that MVCC and the undo logs play a part (so the problem might not be the query alone).

I have checked the upstream MySQL 8.0.23 release notes (MySQL :: MySQL 8.0 Release Notes :: Changes in MySQL 8.0.23 (2021-01-18, General Availability)), and there are a few bug fixes for both JSON and prepared statements, although the bug reports are not public and I cannot correlate the above stack trace with anything fixed in 8.0.23. It's a long shot, but since there were some bug fixes and changes in the code paths for JSON and prepared statements, it is possible that the issue is fixed in 8.0.23. Since upstream MySQL and Percona Server are drop-in replacements for each other, you can consider upgrading to upstream 8.0.23 to check whether it fixes your issue, and switching back to Percona Server once PS 8.0.23 is released (it usually takes around a month after an upstream release for PS to release the same version). Take into consideration that, due to MySQL 8's inability to downgrade, if you upgrade to 8.0.23 you won't be able to revert to 8.0.22 unless you do a logical backup and restore.

Other than trying to upgrade to 8.0.23, I would also suggest checking the error log for any other errors, checking dmesg for hardware issues, and checking the maximum history list length (see the query below) in case an open transaction made the history list grow, causing more undo segments to exist. More info:
https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html
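
One way to check it (a sketch, using the standard INFORMATION_SCHEMA metric name in 8.0; the trx_rseg_history_len counter is enabled by default):

-- Current history list length
SELECT name, count
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';

-- The same value is also printed in the TRANSACTIONS section of:
SHOW ENGINE INNODB STATUS;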

If you do get a reproducible test case, find any errors in the log or dmesg, or notice any other anomalies on the server at the time of the crash, let us know and we will provide further feedback.


Thanks Tutte, I wasn’t expecting a reply this fast!

The concepts of MVCC and undo logs are new to me; I am going to study the topic.

Since yesterday I have tried multiple things in an attempt to fix the issue. Because it is our production system, I have to mitigate the crashes first.

  • Ran OPTIMIZE TABLE user_auth.
  • Moved the SELECT query seen in the error log from the master to a slave.
  • Our web developer pointed out that the query filters on the non-indexed column fub_id_people and takes about 0.5 s. I have added an index to that column, as shown below; query time is now down to 0.001 s.
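
For reference, the index change was along these lines (a sketch; the index name idx_fub_id_people is my own choice):

-- Add a secondary index on the column used in the WHERE clause
ALTER TABLE user_auth ADD INDEX idx_fub_id_people (fub_id_people);

-- Verify the optimizer now uses the new index instead of a full table scan
EXPLAIN SELECT user_auth.*, DATE(user_auth.premium_expire_date) AS premium_expire_date
FROM user_auth WHERE user_auth.fub_id_people = '31763';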

Let me watch it for a few days and see if there is something else crashing the server.


Hi again Bill,

Thanks for the heads up.

If the crash only occurs at some peaks in load, decreasing the load might avoid the problem (although it will not fix the underlying issue).
You can also consider switching to the same version of upstream MySQL (to avoid the upgrade/downgrade issue) to confirm whether the issue happens only on Percona Server or on both.


Good news: I watched it for 24 hours after making the changes, and the master no longer crashes. The current fix probably only "delays" the problem until we have 2x to 4x the traffic hitting the server, but it buys us time to figure out the real cause.

I also made a discovery after watching SHOW ENGINE INNODB STATUS; for a while. The table in question has 500k+ rows, and each JSON cell is a few KB. It seems that slow queries on the un-indexed column of this JSON-heavy table keep old row versions alive. If my understanding of MVCC is correct, there are many copies of the JSON data in the history list.

An un-indexed query runs for about 1 s to 2.5 s. In that window, if an automated job is updating the JSON data of hundreds of rows, even more copies are created in the history list. At some point the server crashed.
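
To see which open transactions are keeping old versions alive, something like this can be run (a sketch; INNODB_TRX is the standard INFORMATION_SCHEMA table in 8.0):

-- Transactions open for more than 2 seconds, oldest first
SELECT trx_id, trx_started, trx_rows_modified, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 2 SECOND
ORDER BY trx_started;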

This is not a risk I can take (who knows what will happen if I switch to upstream). The production site serves 100k users per day, and anything that goes wrong would be catastrophic for the business. However, I am going to migrate the master to a server with 2x the RAM, just to rule out hardware problems.

I appreciate the help! I feel that the problem requires knowledge of InnoDB internals; we might need to engage Percona for consulting work if it comes back.


Thanks for the heads up.

I also made a discovery after watching SHOW ENGINE INNODB STATUS; for a while. The table in question has 500k+ rows, and each JSON cell is a few KB. It seems that slow queries on the un-indexed column of this JSON-heavy table keep old row versions alive. If my understanding of MVCC is correct, there are many copies of the JSON data in the history list.

An un-indexed query runs for about 1 s to 2.5 s. In that window, if an automated job is updating the JSON data of hundreds of rows, even more copies are created in the history list. At some point the server crashed.

Makes sense. Consider using shorter transactions so that undo logs can be purged: while a long-lived transaction is open, MVCC won't purge old row versions, in order to guarantee a repeatable read for that transaction (and for all the transactions that started after it).
With shorter transactions, the history list length should stay smaller.

Do note that only one InnoDB purge thread works on a given table at a time, so if you have one massive table with a lot of activity, purge can become a bottleneck. And last, do check whether you have disk saturation and/or swapping occurring, which can be detrimental.
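
To check how many purge threads are configured, something like this can be used (a sketch; innodb_purge_threads defaults to 4 in 8.0 and can only be changed at server startup):

-- Number of purge threads (set in my.cnf, read-only at runtime)
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';

-- If purge is falling behind, the history list length (see the earlier
-- query on information_schema.INNODB_METRICS) keeps growing.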

Regards
