Recurring Signal 11 on multiple clusters

Hello,

we upgraded to MySQL 8 in December and have been struggling with recurring Signal 11 crashes ever since. We’ve managed to stabilise the environment (5 clusters with 2 nodes each) quite a bit, but we still see about 2-10 Signal 11 occurrences per month. We just can’t get rid of it and have no clue what the actual reason could be … therefore hoping for some help from you guys out there!

So first of all the general information:
Packages:

  • percona-server-client-8.0.19-10.1.el8.x86_64
  • percona-release-1.0-25.noarch
  • percona-server-shared-8.0.19-10.1.el8.x86_64
  • percona-server-shared-compat-8.0.19-10.1.el8.x86_64
  • percona-server-devel-8.0.19-10.1.el8.x86_64
  • percona-server-server-8.0.19-10.1.el8.x86_64

:information_source: yes, I know … it is not the newest release. But we already tried an upgrade to a newer release and the number of Signal 11 incidents increased. Hence we’d like to nail down the reason first before we try to upgrade again (which takes a lot of time! And don’t forget - it’s a one-way upgrade, so getting back to the original version is quite a lot of work).

OS:

  • CentOS 8.2 (4.18.0-193.19.1.el8_2.x86_64)

More info on the situation:
Out of the 5 clusters only 1 has never had Signal 11 errors. Apart from having fewer resources, fewer databases, fewer tables, etc. its setup is identical (same packages, OS, …). So I’d say it is fair to assume the Signal 11 crashes are triggered either by the data or by the way the application uses the Percona Servers.

Of the other 4 clusters that experienced Signal 11 crashes, only 1 is still crashing regularly (at least twice a month). On the others the crashes occur only very rarely (maybe every 2-3 months), but they still happen. A lot of people have looked into it here, but we really can’t put a finger on what is causing it.

Here is the most recent Signal 11 error:

11:22:07 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x7e7e77f8a360
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 7f03e46f8d50 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x1f88161]
/usr/sbin/mysqld(handle_fatal_signal+0x333) [0x11eda53]
/lib64/libpthread.so.0(+0x12dd0) [0x7f0d93aacdd0]
/usr/sbin/mysqld(MDL_ticket::has_stronger_or_equal_type(enum_mdl_type) const+0xb) [0xf3ccbb]
/usr/sbin/mysqld(MDL_ticket_store::find_in_hash(MDL_request const&) const+0x65) [0xf3eab5]
/usr/sbin/mysqld(MDL_context::find_ticket(MDL_request*, enum_mdl_duration*)+0x19) [0xf3eb59]
/usr/sbin/mysqld(MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)+0x45) [0xf40735]
/usr/sbin/mysqld(MDL_context::acquire_lock(MDL_request*, unsigned long)+0xa7) [0xf411f7]
/usr/sbin/mysqld(open_table(THD*, TABLE_LIST*, Open_table_context*)+0x12b8) [0x101b808]
/usr/sbin/mysqld(open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)+0x473) [0x1020ee3]
/usr/sbin/mysqld(dd::Open_dictionary_tables_ctx::open_tables()+0xb6) [0x1d321f6]
/usr/sbin/mysqld(bool dd::cache::Storage_adapter::get<dd::Item_name_key, dd::Abstract_table>(THD*, dd::Item_name_key const&, enum_tx_isolation, bool, dd::Abstract_table const**)+0xc4) [0x1da7184]
/usr/sbin/mysqld(bool dd::cache::Shared_dictionary_cache::get<dd::Item_name_key, dd::Abstract_table>(THD*, dd::Item_name_key const&, dd::cache::Cache_element<dd::Abstract_table>**)+0x7f) [0x1d9ba9f]
/usr/sbin/mysqld(bool dd::cache::Dictionary_client::acquire<dd::Item_name_key, dd::Abstract_table>(dd::Item_name_key const&, dd::Abstract_table const**, bool*, bool*)+0x28d) [0x1d5d38d]
/usr/sbin/mysqld(bool dd::cache::Dictionary_client::acquire<dd::Abstract_table>(std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, dd::Abstract_table const**)+0x1a9) [0x1d5ec19]
/usr/sbin/mysqld(get_table_share(THD*, char const*, char const*, char const*, unsigned long, bool, bool)+0x7d8) [0x1019cc8]
/usr/sbin/mysqld(open_table(THD*, TABLE_LIST*, Open_table_context*)+0xd10) [0x101b260]
/usr/sbin/mysqld(open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)+0x473) [0x1020ee3]
/usr/sbin/mysqld(open_tables_for_query(THD*, TABLE_LIST*, unsigned int)+0x93) [0x10219d3]
/usr/sbin/mysqld(Sql_cmd_dml::prepare(THD*)+0xd8) [0x10f2358]
/usr/sbin/mysqld(Sql_cmd_dml::execute(THD*)+0xcf) [0x10fd2df]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0x3bd0) [0x10aaac0]
/usr/sbin/mysqld(mysql_parse(THD*, Parser_state*, bool)+0x408) [0x10acb78]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x2025) [0x10af125]
/usr/sbin/mysqld(do_command(THD*)+0x20c) [0x10afe4c]
/usr/sbin/mysqld() [0x11decd0]
/usr/sbin/mysqld() [0x2473a40]
/lib64/libpthread.so.0(+0x82de) [0x7f0d93aa22de]
/lib64/libc.so.6(clone+0x43) [0x7f0d91a8ce83]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7e99c41b4408): UPDATE email_289_detailclick AS DC INNER JOIN email_289_subscriber AS S ON DC.email_subscriber_id=S.id SET DC.email_subscriber_id=NULL, DC.ident_hash=IFNULL(DC.ident_hash, MD5(CONCAT(’ r J th g K\nA < nLc % \nH % |s A _ M '0TY l ‘, UNHEX(MD5(CONCAT(’ x \r !+ V & O " O + 5c{ ' MZ>3 J ', DC.email_subscriber_id)))))) WHERE S.type!=‘test’ AND S.subscriber_id IN (SELECT id FROM mw3_account_2412.aggregator_643941203 )
Connection ID (thread ID): 70373877
Status: NOT_KILLED

:information_source: The statements are usually very, very long, so the statements shown in the Signal 11 errors are of course not complete. The statements also vary from crash to crash …

We can’t really see any errors in the logs other than:

[Warning] [MY-011825] [InnoDB] Cannot add field <<field>> in table <<database>>.<<table>> because after adding it, the row size is 8139 which is greater than maximum allowed size (8126) for a record on index leaf page.

:information_source: The customer is aware that they need to redesign their tables … but I don’t think this is the reason for the signal 11 errors.
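
:information_source: For reference, a rough way to spot the widest tables is to sum the declared maximum byte length per column (an approximation only - this is not exactly how InnoDB counts the 8126-byte record limit, but it surfaces the candidates quickly):

-- Approximate maximum row width per table, widest first (user schemas only)
SELECT table_schema, table_name,
       SUM(IFNULL(character_octet_length, 8)) AS approx_max_row_bytes
FROM information_schema.columns
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY table_schema, table_name
ORDER BY approx_max_row_bytes DESC
LIMIT 20;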

Any help is very much appreciated!! :bowing_man:

Thanks & Best Regards

1 Like

Sig 11 occurs when MySQL tries to access memory that it is not allowed to access and the kernel kills the process. How much maximum memory is configured? Run this SQL:

SELECT CONCAT((@@key_buffer_size + (@@innodb_buffer_pool_size * 1.05 + 20*1024*1024) + @@innodb_log_buffer_size 
+ @@max_connections * (@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size + @@join_buffer_size + @@binlog_cache_size + @@tmp_table_size
+ @@thread_stack)) / 1024/1024/1024, ' GB') AS "POTENTIAL MEMORY USAGE";

Is that more memory than the server has? If so, you need to back off on buffer pool size, set the other buffers to default values, and potentially reduce max_connections.
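
If you do end up trimming values, something along these lines would do it without a restart (the numbers below are placeholders only, not a recommendation for your workload - adjust them to your own sizing; SET PERSIST keeps them across restarts):

-- Placeholder values only - size these for your own workload
SET PERSIST max_connections      = 300;
SET PERSIST sort_buffer_size     = 2 * 1024 * 1024;
SET PERSIST join_buffer_size     = 2 * 1024 * 1024;
SET PERSIST read_rnd_buffer_size = 1 * 1024 * 1024;
SET PERSIST read_buffer_size     = 1 * 1024 * 1024;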

Also, running clusters with only 2 nodes is very dangerous. There is a 100% possibility of split-brain when a node goes down, because the surviving node cannot form a majority on its own. You should have 3 nodes in each cluster.

1 Like

Good Morning matthewb!

Thanks for your feedback. Yes, agreed, that is usually the problem with Sig 11. But the thing is:

  1. that we still have plenty(!) of memory available whenever the crash happens (usually 200G+). See the memory graph below from the most recent crash this weekend:

    Add Info - Sig11 Error:

02:58:33 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x7f35b8028130
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 7f94d75fcd50 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x1f88161]
/usr/sbin/mysqld(handle_fatal_signal+0x333) [0x11eda53]
/lib64/libpthread.so.0(+0x12dd0) [0x7fc062b1ddd0]
/usr/sbin/mysqld(MDL_ticket::has_stronger_or_equal_type(enum_mdl_type) const+0xb) [0xf3ccbb]
/usr/sbin/mysqld(MDL_ticket_store::find_in_hash(MDL_request const&) const+0x65) [0xf3eab5]
/usr/sbin/mysqld(MDL_context::find_ticket(MDL_request*, enum_mdl_duration*)+0x19) [0xf3eb59]
/usr/sbin/mysqld(MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)+0x45) [0xf40735]
/usr/sbin/mysqld(MDL_context::acquire_lock(MDL_request*, unsigned long)+0xa7) [0xf411f7]
/usr/sbin/mysqld(open_table(THD*, TABLE_LIST*, Open_table_context*)+0x12b8) [0x101b808]
/usr/sbin/mysqld(open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)+0x473) [0x1020ee3]
/usr/sbin/mysqld(dd::Open_dictionary_tables_ctx::open_tables()+0xb6) [0x1d321f6]
/usr/sbin/mysqld(bool dd::cache::Storage_adapter::get<dd::Item_name_key, dd::Abstract_table>(THD*, dd::Item_name_key const&, enum_tx_isolation, bool, dd::Abstract_table const**)+0xc4) [0x1da7184]
/usr/sbin/mysqld(bool dd::cache::Shared_dictionary_cache::get<dd::Item_name_key, dd::Abstract_table>(THD*, dd::Item_name_key const&, dd::cache::Cache_element<dd::Abstract_table>**)+0x7f) [0x1d9ba9f]
/usr/sbin/mysqld(bool dd::cache::Dictionary_client::acquire<dd::Item_name_key, dd::Abstract_table>(dd::Item_name_key const&, dd::Abstract_table const**, bool*, bool*)+0x28d) [0x1d5d38d]
/usr/sbin/mysqld(bool dd::cache::Dictionary_client::acquire<dd::Abstract_table>(std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, dd::Abstract_table const**)+0x1a9) [0x1d5ec19]
/usr/sbin/mysqld(lock_check_constraint_names(THD*, TABLE_LIST*)+0x11d) [0x1124ddd]
/usr/sbin/mysqld(mysql_rm_db(THD*, MYSQL_LEX_CSTRING const&, bool)+0xf63) [0x103a033]
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0x3b6b) [0x10aaa5b]
/usr/sbin/mysqld(mysql_parse(THD*, Parser_state*, bool)+0x408) [0x10acb78]
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x2025) [0x10af125]
/usr/sbin/mysqld(do_command(THD*)+0x20c) [0x10afe4c]
/usr/sbin/mysqld() [0x11decd0]
/usr/sbin/mysqld() [0x2473a40]
/lib64/libpthread.so.0(+0x82de) [0x7fc062b132de]
/lib64/libc.so.6(clone+0x43) [0x7fc060afde83]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f35b9103478): DROP DATABASE IF EXISTS mw3_export_202395452
Connection ID (thread ID): 228666748
Status: NOT_KILLED

  2. yes, we are overcommitting memory. But we’ve done that for years already with older MySQL versions. The reason is that the cluster sits right behind a load balancer which forwards all connections to one node (basically an active/passive setup). So we’ve set max_connections very high to ensure that the node rather runs into an OOM (and therefore falls out of the load balancer pool) than starts refusing connections (which would lead to a split-brain situation). Plus, we usually only have about 60-80 connections … if the connection count rises our monitoring sends out an alert and we’d jump in to handle the situation in time (which barely ever happens though). A quick way to double-check the current and peak connection counts is shown below.
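
For context, this is the status check for that (Threads_connected is the current count, Max_used_connections the high-water mark since the last restart):

SHOW GLOBAL STATUS LIKE 'Threads_connected';     -- current number of client connections
SHOW GLOBAL STATUS LIKE 'Max_used_connections';  -- peak since the last restart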

(…)

1 Like

(…)
Below you’ll find the calculation and current values:

> mysql> SELECT CONCAT((@@key_buffer_size + (@@innodb_buffer_pool_size * 1.05 + 20*1024*1024) + @@innodb_log_buffer_size
>     -> + @@max_connections * (@@read_buffer_size + @@read_rnd_buffer_size + @@sort_buffer_size + @@join_buffer_size + @@binlog_cache_size + @@tmp_table_size
>     -> + @@thread_stack)) / 1024/1024/1024, ' GB') AS "POTENTIAL MEMORY USAGE";
> +-------------------------+
> | POTENTIAL MEMORY USAGE  |
> +-------------------------+
> | 12581.82702636718750 GB |
> +-------------------------+
> 1 row in set (0.00 sec)
> 
> mysql> SELECT @@key_buffer_size;
> +-------------------+
> | @@key_buffer_size |
> +-------------------+
> |        4294967296 |
> +-------------------+
> 1 row in set (0.01 sec)
> 
> mysql> select @@innodb_buffer_pool_size;
> +---------------------------+
> | @@innodb_buffer_pool_size |
> +---------------------------+
> |              343597383680 |
> +---------------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select (@@innodb_buffer_pool_size * 1.05 + 20*1024*1024)
>     -> ;
> +---------------------------------------------------+
> | (@@innodb_buffer_pool_size * 1.05 + 20*1024*1024) |
> +---------------------------------------------------+
> |                                   360798224384.00 |
> +---------------------------------------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@innodb_log_buffer_size;
> +--------------------------+
> | @@innodb_log_buffer_size |
> +--------------------------+
> |                 67108864 |
> +--------------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@max_connections;
> +-------------------+
> | @@max_connections |
> +-------------------+
> |              1500 |
> +-------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@read_buffer_size;
> +--------------------+
> | @@read_buffer_size |
> +--------------------+
> |            4194304 |
> +--------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@read_rnd_buffer_size;
> +------------------------+
> | @@read_rnd_buffer_size |
> +------------------------+
> |               16777216 |
> +------------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@sort_buffer_size;
> +--------------------+
> | @@sort_buffer_size |
> +--------------------+
> |           16777216 |
> +--------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@join_buffer_size;
> +--------------------+
> | @@join_buffer_size |
> +--------------------+
> |          134217728 |
> +--------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@binlog_cache_size;
> +---------------------+
> | @@binlog_cache_size |
> +---------------------+
> |               32768 |
> +---------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@tmp_table_size;
> +------------------+
> | @@tmp_table_size |
> +------------------+
> |       8589934592 |
> +------------------+
> 1 row in set (0.00 sec)
> 
> mysql> select @@thread_stack;
> +----------------+
> | @@thread_stack |
> +----------------+
> |        1048576 |
> +----------------+
> 1 row in set (0.00 sec)

I had on-call duty this weekend … so I’m going to catch up on some sleep and then I’ll recheck if/which values can be reduced. If you have any recommendations in the meantime, I’d be very grateful of course!

Thanks a lot for your input and Best Regards

1 Like

If you have no MyISAM tables, remove key_buffer_size (set to default). Reduce tmp_table_size to 1GB (and make max_heap_table_size the same).
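
If you want to confirm how much MyISAM is actually still in use, a quick count per engine will tell you (system schemas excluded):

-- Tables and total size per storage engine, user schemas only
SELECT engine,
       COUNT(*) AS tables,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY engine;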

Does dmesg | grep OOM actually show OOM errors? I didn’t see that above anywhere to confirm this is the issue.

rather than stopping connections (which would lead to a split-brain situation)

That’s not how split-brain works. Split-brain is when two nodes cannot talk to each other (it has nothing to do with the client) and form 2 separate clusters. You should invest in a proper 3-node setup to prevent this. Even just spinning up a small VM and running garbd will help prevent this issue.

1 Like

Hi matthewb,

Does dmesg | grep OOM actually show OOM errors? I didn’t see that above anywhere to confirm this is the issue.

→ nope, no OOM. Usually there is about 100-200G (or sometimes even more) memory available when the Signal 11 hits …

Unfortunately I haven’t managed to change anything yet … still reviewing, comparing against the default values, re-evaluating settings … but I wanted to give you feedback already so you know I’m still on it :slight_smile:

If you have no MyISAM tables, remove key_buffer_size (set to default). Reduce tmp_table_size to 1GB (and make max_heap_table_size the same).

There are still a lot of MyISAM tables in use … we were thinking that this could also be a cause of the problem, as in most cases MyISAM tables were identified in the Sig 11 error message, and MyISAM support has been reduced quite a bit in MySQL 8 compared to previous versions. But with some Sig 11 errors only InnoDB tables are involved … so that was a false track we’d been on. Long story short - we still need the increased values for key_buffer_size, tmp_table_size, etc. due to the heavy usage of MyISAM :upside_down_face:

Here is the my.cnf by the way - maybe we are completely overlooking some major misconfiguration?! :thinking: 20210630_my.cnf.txt (2.3 KB)

Thanks & Best Regards

1 Like

max-heap-table-size and tmp-table-size should be the same, as MySQL will always pick the lower of the two as the true maximum size of an in-memory temp table before converting it to disk.
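
For example (1GB as suggested earlier; SET PERSIST keeps the setting across restarts):

SET PERSIST tmp_table_size      = 1024 * 1024 * 1024;  -- 1GB
SET PERSIST max_heap_table_size = 1024 * 1024 * 1024;  -- keep identical to tmp_table_size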

Yes, MyISAM is (for all intents and purposes) a dead engine and you should be switching them all to InnoDB or MyRocks, depending on query patterns.

If dmesg isn’t showing any OOM messages, then you are not actually having out-of-memory-style issues. This could very well be a bug but more analysis would need to occur for that. You don’t have a repeatable test case? Or maybe a core dump? Our engineers would need either (or both) to be able to diagnose this.

1 Like

Hey matthewb,

sorry again for the delay … crazy busy times :unamused:
So I’ve reviewed the settings and there is nothing that I would recognise as the root of the evil. Some may not be exactly perfect, but nothing that would explain something like this. Yes, we overcommit memory, but that is intentional and we don’t run into OOM problems, as mentioned above :upside_down_face:

The customer is aware that they need to move away from MyISAM. They are continuously working on that … but it will take them months or more to redesign their application and migrate all their customer data accordingly.

Yes, a core dump would make sense … I just need to figure out how to properly configure it and ensure there is enough space for it, as I assume it will become huge. Once I have one available I’ll come back to you to clarify how to hand it over. I also need to get the OK from our customer to share it, but I think that is only going to be a formality.

Do you require the core dump to include the buffer pool? Just asking because I think it would save a lot of space to disable @@innodb_buffer_pool_in_core_file :upside_down_face:
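
For the record, this is what I would toggle - as far as I understand the variable is dynamic (available since 8.0.14), so no restart should be needed:

SELECT @@innodb_buffer_pool_in_core_file;           -- ON by default
SET GLOBAL innodb_buffer_pool_in_core_file = OFF;   -- exclude buffer pool pages from core dumps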

About how to reproduce the issue:
The problem is that the statements differ every time … we haven’t figured out what the common denominator might be. I can provide you with all the Signal 11 errors we’ve had so far if that helps you/your engineers?

Thanks again for your support and sorry for the delay from my end.

Best Regards

1 Like

Hello @matthewb!

Can you let me know if you need innodb_buffer_pool_in_core_file enabled? :slight_smile:
If yes - the customer would like to know whether any data/queries (sensitive data) would then end up in that core dump too. As far as I understand, there is a high chance that database pages, and therefore sensitive data, will end up in the core dump.

Please confirm/correct, thanks.

BR

1 Like

I don’t believe the buffer pool in the core dump would be necessary. Some data would certainly be in the core dump, as data exists in the buffer pool. No queries though.

1 Like

Thanks. I’m waiting for the customer’s OK and will then come back to you once I have a core dump available … it will probably take a week or two.

1 Like

@um-juho Make sure to submit it on https://jira.percona.com/ for our developers. Do not submit it here on these community forums.

1 Like

Hello @matthewb !

Just wanted to let you know that I created PS-7829 a few minutes ago, now that I have a core dump from one of the clusters. Unfortunately the core dump (127G) is a "little bit" beyond the JIRA upload limit of 10MB … so I have to figure out how to provide that core dump in the first place.

Anyway … should we close this thread with a reference to PS-7829? Communication will probably happen in the JIRA ticket from now on …

BR

1 Like