Galera node crash on InnoDB buffer pool resize

I experienced a node crash while resizing the InnoDB buffer pool…
The cluster node hung during the downsizing operation and, after 10 minutes, collapsed completely.
I can provide the full log, but not on a public forum; there is a lot of sensitive information in it.
Version used: mysqld 5.7.30-33-57-log
Does anybody have any thoughts on what might have caused it? I believed it was supposed to be a "safe" operation…

2021-06-22T16:28:47.276100Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2021-06-22 18:28:47 0x7fa33e938700 InnoDB: Assertion failure in thread 140339106252544 in file ut0ut.cc line 922
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: MySQL :: MySQL 5.7 Reference Manual :: 14.22.2 Forcing InnoDB Recovery
InnoDB: about forcing recovery.
16:28:47 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://jira.percona.com/projects/PXC/issues

key_buffer_size=25165824
read_buffer_size=131072
max_used_connections=500
max_threads=1001
thread_count=508
connection_count=499
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 423348 K bytes of memory
Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong…
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0xee87dc]
/usr/sbin/mysqld(handle_fatal_signal+0x459)[0x7ab149]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7faf37f25890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7faf35eaa067]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7faf35eab448]
/usr/sbin/mysqld[0x7799db]
/usr/sbin/mysqld(_ZN2ib5fatalD1Ev+0x15d)[0x118d8fd]
/usr/sbin/mysqld(srv_error_monitor_thread+0xaf2)[0x112da52]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8064)[0x7faf37f1e064]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7faf35f5d62d]
You may download the Percona XtraDB Cluster operations manual by visiting
Percona XtraDB Cluster – The MySQL Clustering Solution. You may find information
in the manual which will help you identify the cause of the crash.

I would say that what you are trying to do is not recommended. Growing the buffer pool is easy, but shrinking is hard. Read the MySQL server team’s blog on this: Resizing the InnoDB Buffer Pool Online | MySQL Server Blog

Basically, the shrink took too long and InnoDB crashed itself. This is built-in functionality to protect against infinite stalls. If you want to do this, you'll need to force a flush of dirty pages beforehand, possibly block writes to prevent locking, etc.
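
The "flush dirty pages first, then shrink" advice can be sketched roughly as below. This is a hedged illustration, not a tested procedure: `run_sql` is a hypothetical callable (for example a thin wrapper around a DB-API cursor) that runs one statement and returns its rows. The server variables `innodb_max_dirty_pages_pct` and `innodb_buffer_pool_size` (dynamic since 5.7.5) and the status variables `Innodb_buffer_pool_pages_dirty` and `Innodb_buffer_pool_resize_status` are real MySQL 5.7 names; the dirty-page target of 1000, the timeouts, and the exact "Completed resizing buffer pool" wording are assumptions. It only reduces the work the resize has to do; it does not change InnoDB's 600-second semaphore watchdog, and it does not block writes.

```python
import time
from typing import Callable, Sequence, Tuple

# Hypothetical executor: runs one SQL statement and returns its rows
# (e.g. a wrapper around a DB-API cursor's execute() + fetchall()).
RunSql = Callable[[str], Sequence[Tuple]]


def global_status(run_sql: RunSql, name: str) -> str:
    """Return one SHOW GLOBAL STATUS value as a string ('' if absent)."""
    rows = run_sql(f"SHOW GLOBAL STATUS LIKE '{name}'")
    return str(rows[0][1]) if rows else ""


def flush_then_shrink(run_sql: RunSql, new_size_bytes: int,
                      dirty_target: int = 1000, timeout_s: float = 3600.0,
                      poll_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Drain dirty pages, then request an online buffer pool resize and wait."""
    old_pct = run_sql("SELECT @@GLOBAL.innodb_max_dirty_pages_pct")[0][0]
    deadline = time.monotonic() + timeout_s
    try:
        # 1. Make the page cleaners flush aggressively before touching the pool.
        run_sql("SET GLOBAL innodb_max_dirty_pages_pct = 0")
        while int(global_status(run_sql, "Innodb_buffer_pool_pages_dirty")) > dirty_target:
            if time.monotonic() > deadline:
                raise TimeoutError("dirty pages did not drain; resize not started")
            sleep(poll_s)
        # 2. Remember the previous resize message so a stale one is not
        #    mistaken for the completion of this resize.
        before = global_status(run_sql, "Innodb_buffer_pool_resize_status")
        run_sql(f"SET GLOBAL innodb_buffer_pool_size = {int(new_size_bytes)}")
        # 3. Poll until a new "Completed resizing ..." message appears
        #    (wording assumed from MySQL 5.7 output). A timeout here only stops
        #    watching; the server-side resize keeps running.
        while True:
            status = global_status(run_sql, "Innodb_buffer_pool_resize_status")
            if status != before and status.startswith("Completed resizing buffer pool"):
                break
            if time.monotonic() > deadline:
                raise TimeoutError(f"resize still in progress: {status!r}")
            sleep(poll_s)
    finally:
        # Restore the original dirty-page ceiling whatever happened above.
        run_sql(f"SET GLOBAL innodb_max_dirty_pages_pct = {old_pct}")
```

Polling status variables (rather than trusting the `SET` to return quickly) matters here because the resize runs in a background thread after `SET GLOBAL innodb_buffer_pool_size` returns.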

Since this is PXC, why not simply do a rolling restart to lower buffer pool? That’s the whole point to running such an HA configuration. I assume you have a load balancer or ProxySQL handling traffic? There should be no outage experienced if everything is configured correctly.

Actually, there is nothing in the blog saying that downsizing is not recommended…
The actual downsize was by 4G (32G -> 28G), and it took just a few seconds on the secondary node.
What was strange was that the node practically stalled the whole time (many queries were blocked, similar to what happens when you run ALTER TABLE on a large table).
As for the reload… well, I have ProxySQL, but not everything is connected through it… (we are actually migrating to that model now, but we are not there yet)
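
For a change like 32G -> 28G, two rules from the MySQL 5.7 manual help size the operation: `innodb_buffer_pool_size` is rounded up to a multiple of `innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances`, and online resizing is performed chunk by chunk. The sketch below assumes the 5.7 defaults (128M chunks, 8 instances), since the actual settings on this cluster are not shown; plug in your own values.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def effective_buffer_pool_size(requested: int,
                               chunk_size: int = 128 * MiB,
                               instances: int = 8) -> int:
    """Round a requested innodb_buffer_pool_size up to a multiple of
    innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances."""
    unit = chunk_size * instances
    return -(-requested // unit) * unit  # ceiling division, then scale back up


def chunks_removed(old_size: int, new_size: int, chunk_size: int = 128 * MiB) -> int:
    """Number of chunks that disappear when shrinking between two
    (already rounded) buffer pool sizes."""
    return (old_size - new_size) // chunk_size


if __name__ == "__main__":
    print(effective_buffer_pool_size(28 * GiB) // GiB, "GiB")
    print(chunks_removed(32 * GiB, 28 * GiB), "chunks")
```

With those assumed defaults, 28G is already a valid size, and the shrink removes (32G - 28G) / 128M = 32 chunks, roughly the memory InnoDB has to withdraw pages from before it can be released. The manual's own example of the rounding rule is 9G with 16 instances, which becomes 10G.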

But back to the topic… you can see that there is no dramatic activity in the performance…

To me it feels like a deadlock scenario… but it's a deadlock that can bring down the whole node…
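
One way to test the "deadlock" suspicion before the 600-second watchdog from the error log fires is to watch the SEMAPHORES section of `SHOW ENGINE INNODB STATUS` (its `Status` column) while the resize runs. The sketch below is hedged: the line format it matches ("--Thread N has waited at FILE line L for S seconds the semaphore:") is assumed from MySQL 5.7 output and should be checked against your own server, and the 120-second warning level is an arbitrary illustration.

```python
import re
from typing import List, Tuple

# Matches per-thread wait lines in the SEMAPHORES section, e.g.
# "--Thread 140339106252544 has waited at buf0buf.cc line 100 for 245.00 seconds the semaphore:"
# (format assumed from MySQL 5.7 output; verify on your own server).
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for ([0-9.]+) seconds the semaphore"
)


def long_semaphore_waits(innodb_status: str,
                         warn_at_s: float = 120.0) -> List[Tuple[str, str, float]]:
    """Return (thread id, 'file:line', seconds) for waits >= warn_at_s,
    longest first. InnoDB itself aborts once a wait exceeds 600 s."""
    waits = []
    for m in WAIT_RE.finditer(innodb_status):
        thread, src_file, line, secs = m.groups()
        if float(secs) >= warn_at_s:
            waits.append((thread, f"{src_file}:{line}", float(secs)))
    return sorted(waits, key=lambda w: w[2], reverse=True)
```

If the same file:line keeps showing up with growing wait times while few other threads make progress, that points to a latch stall inside the resize rather than ordinary lock contention between queries.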

And yeah, next time I will at least offload traffic from the node before playing with this…
