We just upgraded a small cluster (2 nodes + 1 garbd) from 5.7.39-31.61-1.focal to 5.7.40-31.63-1.focal.
We are running Ubuntu 20.04, with all OS packages up to date.
After the update we’ve had the cluster die on us twice in 4 days.
On both occasions the error was the same:
First, one of the nodes dies suddenly with signal 11:
11:59:46 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://jira.percona.com/projects/PXC/issues
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=88
max_threads=100001
thread_count=29
connection_count=12
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 38414119 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Build ID: Not Available
Server Version: 5.7.40-43-57-log Percona XtraDB Cluster (GPL), Release rel43, Revision ab4d0bd, WSREP version 31.63, wsrep_31.63
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x40)[0x5580c7adff50]
/usr/sbin/mysqld(handle_fatal_signal+0x589)[0x5580c78e2509]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f1a7335e420]
/usr/sbin/mysqld(pfs_start_mutex_wait_v1+0x4)[0x5580c7afed84]
/usr/sbin/mysqld(+0xcd1381)[0x5580c78be381]
/usr/sbin/mysqld(pfs_spawn_thread+0x168)[0x5580c7afe398]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f1a73352609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f1a72b55133]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
Some time later, the other node shuts down with this message:
2023-01-30T12:11:25.568649Z 0 [Note] WSREP: Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN
We have rolled back the update on both servers. We also noticed the garbd process was dead, so no arbiter was available (we have already added monitoring for this).
Could that (split-brain protection) be the reason the cluster auto-shut down node 2?
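Roughly, the kind of check we added looks like this (just a sketch; the garbd unit name depends on how the arbiter was installed):
# Is the arbiter still running? (unit name may be garbd or garb depending on the package)
systemctl is-active garbd || pgrep -a garbd
# Does the cluster still have quorum and see all 3 members?
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"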
Hi @nublaii welcome back to the Percona forums!
Sorry to hear that this minor version upgrade did not go smoothly.
The stack trace isn’t clear as to the source of the crash. Did you check dmesg to see if this was due to an out-of-memory (OOM) condition?
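Something along these lines would confirm or rule out the OOM killer (the timestamps below are only examples; adjust the window to your crash):
# Look for OOM-killer activity around the crash
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# Or with a time window around the crash (example timestamps)
journalctl -k --since "2023-01-30 11:30" --until "2023-01-30 12:15" | grep -iE 'out of memory|oom-killer|killed process'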
Do you have PMM connected to the cluster? With visibility into the PXC members’ activity we can inspect the period prior to the crash more deeply.
The other nodes seem to have received a clean signal to shut down. When 1 node receives signal 11, the other 2 nodes should NOT shut down; they should detect the failed member and re-form PRIMARY around the remaining two instances.
Split brain can only happen after a failure that leaves you with an even number of members, such as when 2 instances remain. In my experience this will not shut the nodes down; instead they will move into NON-PRIMARY state and stop accepting queries.
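You can verify what each surviving member thinks happened with the standard wsrep status variables, for example:
# Run on each surviving member; NON-PRIMARY means the node lost quorum and will refuse queries
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment')"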
Hi, thanks for the quick reply!
As I noted, we have only 2 nodes in the cluster + 1 garbd arbiter.
Garbd was dead, so I guess this explains both servers being unavailable?
Both machines (the ones running PXC) are monitored; memory usage is rarely over 40% and we don’t see any sudden spikes or OOM messages anywhere.
We do have PMM, but we’re still on V1. Anything to look for in particular?
Hi @nublaii
There is a PXC Cluster Summary dashboard and a PXC Node dashboard. I recommend you review those around the time of the crash for any interesting interactions. Also take a look at the InnoDB Metrics dashboard.
Usually you’ll see a metric climbing or dropping. Share those images with us and we can help you work through this.
Also, is there anything earlier in the error log, before the stack trace? Consider setting log_error_verbosity=3 so that when you attempt the upgrade again we may see further details. And since this is PXC, please review the logs from all members of the cluster.
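For example (the SET GLOBAL change applies immediately; add it to my.cnf as well so it survives a restart):
mysql -e "SET GLOBAL log_error_verbosity = 3"
# and in my.cnf, under [mysqld]:
#   log_error_verbosity = 3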
One last thought: given that 5.7 is going EOL this year, have you considered investing your energy in making the upgrade to PXC 8.0?
Seeing a very similar SIGSEGV 6 days after updating from Percona-XtraDB-Cluster-57-5.7.39-31.61.1.el7.x86_64 to Percona-XtraDB-Cluster-57-5.7.40-31.63.1.el7.x86_64.
Build ID: 148eb19378344bcb460fa7403c822bb561a86dac
Server Version: 5.7.40-43-57-log Percona XtraDB Cluster (GPL), Release rel43, Revision ab4d0bd, WSREP version 31.63, wsrep_31.63
This cluster usually does not have problems and had been up for several months before updating to 5.7.40 and then crashing 6 days later. 1 node out of 3 crashed (no garbd here).
The only thing in the logs of the other 2 servers is that 1 server had a conflict 315 seconds before the crash. That is quite a long time earlier, so it should be unrelated? There is nothing interesting in dmesg, no OOM.
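In case it is useful, these are the counters I assume are worth comparing across the members for certification conflicts (standard wsrep status variables):
# Cumulative certification failures / brute-force aborts per node
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_cert_failures','wsrep_local_bf_aborts')"
# Optionally log each conflict to the error log (I believe this is dynamic; otherwise set it in my.cnf)
mysql -e "SET GLOBAL wsrep_log_conflicts = ON"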
The setting log_error_verbosity is already set to 3.
The backtrace is indeed not very useful, unfortunately:
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf8317b]
/usr/sbin/mysqld(handle_fatal_signal+0x505)[0xd93025]
/lib64/libpthread.so.0(+0xf630)[0x7fbf83ff9630]
/usr/sbin/mysqld(pfs_start_mutex_wait_v1+0x10)[0xf9d040]
/usr/sbin/mysqld[0xd6e20a]
/usr/sbin/mysqld(pfs_spawn_thread+0x1b4)[0xf9c3a4]
/lib64/libpthread.so.0(+0x7ea5)[0x7fbf83ff1ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fbf823d4b0d]
We do not have PMM currently. It would be great to upgrade to PXC 8.0, but some of our apps still don’t support it.