Random crashes on 3 node cluster, ubuntu 20.04

Hi,

I have a 3-node cluster, all 3 nodes running Ubuntu 20.04.
2 nodes are running mysql Ver 8.0.27-18.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3), but my 3rd node, which was only recently added, is running
mysql Ver 8.0.29-21.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel21, Revision 250bc93, WSREP version 26.4.3)

I recently got a crash on node 1, and looking in the error log I see this:

2022-10-18T07:02:52.584476Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 6: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:03:50.554441Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT3S, socket stats: rtt: 1477 rttvar: 554 rto: 204000 lost: 0 last_data_recv: 3360 cwnd: 10 last_queued_since: 3366939103 last_delivered_since: 10806297790912744 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:03:54.055310Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT3S, socket stats: rtt: 1607 rttvar: 605 rto: 204000 lost: 0 last_data_recv: 3492 cwnd: 10 last_queued_since: 3497177238 last_delivered_since: 10806301291783721 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:03:57.555817Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr  timed out, no messages seen in PT3S, socket stats: rtt: 1678 rttvar: 695 rto: 204000 lost: 0 last_data_recv: 3492 cwnd: 10 last_queued_since: 3496236840 last_delivered_since: 10806304792285149 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:05:09.036524Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:05:14.038314Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 1: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:07:39.852581Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:08:17.898145Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:08:17.902549Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:09:24.907731Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():133
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
  what():  remote_endpoint: Transport endpoint is not connected
2022-10-18T07:13:01.091653Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
07:13:01 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Build ID: 2165eff2f1909b2f032b76b423382ec097755ae3
Server Version: 8.0.27-18.1 Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3, wsrep_26.4.3

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x555fc9542ea1]
/usr/sbin/mysqld(handle_fatal_signal+0x393) [0x555fc8561f63]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f273d398420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f273ca7000b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f273ca4f859]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7f273ce27911]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7f273ce3338c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7f273ce333f7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7f273ce336a9]
/usr/lib/libgalera_smm.so(+0x1e569) [0x7f2730b1a569]
/usr/lib/libgalera_smm.so(+0xa050a) [0x7f2730b9c50a]
/usr/lib/libgalera_smm.so(+0xa20ab) [0x7f2730b9e0ab]
/usr/lib/libgalera_smm.so(+0xa4428) [0x7f2730ba0428]
/usr/lib/libgalera_smm.so(+0xaa9d3) [0x7f2730ba69d3]
/usr/lib/libgalera_smm.so(+0x9b707) [0x7f2730b97707]
/usr/lib/libgalera_smm.so(+0x893a2) [0x7f2730b853a2]
/usr/lib/libgalera_smm.so(+0x193258) [0x7f2730c8f258]
/usr/lib/libgalera_smm.so(+0x1bc5ae) [0x7f2730cb85ae]
/usr/lib/libgalera_smm.so(+0x1bc6d6) [0x7f2730cb86d6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f273d38c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f273cb4c133]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

The process crashed at 2022-10-18T07:13:01.091653Z.
I have no idea why it crashes; the log doesn't say specifically.


That message usually appears when something external is probing the TCP port used for Galera communication (4567 by default).

You can try to reproduce it by telnetting to the port, or by scanning it with nmap or similar.
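For example, from another machine (the node IP below is a placeholder, and 4567 is assumed to be the Galera port in use):

# connect once with telnet, or scan with nmap
telnet 192.0.2.1 4567
nmap -p 4567 192.0.2.1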

If all cluster nodes run the same version of the Galera plugin, this is unlikely to happen due to any legitimate cluster activity.


Hi,

I am not aware of anything else using that port at all.
So you suggest telnetting to the port or running nmap when it possibly crashes again?
Or, e.g., doing lsof -i :4567 to see what's listening on the port, something like this?
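(4567 assumed as the Galera port here:)

# show which process has the Galera port open
sudo lsof -i :4567
# or just the listening socket via ss
sudo ss -ltnp | grep 4567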

All 3 nodes are installed the same way,
though nodes 1 & 2 have been through a big series of in-place upgrades from Ubuntu 16.04 all the way up to 20.04,
whereas node 3 is a fresh install of 20.04.

They do have the same config file setup.


These seem to be the same issue:
https://forums.percona.com/t/mysql-crash-boost-wrapexcept-std-system-error/16573
https://forums.percona.com/t/looks-like-bug-to-many-connection-crashes-pxc/17920


Actually, I think it is related to some kind of port scanner.
Assume I have a 2-node cluster whose Galera ports are 4030 and 5030 respectively, with both nodes on the same host (the default port is 4567).
Now start some sysbench load on a node.
In parallel, run the port scan:

while true; do nmap -p5030,4030 localhost; done

One of the nodes will end up with the log shown initially.

My proposal is to follow this article, https://docs.percona.com/percona-xtradb-cluster/8.0/security/secure-network.html, and set up iptables rules that allow communication between cluster nodes but drop everything else hitting those ports, including port scans.
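A minimal sketch of such rules, assuming three nodes with placeholder addresses 192.0.2.1-3 and the default PXC ports (3306, 4444, 4567, 4568); run the equivalent on each node for its two peers:

# accept cluster traffic from the other two nodes (example for 192.0.2.1)
iptables -A INPUT -p tcp -s 192.0.2.2 -m multiport --dports 3306,4444,4567,4568 -j ACCEPT
iptables -A INPUT -p tcp -s 192.0.2.3 -m multiport --dports 3306,4444,4567,4568 -j ACCEPT
# drop anything else hitting those ports, including port scans
iptables -A INPUT -p tcp -m multiport --dports 3306,4444,4567,4568 -j DROP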


We do have Nessus, and it did run a scan yesterday.
I can't see that exact node in the scan's host results, probably because of the "Network Congestion Detected" finding.


Hi @Kamil_Holubicki

My issue still persists: whenever a Nessus scan runs, the services on that node go down. Just now I upgraded my Galera provider to 4.12 and restarted the scan, hoping that the node doesn't go down this time.

I also tried raising open_files_limit to its maximum and setting max_connections = 1000; even that didn't help.
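Roughly what I set (the exact values are just what I tried, and open_files_limit can additionally be capped by systemd's LimitNOFILE):

[mysqld]
# settings tried while troubleshooting; they did not help
open_files_limit = 1000000
max_connections  = 1000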
All hope is now on the Galera version upgrade (referring to the Jira link below, which mentions that Nessus scans crash Galera version 26.4.3 and that the fix is in 26.4.12):
https://jira.mariadb.org/browse/MDEV-25068

But I think the reason Nessus impacts the services for some people and not for others is the template used for the scan; in our scenario Nessus also runs some kind of pen testing as per the CIS benchmark, which is why the services go down at some point.

@Simon_Karberg What do you see in your general logs during scanning?


How do you upgrade Galera?

I don't have general_log enabled in my MySQL cluster; too much “noise”.


I documented it just today in another forum thread, but I'm waiting for someone to confirm whether this is the best way to do it.

I'm also not sure yet whether this is the solution, as the scan is still running; I am just sitting with my fingers crossed until the scan completes, waiting to see whether the services crash.


Ah, OK.
So it's simply bundled into the .rpm package in your case,
and in my case it's in my .deb package.
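So I guess on my side upgrading the provider just means upgrading the PXC .deb packages, one node at a time, something like this (the package names are an assumption on my part):

# check which Percona packages have updates available
sudo apt update
apt list --upgradable | grep -i percona
# then upgrade them on this node (rolling through the cluster one node at a time)
sudo apt install --only-upgrade percona-xtradb-cluster-server percona-xtradb-cluster-client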

my latest node is:

mysql> show status like 'wsrep_provider_version';
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 4.12(04bfb95) |
+------------------------+---------------+
1 row in set (0.01 sec)

mysql> select @@version;
+-------------+
| @@version   |
+-------------+
| 8.0.29-21.1 |
+-------------+
1 row in set (0.00 sec)

my 2 other nodes are:

mysql> show status like 'wsrep_provider_version';
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| wsrep_provider_version | 4.10(9728532) |
+------------------------+---------------+
1 row in set (0.02 sec)

mysql> select @@version;
+-------------+
| @@version   |
+-------------+
| 8.0.27-18.1 |
+-------------+
1 row in set (0.01 sec)

But then I don't know if wsrep_provider_version is the right thing to look at, since you talk about copying the libgalera_smm.so file :thinking:
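I guess I can at least check which provider library is actually loaded and which package owns it, something like this (the path is just the one from the stack trace above):

# which .so is configured as the wsrep provider
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider'"
# and which .deb package owns that file
dpkg -S /usr/lib/libgalera_smm.so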


I am just sitting with my fingers crossed until the scan completes, waiting to see whether the services crash.

Check my answer above where I mentioned nmap. Just plug in your IPs/ports and run the loop from some host outside the cluster (but one that has a route to it).
In my opinion, updating the Galera library will not help. Only isolating the port scanner from the cluster will do the job.


So you are saying that I should run a “static” nmap against my MySQL nodes instead of waiting for e.g. Nessus to scan the host(s)?
I don't see what that changes, other than that you control when the scan is running :thinking:


I mean that with nmap you can simulate the Nessus scanning behavior and be sure that you are scanning these particular ports. I assume Nessus does much more, and you have no control over when it tries the PXC ports, so the probability of triggering the problem is much lower.

Once you have triggered it, set up your iptables rules and run the nmap test again. If it is then OK, it is highly probable that Nessus will not do any harm either.
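Something along these lines from a host outside the cluster, before and after applying the iptables rules (the IPs are placeholders, and 3306/4444/4567/4568 are assumed as the PXC ports):

while true; do nmap -p 3306,4444,4567,4568 192.0.2.1 192.0.2.2 192.0.2.3; done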


Hi @Kamil_Holubicki @Simon_Karberg,
This is to update you that my issue looks to be fixed by upgrading the Galera version: I was able to run the scan 2 times now without a crash, though I'm still keeping it under observation in case something still goes wrong. I will also enable SELinux and run the scan again; however, the crashes were happening both with and without SELinux.

https://jira.mariadb.org/browse/MDEV-25068

@Simon_Karberg The reason I have to copy the libgalera_smm.so file from an 8.0.29 environment is that the garbd RPM doesn't provide this file; it only provides /usr/bin/garbd, /etc/sysconfig/garbd and a few other files. The library is provided by the percona-xtradb-cluster-server RPM, and since installing that RPM could disturb the current environment, I have to install it on another machine and copy the file from there to the 8.0.28 environment.
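For reference, a possible alternative (assuming the library really is inside that server RPM, as described above) would be to extract it from the RPM without installing anything:

# unpack only the library from the server RPM into the current directory
# (the RPM filename is only an example)
rpm2cpio percona-xtradb-cluster-server-8.0.29-21.1.el7.x86_64.rpm | cpio -idmv '*libgalera_smm.so'
# then copy the extracted file over to the garbd host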

Thanks
Adi


Note: the steps below were not part of the solution.
I also tried to install the Galera-4 RPM from Codership to check what it provides; during its installation, one of its dependencies was the boost-program-options RPM, and in our errors we see the same Boost symbol, 'boost::wrapexcept<std::system_error>'. During installation of the garbd RPM from Percona, this dependency was not installed, but I believe it is already bundled into the package.

Thanks
Adi
