I have a 3-node cluster; all 3 nodes run Ubuntu 20.04.
Two nodes are running mysql Ver 8.0.27-18.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3), but my 3rd node, recently added, is running mysql Ver 8.0.29-21.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel21, Revision 250bc93, WSREP version 26.4.3).
I recently got a crash on node 1, and looking in the error log I see this:
2022-10-18T07:02:52.584476Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 6: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:03:50.554441Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr timed out, no messages seen in PT3S, socket stats: rtt: 1477 rttvar: 554 rto: 204000 lost: 0 last_data_recv: 3360 cwnd: 10 last_queued_since: 3366939103 last_delivered_since: 10806297790912744 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:03:54.055310Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr timed out, no messages seen in PT3S, socket stats: rtt: 1607 rttvar: 605 rto: 204000 lost: 0 last_data_recv: 3492 cwnd: 10 last_queued_since: 3497177238 last_delivered_since: 10806301291783721 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:03:57.555817Z 0 [Note] [MY-000000] [Galera] (b48dc567-8a61, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr timed out, no messages seen in PT3S, socket stats: rtt: 1678 rttvar: 695 rto: 204000 lost: 0 last_data_recv: 3492 cwnd: 10 last_queued_since: 3496236840 last_delivered_since: 10806304792285149 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 (gmcast.peer_timeout)
2022-10-18T07:05:09.036524Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:05:14.038314Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 1: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:07:39.852581Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:08:17.898145Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:08:17.902549Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-10-18T07:09:24.907731Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 4: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
what(): remote_endpoint: Transport endpoint is not connected
2022-10-18T07:13:01.091653Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
07:13:01 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Build ID: 2165eff2f1909b2f032b76b423382ec097755ae3
Server Version: 8.0.27-18.1 Percona XtraDB Cluster (GPL), Release rel18, Revision ac35177, WSREP version 26.4.3, wsrep_26.4.3
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x555fc9542ea1]
/usr/sbin/mysqld(handle_fatal_signal+0x393) [0x555fc8561f63]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f273d398420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f273ca7000b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f273ca4f859]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7f273ce27911]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7f273ce3338c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7f273ce333f7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7f273ce336a9]
/usr/lib/libgalera_smm.so(+0x1e569) [0x7f2730b1a569]
/usr/lib/libgalera_smm.so(+0xa050a) [0x7f2730b9c50a]
/usr/lib/libgalera_smm.so(+0xa20ab) [0x7f2730b9e0ab]
/usr/lib/libgalera_smm.so(+0xa4428) [0x7f2730ba0428]
/usr/lib/libgalera_smm.so(+0xaa9d3) [0x7f2730ba69d3]
/usr/lib/libgalera_smm.so(+0x9b707) [0x7f2730b97707]
/usr/lib/libgalera_smm.so(+0x893a2) [0x7f2730b853a2]
/usr/lib/libgalera_smm.so(+0x193258) [0x7f2730c8f258]
/usr/lib/libgalera_smm.so(+0x1bc5ae) [0x7f2730cb85ae]
/usr/lib/libgalera_smm.so(+0x1bc6d6) [0x7f2730cb86d6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f273d38c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f273cb4c133]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
The process crashed at 2022-10-18T07:13:01.091653Z.
I have no idea why it crashes; the log doesn't say specifically.
I am not aware of anything else using that port at all.
So you suggest telnetting to the port, or running nmap, when it possibly crashes again?
Or e.g. doing lsof -i :4567 to see what's listening on the port?
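For what it's worth, a minimal sketch of those checks (4567 is the default Galera group communication port, and <node_ip> is just a placeholder for one of your nodes):
lsof -i :4567                        # which process is listening on the Galera port
ss -tlnp | grep 4567                 # alternative view via ss
nmap -p 4567,4568,4444 <node_ip>     # probe the Galera/IST/SST ports from another host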
All 3 nodes are installed the same way, though nodes 1 & 2 have been through a huge upgrade path from Ubuntu 16.04 up to 20.04, whereas node 3 is a fresh install of 20.04.
Actually, I think it is related to some kind of port scanner.
Assume I have a 2-node cluster and their Galera ports are 4030 and 5030 respectively (both nodes on the same host; the default is 4567).
Now start some sysbench load on a node.
In parallel, run the port scan:
while true; do nmap -p5030,4030 localhost; done
One of the nodes will end up with the log shown initially.
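As a rough sketch of that reproduction, the sysbench load could look like the following, run in parallel with the nmap loop above (oltp_read_write is a standard sysbench script; the credentials, client port and table sizes here are placeholders for your setup):
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest --tables=4 --table-size=100000 prepare
sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest --tables=4 --table-size=100000 --time=600 run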
We do have Nessus, and it did run a scan yesterday.
I can't see the exact node in my results of scanned hosts, probably because of "Network Congestion Detected".
My issue still persists, and whenever a Nessus scan runs, the services on that node go down. Just now I upgraded my Galera to 4.12 and restarted the scan, hoping that it doesn't go down.
I also tried setting open_files_limit to its maximum and making max_connections = 1000; even that didn't help.
All hope is on the Galera version upgrade (referring to the JIRA link below, which mentions that Nessus scans crash Galera version 26.4.3 and that the fix is in 26.4.12): https://jira.mariadb.org/browse/MDEV-25068
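If it helps anyone, a quick way to confirm which Galera provider a node actually loaded after such an upgrade (these are standard wsrep status/variables; run them with your own credentials):
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider';"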
But I think Nessus impacts the services for some people and not for others because of the template used for the scan; in our scenario Nessus also runs some kind of pen testing as per the CIS benchmark, because of which the services go down at some point.
@Simon_Karberg What do you see in your general logs during scanning?
I have documented it today in another forum, but I am waiting for confirmation from someone on whether this is the best way to do it.
But I am also not sure whether this is the solution, as the scan is still running; I am just sitting with my fingers crossed until the scan completes, watching whether the services crash.
Check my answer where I mentioned nmap. Just plug in your IPs/ports and run the loop from some host outside the cluster (but one with a route to it).
In my opinion, updating the Galera library will not help. Only isolating the port scanner from the cluster will do the job.
So you are saying that I should run a "static" nmap against my MySQL instead of waiting for e.g. Nessus to scan the host(s)?
I don't see what that changes, other than that you control when the scan is running.
I mean that with nmap you can simulate the Nessus scanning behavior and be sure that you scan these particular ports. I assume Nessus does much more and you have no control over when it tries the PXC ports, so the probability of triggering the problem is much lower.
Once you have triggered it, set up your iptables rules and run the test with nmap again. If it is OK, it is highly probable that Nessus will not do any harm either.
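As a rough sketch of such rules, assuming the default Galera ports 4567/4568/4444 and with the cluster node IPs as placeholders for your own:
# allow the other cluster nodes to reach the Galera group communication / IST / SST ports
iptables -A INPUT -p tcp -m multiport --dports 4567,4568,4444 -s 10.0.0.11 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 4567,4568,4444 -s 10.0.0.12 -j ACCEPT
# drop anything else (e.g. the scanner) hitting those ports
iptables -A INPUT -p tcp -m multiport --dports 4567,4568,4444 -j DROP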
Hi @Kamil_Holubicki @Simon_Karberg,
This is to update you: it looks like my issue is fixed by upgrading the Galera version, as I was able to run the scan 2 times now, but I am still keeping an eye on it in case something goes wrong. I will also enable SELinux and run the scan again; however, the scan was failing with or without SELinux.
@Simon_Karberg The reason I had to copy the libgalera_smm.so file from an 8.0.29 environment is that the garbd RPM doesn't provide this file; it only provides /usr/bin/garbd, /etc/sysconfig/garbd and a few other files. The library is provided by the percona-xtradb-cluster-server RPM, so I had to install that RPM on another machine (installing it could disturb the current environment) and copy the file from there to the 8.0.28 environment.
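For reference, one way to pull just that library out of the downloaded server RPM without installing it on a running node (the package filename with <version> is only a placeholder for whatever build you fetched):
rpm -qlp percona-xtradb-cluster-server-<version>.rpm | grep libgalera_smm   # find the library's path inside the package
rpm2cpio percona-xtradb-cluster-server-<version>.rpm | cpio -idmv           # unpack the payload into the current directory
# then copy the extracted libgalera_smm.so over to the target node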
Note: the steps below were not used for the solution.
I also tried installing the Galera-4 RPM from Codership to check what it provides; during its installation, its dependency was the boost-program-options RPM, and if we look at our errors it is the same 'boost::wrapexcept<std::system_error>'. During installation of the garbd RPM from Percona, this dependency was not installed, but I believe it was already packaged in.
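If useful, that dependency difference can be checked directly on the downloaded packages (the package filenames with <version> are placeholders):
rpm -qpR galera-4-<version>.rpm | grep -i boost                        # declared dependencies of the Codership package
rpm -qpR percona-xtradb-cluster-garbd-<version>.rpm | grep -i boost    # declared dependencies of the Percona garbd package
ldd /usr/bin/garbd | grep -i boost                                     # whether the installed binary links boost dynamically at all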