This happens on every server within the same hour, eventually taking our entire cluster down.
Here’s the full log from one server, starting from the point where the errors begin:
2022-08-02T14:37:33.586089Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 6: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:38:20.975490Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:38:25.986377Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 1: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:38:43.165347Z 0 [Warning] [MY-000000] [Galera] checksum failed, hdr: len=1 has_crc32=0 has_crc32c=0 crc32=1
2022-08-02T14:38:57.100836Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:38:57.104978Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:39:00.665961Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-08-02T14:39:33.968910Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 4: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::system_error> >'
what(): remote_endpoint: Transport endpoint is not connected
2022-08-02T14:50:43.154269Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
14:50:43 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Build ID: 197cca034159ea848cfc7c45f97087bb0d9c0428
Server Version: 8.0.28-19.1 Percona XtraDB Cluster (GPL), Release rel19, Revision f544540, WSREP version 26.4.3, wsrep_26.4.3
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x20ef01d]
/usr/sbin/mysqld(print_fatal_signal(int)+0x323) [0x1182a03]
/usr/sbin/mysqld(handle_fatal_signal+0xc0) [0x1182ad0]
/lib64/libpthread.so.0(+0xf630) [0x7efd2cba9630]
/lib64/libc.so.6(gsignal+0x37) [0x7efd2ae94387]
/lib64/libc.so.6(abort+0x148) [0x7efd2ae95a78]
/lib64/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7efd2b7a4a95]
/lib64/libstdc++.so.6(+0x5ea06) [0x7efd2b7a2a06]
/lib64/libstdc++.so.6(+0x5ea33) [0x7efd2b7a2a33]
/lib64/libstdc++.so.6(+0x5ec53) [0x7efd2b7a2c53]
/usr/lib64/galera4/libgalera_smm.so(+0x1dbea) [0x7efd1b46abea]
/usr/lib64/galera4/libgalera_smm.so(+0x94748) [0x7efd1b4e1748]
/usr/lib64/galera4/libgalera_smm.so(+0xac341) [0x7efd1b4f9341]
/usr/lib64/galera4/libgalera_smm.so(+0xa3dab) [0x7efd1b4f0dab]
/usr/lib64/galera4/libgalera_smm.so(+0xa703a) [0x7efd1b4f403a]
/usr/lib64/galera4/libgalera_smm.so(+0xae7ef) [0x7efd1b4fb7ef]
/usr/lib64/galera4/libgalera_smm.so(+0x8c8d0) [0x7efd1b4d98d0]
/usr/lib64/galera4/libgalera_smm.so(+0x1c64ee) [0x7efd1b6134ee]
/usr/lib64/galera4/libgalera_smm.so(+0x1c6612) [0x7efd1b613612]
/lib64/libpthread.so.0(+0x7ea5) [0x7efd2cba1ea5]
/lib64/libc.so.6(clone+0x6d) [0x7efd2af5cb0d]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
Our hardware has not changed. I have upgraded Percona XtraDB Cluster from 5.7 to 8.0, and ProxySQL from 1.x to 2.3.2.
These errors started happening after the upgrade to ProxySQL 2.3.2.
I should note there are NO logged errors on the ProxySQL server. (There used to be errors from a scheduler script we ran for 1.x, but I have since removed it.)
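In case the ProxySQL side is relevant: after removing the 1.x scheduler we rely on ProxySQL 2.x's native Galera support via the mysql_galera_hostgroups table. Our setup looks roughly like the sketch below; the hostgroup numbers and thresholds are illustrative placeholders rather than our exact production values.

-- Illustrative ProxySQL 2.x Galera hostgroup definition (run on the ProxySQL admin interface)
-- Hostgroup IDs 10/20/30/40 and max_transactions_behind=100 are example values only.
DELETE FROM mysql_galera_hostgroups;
INSERT INTO mysql_galera_hostgroups
    (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
     active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (10, 20, 30, 40, 1, 1, 1, 100);
-- Apply and persist the server configuration
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

The monitor user only connects to the normal MySQL port (3306) for health checks; nothing on the ProxySQL host is configured to touch the Galera group communication port (4567), as far as I can tell.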
Any idea what we can change in our config to stop this from happening? Thanks