Hi. Back again with more issues on upgrading from Percona XtraDB cluster 5.7 to 8.x, Proxysql 1.x to Proxysql 2.x, and Galera 3.x to Galera 4.x.
After upgrading percona xtradb cluster on our 3 nodes, I was able to at least get a somewhat working state. I had to rsync the data directory from one node to another to get it to sync/join the cluster. I thought that was fine, now we’re good. But after upgrading to 2.x, which I had to do because mysql 8 requires it, and reconfiguring proxysql as per this guide: Galera Configuration - ProxySQL I have a somewhat frequent issue where a node will go down and be unable to resync with the cluster when mysqld service is restarted. I then have to rsync the data directory from a working node, which works but is super annoying.
this is the relevant part of the logs
2022-09-19T17:31:09.629157Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'IP.62' --datadir '/mnt/data/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '14665' --mysqld-version '8.0.28-19.1' '' : 2 (No such file or directory)
2022-09-19T17:31:09.630222Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2022-09-19T17:31:09.630319Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 2 (No such file or directory)
2022-09-19T17:31:09.630571Z 3 [Note] [MY-000000] [Galera] Processing SST received
2022-09-19T17:31:09.630681Z 3 [Note] [MY-000000] [Galera] SST received: 00000000-0000-0000-0000-000000000000:-1
2022-09-19T17:31:09.630774Z 3 [System] [MY-000000] [WSREP] SST completed
2022-09-19T17:31:09.631283Z 2 [Note] [MY-000000] [Galera] str_proto_ver_: 3 sst_seqno_: -1 cc_seqno: 11760209 req->ist_len(): 74
2022-09-19T17:31:09.631404Z 2 [ERROR] [MY-000000] [Galera] Application received wrong state:
Received: 00000000-0000-0000-0000-000000000000
Required: b392a4b7-a3c8-11e7-b022-632a7cf1c510
The proxysql 2 configuration/runtime does NOT match the guide (link above), and I’m not sure why or how to fix it.
I’m not sure how 2 nodes can be shunned but also online. I’m also not 100% positive this is leading to the SST/galera sync issue, but these are definitely not the behavior described in the guides.
Please let me know if you need more info. Our mysql config is essentially the same as what I’ve posted before:
ProxySQL and SSTs have nothing to do with each other. These are two different issues. Your SST is failing for some reason. What you provided above is incomplete picture of the logs so it is difficult to determine the cause. Please ensure that ports 3306, 4444, 4567, and 4568 are open between all nodes. I suggest you set pxc_encrypt_cluster_traffic=OFF while working through this in order to rule out SSL as yet another point of failure/trouble.
Shut down your cluster. All nodes. Then bootstrap the first. Ensure it is online and that proxysql sees it online and that you can read/write through proxysql connection.
Then, after you do all that above, start node2. Let it IST/SST and sync. If this fails, you have a config issue somewhere that is preventing the SST/IST process. If this completes, check proxysql. Ensure everything is ONLINE. If not, you have a proxysql config issue. Check SELECT hostname, connect_error FROM mysql_server_connect_log ORDER BY time_start_us DESC LIMIT 3 and see if there’s errors connecting to the node.
So regarding the ports and encryption, we already have verified/done that.
I didn’t even get to do your tests yet when one of the nodes went down. It didn’t bring the cluster down at least. Here’s the log:
2022-09-20T14:07:15.769340Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 6: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:08:03.060248Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:08:08.078213Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 1: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:08:25.309815Z 0 [Warning] [MY-000000] [Galera] checksum failed, hdr: len=1 has_crc32=0 has_crc32c=0 crc32=1
2022-09-20T14:08:44.658716Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:08:54.056367Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:08:54.061281Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 2: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
2022-09-20T14:09:14.687285Z 0 [Warning] [MY-000000] [Galera] unserialize error invalid protocol version 4: 71 (Protocol error)
at gcomm/src/gcomm/datagram.hpp:unserialize():133
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::system_error> >'
what(): remote_endpoint: Transport endpoint is not connected
2022-09-20T14:20:37.212537Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation
14:20:37 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Build ID: 197cca034159ea848cfc7c45f97087bb0d9c0428
Server Version: 8.0.28-19.1 Percona XtraDB Cluster (GPL), Release rel19, Revision f544540, WSREP version 26.4.3, wsrep_26.4.3
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x20ef01d]
/usr/sbin/mysqld(print_fatal_signal(int)+0x323) [0x1182a03]
/usr/sbin/mysqld(handle_fatal_signal+0xc0) [0x1182ad0]
/lib64/libpthread.so.0(+0xf630) [0x7fa95ee10630]
/lib64/libc.so.6(gsignal+0x37) [0x7fa95d0fb387]
/lib64/libc.so.6(abort+0x148) [0x7fa95d0fca78]
/lib64/libstdc++.so.6(__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fa95da0ba95]
/lib64/libstdc++.so.6(+0x5ea06) [0x7fa95da09a06]
/lib64/libstdc++.so.6(+0x5ea33) [0x7fa95da09a33]
/lib64/libstdc++.so.6(+0x5ec53) [0x7fa95da09c53]
/usr/lib64/galera4/libgalera_smm.so(+0x1dbea) [0x7fa94d6d1bea]
/usr/lib64/galera4/libgalera_smm.so(+0x94748) [0x7fa94d748748]
/usr/lib64/galera4/libgalera_smm.so(+0xac341) [0x7fa94d760341]
/usr/lib64/galera4/libgalera_smm.so(+0xa3dab) [0x7fa94d757dab]
/usr/lib64/galera4/libgalera_smm.so(+0xa703a) [0x7fa94d75b03a]
/usr/lib64/galera4/libgalera_smm.so(+0xae7ef) [0x7fa94d7627ef]
/usr/lib64/galera4/libgalera_smm.so(+0x8c8d0) [0x7fa94d7408d0]
/usr/lib64/galera4/libgalera_smm.so(+0x1c64ee) [0x7fa94d87a4ee]
/usr/lib64/galera4/libgalera_smm.so(+0x1c6612) [0x7fa94d87a612]
/lib64/libpthread.so.0(+0x7ea5) [0x7fa95ee08ea5]
/lib64/libc.so.6(clone+0x6d) [0x7fa95d1c3b0d]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
Can you disable proxysql completely (shut it down. leave it offline.) and start up your cluster without it? Run your cluster without proxysql for a bit and see if that fixes your issue. Also, all 3 of your nodes are 8.0.28? You have something strange configured and we just need to isolate what that is. Are you running queries when it crashes? Are all 3 online then suddenly one crashes, then the others?
How can we have a working/functional cluster without proxysql running? We have a separate server it runs on that receives requests. How would we make db queries if it’s not running?
We were on 8.0.28. Recently upgraded to 8.0.29-21.1 at the end of last week.
As far as I can tell no one is running queries when the nodes decide to go down, and they tend to crash within an hour of each other, with same general output to logs.
Last crash times:
Tue 2022-09-27 10:46:13
Tue 2022-09-27 10:47:26
Tue 2022-09-27 11:09:55
(Crashes happen every few days to a week, sometimes just one node goes down, other times all 3)
Configure your application to connect directly to one of the nodes. If all works fine after doing that, then you know ProxySQL is acting funny and causing some issue that needs further investigation. If you still get crashes, then we need to keep digging.
Have you looked at proxysql’s error logs around the same time as the crashes?