I cannot get an SST using xtrabackup-v2 to work successfully.
I’ve verified that the wsrep_sst_auth username and password are correct by using them to log into MySQL manually.
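For reference, the manual check was roughly the following (the socket path matches my config; exact flags may have differed):

# log in with the wsrep_sst_auth credentials over the local socket
mysql -u sstuser -p -S /var/run/mysqld/mysqld.sock -e 'SHOW GRANTS;'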
I’ve verified that port 4444 is open by using socat to test from both the donor and joiner.
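The connectivity test was along these lines (192.168.2.122 is the joiner):

# on the joiner: listen on the SST port
socat -u TCP-LISTEN:4444,reuseaddr STDOUT
# on the donor: push a test string to the joiner
echo ping | socat -u STDIN TCP:192.168.2.122:4444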
I’ve scoured the web for solutions and haven’t found anything. This is on a production cluster and I need to get the other nodes back online. I can do it with rsync but that takes the primary offline for several hours. Any ideas would be greatly appreciated.
The error I am receiving on the Donor is:
2023-01-11T06:44:55.643495Z 357 [Note] Aborted connection 357 to db: 'unconnected' user: 'sstuser' host: 'localhost' (Got an error reading communication packets)
2023-01-11T06:44:55.653881Z 0 [ERROR] WSREP: Process was aborted.
2023-01-11T06:44:55.653960Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.2.122:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.40-43-57' '' --gtid 'c4d90d61-5613-11e9-b5a9-cba71753d598:155847001' : 2 (No such file or directory)
2023-01-11T06:44:55.654049Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.2.122:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.40-43-57' '' --gtid 'c4d90d61-5613-11e9-b5a9-cba71753d598:155847001'
2023-01-11T06:44:55.654095Z 0 [Warning] WSREP: Could not find peer: f5c7f3b7-917a-11ed-99e4-f77f42862c79
2023-01-11T06:44:55.654117Z 0 [Warning] WSREP: 0.0 (iGradePlus-DB-01): State transfer to -1.-1 (left the group) failed: -2 (No such file or directory)
The error on the joiner side is:
2023-01-11T06:41:26.622845Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST…
2023-01-11T06:41:26.695051Z 0 [Note] [MY-000000] [WSREP-SST] …Waiting for SST streaming to complete!
2023-01-11T06:43:32.222988Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (33925) with SIGKILL after stalling for 120 seconds
2023-01-11T06:43:32.259587Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 185: 33927 Killed socat -u TCP-LISTEN:4444,reuseaddr,pf=ip6,retry=30 stdio
2023-01-11T06:43:32.259670Z 0 [Note] [MY-000000] [WSREP-SST] 33928 | pigz -d
2023-01-11T06:43:32.259695Z 0 [Note] [MY-000000] [WSREP-SST] 33929 | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2023-01-11T06:43:32.260202Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2023-01-11T06:43:32.260306Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node: exit codes: 137 137 137
2023-01-11T06:43:32.260394Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1316
2023-01-11T06:43:32.260644Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2023-01-11T06:43:32.262787Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2023-01-11T06:43:32.334658Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.2.122' --datadir '/data/mysql/' --basedir '/usr/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '33403' --mysqld-version '8.0.30-22.1' '' : 32 (Broken pipe)
2023-01-11T06:43:32.334811Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2023-01-11T06:43:32.334846Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2023-01-11T06:43:32.334956Z 3 [Note] [MY-000000] [Galera] Processing SST received
2023-01-11T06:43:32.335012Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2023-01-11T06:43:32.335082Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2023-01-11T06:43:32.335114Z 3 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2023-01-11T06:43:32.335137Z 3 [Note] [MY-000000] [Galera] Closing send monitor…
2023-01-11T06:43:32.335160Z 3 [Note] [MY-000000] [Galera] Closed send monitor
It would be good if you used the same version on all nodes and then posted the error, so we can continue checking the problem.
Troubleshooting across different versions is itself a problem: your donor log shows 5.7.40 while the joiner is running 8.0.30.
I would suggest keeping all the nodes on either 8.0 or 5.7.
If there is no active parallel workload, or the cluster has a read-only workload while upgrading the nodes, complete the following procedure for each node in the cluster (a rough command-level sketch follows the list):
Shut down one of the 5.7 cluster nodes.
Remove the 5.7 PXC packages without removing the data directory.
Install the PXC 8.0 packages.
Restart the mysqld service.
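A rough sketch of those steps on an apt-based node (package names and service name are assumptions, adjust for your distribution and repository setup):

# shut the 5.7 node down cleanly
systemctl stop mysql
# remove the PXC 5.7 packages; the data directory (/data/mysql in your case) is left in place
apt remove percona-xtradb-cluster-57
# enable the PXC 8.0 repository with percona-release (or your own mirror), then install PXC 8.0
apt install percona-xtradb-cluster
# start mysqld; it should detect the 5.7 data directory, upgrade it, and then rejoin the cluster
systemctl start mysql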
This upgrade flow auto-detects the presence of the 5.7 data directory and triggers the upgrade as part of the node bootup process. The data directory is upgraded to be compatible with PXC 8.0. The node then joins the cluster and reaches the Synced state. The three-node cluster is restored with two nodes running PXC 5.7 and one node running PXC 8.0.
That is the scenario for an already existing, working cluster that you want to upgrade. In general it will work, but it makes your situation more complex.
Since you are stuck at the installation stage itself, it would be better to try the latest version on every node and continue working with that.
That will be simpler to understand and work with than a mixed-version setup.
You can try setting it up with the same version everywhere, post the error, and we can take it from there.
Can you please post the full donor logs as well now?
Also, see whether the donor node’s data directory contains the file innobackup.backup.log, and check whether there are any errors in it.
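For example, something like this on the donor (the path assumes the datadir from your logs):

# quick scan of the xtrabackup log for failures
grep -iE 'error|fatal' /data/mysql/innobackup.backup.log | tail -n 20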
It still seems to be a connection problem.
I can’t at this point. I’m down to one node in my cluster, and the last time I tried to join a node with xtrabackup-v2 as the method, the primary got stuck in the donor state indefinitely after the joiner shut down due to the error. A restart of the primary from that state caused a full crash recovery, which took my service offline for 3+ hours. Not good.
I can, however, tell you that innobackup.backup.log had no errors in it and the donor log produced no useful information.
Here’s what I did discover, so maybe this will shed some light on the matter:
I used two of the down nodes to form a separate cluster for testing, using a much smaller amount of data. SST worked fine using the xtrabackup method in this case. So the problem isn’t a communication problem between the nodes.
My production database is multi-tenant and has about 8,000 individual databases. I don’t know what xtrabackup is doing when SST starts, but I suspect all those individual databases are causing the SST process to time out. I’ve considered increasing the sst-idle-timeout to something very large, but it’s not clear from the documentation whether this parameter is playing a role in the problem, and I can’t risk just guessing and crashing the server again.
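For the record, what I had in mind, but have not applied, is something like the following on the joiner. From what I can tell the setting belongs in the [sst] section of my.cnf, and the 120 seconds in the joiner log looks like its default:

# NOT applied yet: raise the joiner-side SST inactivity timeout
cat >> /etc/mysql/my.cnf <<'EOF'
[sst]
sst-idle-timeout=3600
EOF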
Just as a note, SST works fine with rsync. I just can’t afford to have my production server offline for the 4 hours it takes to complete. Several crash recoveries due to the hung xtrabackup SST have already caused my service to go offline multiple times in the last week.