Cannot Get SST Xtrabackup Transfer to Work

I cannot get SST transfer using xtrabackup-v2 to work successfully.

I’ve verified that the wsrep_sst_auth username and password are correct and used them to manually log into mysql.
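
For reference, the credential check was roughly the following (the sstuser name and the password shown here are placeholders standing in for my real wsrep_sst_auth values):

  # /etc/mysql/my.cnf, [mysqld] section (placeholder password):
  #   wsrep_sst_auth = sstuser:s3cretpass
  # manual login test over the same socket the SST script uses:
  mysql -u sstuser -p -S /var/run/mysqld/mysqld.sock -e "SHOW GRANTS FOR CURRENT_USER();"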

I’ve verified that port 4444 is open by using socat to test from both the donor and joiner.
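
Roughly, that test looked like this (192.168.2.122 is the joiner from the logs below; the exact socat options the SST script uses may differ):

  # on the joiner: listen on the SST port
  socat -u TCP-LISTEN:4444,reuseaddr stdio

  # on the donor: push a test string to the joiner
  echo "sst-port-test" | socat -u stdio TCP:192.168.2.122:4444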

I’ve scoured the web for solutions and haven’t found anything. This is on a production cluster and I need to get the other nodes back online. I can do it with rsync but that takes the primary offline for several hours. Any ideas would be greatly appreciated.

The error I am receiving on the Donor is:

2023-01-11T06:44:55.643495Z 357 [Note] Aborted connection 357 to db: 'unconnected' user: 'sstuser' host: 'localhost' (Got an error reading communication packets)
2023-01-11T06:44:55.653881Z 0 [ERROR] WSREP: Process was aborted.
2023-01-11T06:44:55.653960Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.2.122:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.40-43-57' '' --gtid 'c4d90d61-5613-11e9-b5a9-cba71753d598:155847001' : 2 (No such file or directory)
2023-01-11T06:44:55.654049Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '192.168.2.122:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.40-43-57' '' --gtid 'c4d90d61-5613-11e9-b5a9-cba71753d598:155847001'
2023-01-11T06:44:55.654095Z 0 [Warning] WSREP: Could not find peer: f5c7f3b7-917a-11ed-99e4-f77f42862c79
2023-01-11T06:44:55.654117Z 0 [Warning] WSREP: 0.0 (iGradePlus-DB-01): State transfer to -1.-1 (left the group) failed: -2 (No such file or directory)

The error on the joiner side is:

2023-01-11T06:41:26.622845Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST...
2023-01-11T06:41:26.695051Z 0 [Note] [MY-000000] [WSREP-SST] ...Waiting for SST streaming to complete!
2023-01-11T06:43:32.222988Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (33925) with SIGKILL after stalling for 120 seconds
2023-01-11T06:43:32.259587Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 185: 33927 Killed socat -u TCP-LISTEN:4444,reuseaddr,pf=ip6,retry=30 stdio
2023-01-11T06:43:32.259670Z 0 [Note] [MY-000000] [WSREP-SST] 33928 | pigz -d
2023-01-11T06:43:32.259695Z 0 [Note] [MY-000000] [WSREP-SST] 33929 | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2023-01-11T06:43:32.260202Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2023-01-11T06:43:32.260306Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node: exit codes: 137 137 137
2023-01-11T06:43:32.260394Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1316
2023-01-11T06:43:32.260644Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2023-01-11T06:43:32.262787Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2023-01-11T06:43:32.334658Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.2.122' --datadir '/data/mysql/' --basedir '/usr/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '33403' --mysqld-version '8.0.30-22.1' '' : 32 (Broken pipe)
2023-01-11T06:43:32.334811Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2023-01-11T06:43:32.334846Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2023-01-11T06:43:32.334956Z 3 [Note] [MY-000000] [Galera] Processing SST received
2023-01-11T06:43:32.335012Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2023-01-11T06:43:32.335082Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2023-01-11T06:43:32.335114Z 3 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2023-01-11T06:43:32.335137Z 3 [Note] [MY-000000] [Galera] Closing send monitor...
2023-01-11T06:43:32.335160Z 3 [Note] [MY-000000] [Galera] Closed send monitor


Do you by any chance have AppArmor or SELinux enabled?
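
If you want to double-check, something like this shows whether either one is active (tool availability depends on the distribution):

  # AppArmor (Debian/Ubuntu; aa-status ships with the AppArmor utilities)
  sudo aa-status

  # SELinux (RHEL/CentOS and derivatives)
  getenforce
  sestatus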


It seems you are using different mysqld versions between the PXC nodes, which is not supported.

Donor: --mysqld-version '5.7.40-43-57'

Joiner: --mysqld-version '8.0.30-22.1'

You need to have the same database version between the PXC nodes.
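
A quick way to confirm what each node is actually running, for example:

  # on every node
  mysqld --version
  mysql -e "SELECT @@version, @@version_comment;"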


Yes, and the point Abhinav mentioned is true as well. Good catch 🙂


Thanks, but AppArmor/SELinux is also disabled.


Thanks. Yes, after not having any luck with the 5.7 nodes, I upgraded one of them.

But I get the exact same problem with my 5.7 nodes.


It would be good if you used the same version on all nodes and then posted the error, so we can continue checking the problem.
Troubleshooting across different versions is itself a problem.
I would suggest keeping either 8.0 or 5.7 on all the nodes.


What has me confused is the 8.0 documentation explicitly says you can upgrade a cluster from 5.7 to 8.0 with mixed nodes:

Scenario: No active parallel workload or with read-only workload

(Upgrading Percona XtraDB Cluster - Percona XtraDB Cluster)

If there is no active parallel workload or the cluster has read-only workload while upgrading the nodes, complete the following procedure for each node in the cluster:

  1. Shut down one of the 5.7 cluster nodes.
  2. Remove 5.7 PXC packages without removing the data-directory.
  3. Install PXC 8.0 packages.
  4. Restart the mysqld service.

This upgrade flow auto-detects the presence of the 5.7 data directory and triggers the upgrade as part of the node bootup process. The data directory is upgraded to be compatible with PXC 8.0. Then the node joins the cluster and enters the synced state. The 3-node cluster is restored with 2 nodes running PXC 5.7 and 1 node running PXC 8.0.
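
As a rough sketch, those four steps on a Debian/Ubuntu node would look something like this (the package names are my assumption and depend on how the Percona repository is set up, so check your own package list first):

  sudo systemctl stop mysql                              # 1. shut down the 5.7 node
  sudo apt remove percona-xtradb-cluster-server-5.7      # 2. remove 5.7 packages; the data directory stays in place
  sudo apt install percona-xtradb-cluster                # 3. install the PXC 8.0 packages (assumed meta-package name)
  sudo systemctl start mysql                             # 4. restart mysqld; the data directory upgrade runs at startup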

That is the scenario for an already existing, working cluster that you want to upgrade. In general it will work, but it makes your situation more complex.
Since you are stuck at the installation stage itself, it would be better to try with the latest version everywhere and continue working on that.

That will be simpler to understand and work with than a mixed setup.
You can try setting it up with the same version on all nodes and post us the error, and we can take it from there.

Here's the error produced by the joiner when running the same version:

2023-01-13T12:28:32.557580Z 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2023-01-13T12:28:32.557613Z 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> c4d90d61-5613-11e9-b5a9-cba71753d598:156254818
2023-01-13T12:28:33.573050Z 0 [Note] WSREP: (c9777403, 'tcp://0.0.0.0:4567') turning message relay requesting off
2023-01-13T12:28:42.167996Z WSREP_SST: [INFO] Streaming with xbstream
2023-01-13T12:28:42.186704Z WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /data/mysql//.sst from previous state transfer. Removing
2023-01-13T12:28:42.199076Z WSREP_SST: [INFO] Proceeding with SST...
2023-01-13T12:28:42.228350Z WSREP_SST: [INFO] ...Waiting for SST streaming to complete!
2023-01-13T12:30:47.356204Z WSREP_SST: [ERROR] Killing SST (24375) with SIGKILL after stalling for 120 seconds
/usr/bin/wsrep_sst_xtrabackup-v2: line 195: 24377 Killed socat -u TCP-LISTEN:4444,reuseaddr,retry=30 stdio
24378 | xbstream -x $xbstream_eopts
2023-01-13T12:30:47.383794Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2023-01-13T12:30:47.387630Z WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 137 137
2023-01-13T12:30:47.392131Z WSREP_SST: [ERROR] ******************************************************
2023-01-13T12:30:47.398228Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2023-01-13T12:30:47.437868Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.2.121' --datadir '/data/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '23793' --mysqld-version '5.7.40-43-57' '' : 32 (Broken pipe)
2023-01-13T12:30:47.437936Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2023-01-13T12:30:47.437957Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2023-01-13T12:30:47.438060Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2023-01-13T12:30:47.438103Z 0 [ERROR] Aborting


Can you please post us the donor logs as well now?
Also, see if you can find the file innobackup.backup.log in the donor node's data directory and check whether there are any errors in it.
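
For example, something like this on the donor, using the data directory from your logs:

  grep -i -E "error|fatal" /data/mysql/innobackup.backup.log
  tail -n 50 /data/mysql/innobackup.backup.log
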
It still seems to be a connection problem.


I can't at this point. I'm down to one node in my cluster, and the last time I tried to join a node with xtrabackup-v2 as the method, the primary got stuck in the donor state indefinitely after the joiner shut down due to the error. A restart of the primary from this state caused a full crash recovery, which took my service offline for 3+ hours. Not good.

I can however tell you that the innobackup.backup.log had no errors in it and the donor log produced no useful information.

Here’s what I did discover, so maybe this will shed some light on the matter:

I used two of the down nodes to form a separate cluster for testing, using a much smaller amount of data. SST worked fine using the xtrabackup method in this case. So the problem isn’t a communication problem between the nodes.

My production database is multi-tenant and has about 8000 individual databases. I don't know what xtrabackup is doing when SST starts, but I suspect all those individual databases are causing the SST process to time out. I've considered increasing sst-idle-timeout to something very large, but it's not clear from the documentation whether this parameter is playing a role in the problem, and I can't risk just guessing and crashing the server again.
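
For what it's worth, the change I was considering is just this in my.cnf on the donor and joiner, where 900 is an arbitrary value picked only for illustration and untested on my side (the default of 120 seconds matches the "stalling for 120 seconds" message in the logs):

  [sst]
  sst-idle-timeout=900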

Just as a note, SST works fine with rsync. I just can't afford to have my production server offline for the 4 hours it takes to complete. Several crash recoveries caused by the hung xtrabackup SST have already taken my service offline multiple times in the last week.
