Xtrabackup: Error writing file '<unopen fd>' (OS errno 32 - Broken pipe) xb_stream_write_data() failed

I have a 3 nodes MySQL (8.0) Galera Cluster.

After a disaster on my server, my cluster now has only two nodes online (DB01 and DB03), one node (DB02) now lost all data files in the data folder. I tried to start the node (with specified wsrep_sst_donor), and it triggers the SST process, but it failed.

This is the log of down node (DB02):

2023-12-30T22:53:27.191267Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 23,
        members    = 2/3 (primary/total),
        act_id     = 74756843,
        last_appl. = 74756765,
        protocols  = 2/10/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = df948d8d-5994-11ec-91b5-ff5ae1e8c4ba
2023-12-30T22:53:27.191610Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2023-12-30T22:53:27.191886Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> PRIMARY (TO: 74756844)
2023-12-30T22:53:27.192242Z 2 [Note] [MY-000000] [Galera] ####### processing CC 74756844, local, ordered
2023-12-30T22:53:27.192577Z 2 [Note] [MY-000000] [Galera] Maybe drain monitors from -1 upto current CC event 74756844 upto:-1
2023-12-30T22:53:27.192862Z 2 [Note] [MY-000000] [Galera] Drain monitors from -1 up to -1
2023-12-30T22:53:27.193155Z 2 [Note] [MY-000000] [Galera] Process first view: df948d8d-5994-11ec-91b5-ff5ae1e8c4ba my uuid: 3de2d256-a766-11ee-851e-0327836c7f5c
2023-12-30T22:53:27.193461Z 2 [Note] [MY-000000] [Galera] Server 10.44.161.186 connected to cluster at position df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:74756844 with ID 3de2d256-a766-11ee-851e-0327836c7f5c
2023-12-30T22:53:27.193820Z 2 [Note] [MY-000000] [WSREP] Server status change disconnected -> connected
2023-12-30T22:53:27.194171Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2023-12-30T22:53:27.194502Z 2 [Note] [MY-000000] [Galera] ####### My UUID: 3de2d256-a766-11ee-851e-0327836c7f5c
2023-12-30T22:53:27.194852Z 2 [Note] [MY-000000] [Galera] Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2023-12-30T22:53:27.195228Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2023-12-30T22:53:27.195640Z 2 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
2023-12-30T22:53:27.195960Z 2 [Note] [MY-000000] [Galera] State transfer required: 
        Group state: df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:74756844
        Local state: 00000000-0000-0000-0000-000000000000:-1
2023-12-30T22:53:27.196291Z 2 [Note] [MY-000000] [WSREP] Server status change connected -> joiner
2023-12-30T22:53:27.196625Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2023-12-30T22:53:27.197078Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.44.161.186' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '28944' --mysqld-version '8.0.25-15.1'  --binlog 'binlog' )
2023-12-30T22:53:27.955900Z 2 [Note] [MY-000000] [WSREP] Prepared SST request: xtrabackup-v2|10.44.161.186:4444/xtrabackup_sst//1
2023-12-30T22:53:27.956382Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST
2023-12-30T22:53:27.956747Z 2 [Note] [MY-000000] [Galera] Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: df948d8d-5994-11ec-91b5-ff5ae1e8c4ba
2023-12-30T22:53:27.957081Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 74756844, STRv: 3
2023-12-30T22:53:27.957514Z 2 [Note] [MY-000000] [Galera] IST receiver addr using ssl://10.44.161.186:4568
2023-12-30T22:53:27.957930Z 2 [Note] [MY-000000] [Galera] IST receiver using ssl
2023-12-30T22:53:27.958733Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 0-74756844, listening at: ssl://10.44.161.186:4568
2023-12-30T22:53:27.959895Z 0 [Note] [MY-000000] [Galera] Member 0.0 (10.44.161.186) requested state transfer from '10.44.161.63'. Selected 1.0 (10.44.161.63)(SYNCED) as donor.
2023-12-30T22:53:27.960299Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY -> JOINER (TO: 74756844)
2023-12-30T22:53:27.960749Z 2 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 1
2023-12-30T22:53:27.961130Z 2 [Note] [MY-000000] [Galera] Resetting GCache seqno map due to different histories.
2023-12-30T22:53:27.961512Z 2 [Note] [MY-000000] [Galera] GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:74756844
2023-12-30T22:53:28.700385Z 0 [Note] [MY-000000] [WSREP-SST] Proceeding with SST.........
2023-12-30T22:53:28.725787Z 0 [Note] [MY-000000] [WSREP-SST] ............Waiting for SST streaming to complete!
2023-12-30T22:53:29.687830Z 0 [Note] [MY-000000] [Galera] (3de2d256-851e, 'ssl://0.0.0.0:4567') turning message relay requesting off
2023-12-30T23:12:51.151208Z 0 [ERROR] [MY-000000] [WSREP-SST] Killing SST (29492) with SIGKILL after stalling for 120 seconds
2023-12-30T23:12:51.175953Z 0 [Note] [MY-000000] [WSREP-SST] /usr/bin/wsrep_sst_xtrabackup-v2: line 185: 29494 Killed                  socat -u openssl-listen:4444,reuseaddr,cert=/etc/mysql/certs/server.crt,key=/etc/mysql/certs/server.key,cafile=/etc/mysql/certs/server_ca.crt,verify=1,retry=30 stdio
2023-12-30T23:12:51.176015Z 0 [Note] [MY-000000] [WSREP-SST]      29495                       | /usr/bin/pxc_extra/pxb-8.0/bin/xbstream -x
2023-12-30T23:12:51.176289Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR ********************** 
2023-12-30T23:12:51.176359Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node:  exit codes: 137 137
2023-12-30T23:12:51.176411Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1296
2023-12-30T23:12:51.176461Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2023-12-30T23:12:51.176922Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2023-12-30T23:12:51.247447Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.44.161.186' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '28944' --mysqld-version '8.0.25-15.1'  --binlog 'binlog' : 32 (Broken pipe)
2023-12-30T23:12:51.247559Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2023-12-30T23:12:51.247632Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2023-12-30T23:12:51.247938Z 3 [Note] [MY-000000] [Galera] Processing SST received
2023-12-30T23:12:51.248002Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2023-12-30T23:12:51.248062Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2023-12-30T23:12:51.248106Z 3 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2023-12-30T23:12:51.248140Z 3 [Note] [MY-000000] [Galera] Closing send monitor...
2023-12-30T23:12:51.248179Z 3 [Note] [MY-000000] [Galera] Closed send monitor.
2023-12-30T23:12:51.248210Z 3 [Note] [MY-000000] [Galera] gcomm: terminating thread
2023-12-30T23:12:51.248249Z 3 [Note] [MY-000000] [Galera] gcomm: joining thread
2023-12-30T23:12:51.248463Z 3 [Note] [MY-000000] [Galera] gcomm: closing backend
2023-12-30T23:12:51.249436Z 3 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node

And this is the log of DONOR node:

        231231 07:12:46 >> log scanned up to (348894019774)
        231231 07:12:47 >> log scanned up to (348894020705)
        231231 07:12:48 >> log scanned up to (348894021839)
        231231 07:12:49 >> log scanned up to (348894022973)
        231231 07:12:50 >> log scanned up to (348894029347)
        231231 07:12:51 >> log scanned up to (348894030604)
        xtrabackup: Error writing file '<unopen fd>' (OS errno 32 - Broken pipe)
        xb_stream_write_data() failed.
        xtrabackup: Error writing file '<unopen fd>' (OS errno 32 - Broken pipe)
        [03] xtrabackup: Error: xtrabackup_copy_datafile() failed.
        [03] xtrabackup: Error: failed to copy datafile.

2023-12-30T23:12:51.296464Z 0 [ERROR] [MY-000000] [WSREP-SST] ------------ innobackup.backup.log (END) ------------
2023-12-30T23:12:51.296473Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2023-12-30T23:12:51.296507Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:22
2023-12-30T23:12:51.325538Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.44.161.186:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '8.0.25-15.1'  --binlog 'binlog' --gtid 'df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:74756844' : 22 (Invalid argument)
2023-12-30T23:12:51.333491Z 0 [Note] [MY-000000] [Galera] SST sending failed: -22
2023-12-30T23:12:51.333528Z 0 [Note] [MY-000000] [WSREP] Server status change donor -> joined
2023-12-30T23:12:51.333550Z 0 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2023-12-30T23:12:51.333607Z 0 [ERROR] [MY-000000] [WSREP] Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.44.161.186:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib64/mysql/plugin/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '8.0.25-15.1'  --binlog 'binlog' --gtid 'df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:74756844' 
2023-12-30T23:12:51.333968Z 0 [Warning] [MY-000000] [Galera] Could not find peer: 3de2d256-a766-11ee-851e-0327836c7f5c
2023-12-30T23:12:51.333996Z 0 [Warning] [MY-000000] [Galera] 0.0 (10.44.161.63): State transfer to -1.-1 (left the group) failed: -22 (Invalid argument)
2023-12-30T23:12:51.334018Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 74758908)
2023-12-30T23:12:51.334404Z 0 [Note] [MY-000000] [Galera] Member 0.0 (10.44.161.63) synced with group.
2023-12-30T23:12:51.334421Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 74758908)
2023-12-30T23:12:51.334469Z 14 [Note] [MY-000000] [Galera] Server 10.44.161.63 synced with group
2023-12-30T23:12:51.334514Z 14 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2023-12-30T23:12:51.334527Z 14 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2023-12-30T23:12:51.334557Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2023-12-30T23:12:56.339184Z 0 [Note] [MY-000000] [Galera]  cleaning up 3de2d256-851e (ssl://10.44.161.186:4567)

This is the mysqld status in DB01 and DB03 still working.

[root@db01 ~]# systemctl status mysqld
mysql.service - Percona XtraDB Cluster
   Loaded: loaded (/usr/lib/systemd/system/mysql.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mysql.service.d
           └─override.conf
   Active: active (running) since Wed 2024-01-03 15:53:06 +08; 16min ago
  Process: 18927 ExecStopPost=/usr/bin/mysql-systemd stop-post (code=exited, status=0/SUCCESS)
  Process: 21685 ExecStop=/usr/bin/mysql-systemd stop (code=exited, status=0/SUCCESS)
  Process: 32300 ExecStartPost=/usr/bin/mysql-systemd start-post $MAINPID (code=exited, status=0/SUCCESS)
  Process: 32297 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 29922 ExecStartPre=/bin/sh -c VAR=`bash /usr/bin/mysql-systemd galera-recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
  Process: 29919 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 29891 ExecStartPre=/usr/bin/mysql-systemd start-pre (code=exited, status=0/SUCCESS)
 Main PID: 29960 (mysqld)
   Status: "Server is operational"
   CGroup: /system.slice/mysql.service
           └─29960 /usr/sbin/mysqld --wsrep_start_position=df948d8d-5994-11ec-91b5-ff5ae1e8c4ba:75099198

I tried many times, but it keeps the same error.

@hung_le, are you perhaps out of disk space on the JOINER? It looks like the backup progresses for about 18 minutes then halts due to no progress.

Hi @matthewb , disk is still enough (more than 50gb).

MySQL should not be running on the JOINER. MySQL should be running on both DONORs.

Sorry, I updated the my post, DB01 and DB03 are still running. My DB02 seems lost all data in /var/lib/mysql, once I start DB02, it triggers SST process, but failed at the end, like the log.

Even I tried to create new host then join the cluster, it still failed. @matthewb do you need any log ? I can collect.

@hung_le, I would start doing some diagnostics. Try taking a backup of DB3 manually and sending it over to DB2 via socat (just like the SST process). If that also fails, the problem is deeper than MySQL.

@matthewb , can I copy all data in /var/lib/mysql of DB3 by using scp to DB2 ? Then just start DB2 right after ?

No, that’s not how you take backups of a MySQL server. You need to use a tool like Percona xtrabackup to make a physical backup of the MySQL. As that backup is happening, you need to stream it using xbstream + socat to db3. This is the same procedure that SST uses. If you do it manually and it works, then we know the issue is within the SST process. If the manual fails, then there’s some other issue with networking/OS.

Can I generate backup using Percona xtrabackup on DB03 then scp data fo DB02 (the down node) ? I will use Compressed backup then scp it to DB02, Compressed Backup - Percona XtraBackup.

I would not as the SST process does not use scp; the SST process uses xbstream + socat. By using scp, you would not be recreating the issue. You can do it with scp just to bring the node back online, but you might be in this same situation again.

Could you explain me how to use xbstream + socat ?

Can I use this command to do that: ssh user@desthost “( nc -l 9999 > /data/backups/backup.xbstream & )” && xtrabackup --backup --stream=xbstream ./ | nc desthost 9999 ?

  1. If use that command, there is an error: xtrabackup: Error: unknown argument: ‘./’
  2. If I use that command without ‘./’, it errors:
xtrabackup version 8.0.26-18 based on MySQL server 8.0.26 Linux (x86_64) (revision id: 4aecf82)
240104 16:08:45  version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'zcon'  (using password: YES).
240104 16:08:45  version_check Connected to MySQL server
240104 16:08:45  version_check Executing a version check against the server...
240104 16:08:45  version_check Done.
240104 16:08:45 Connecting to MySQL server host: localhost, user: zcon, password: set, port: not set, socket: /var/lib/mysql/mysql.sock
Using server version 8.0.25-15.1
240104 16:08:45 Executing LOCK TABLES FOR BACKUP...
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 0, set to 1024
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = .
xtrabackup:   innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 536870912
xtrabackup: using O_DIRECT
Number of pools: 1
xtrabackup: inititialize_service_handles suceeded
240104 16:08:45 Connecting to MySQL server host: localhost, user: zcon, password: set, port: not set, socket: /var/lib/mysql/mysql.sock
xtrabackup: Redo Log Archiving is not set up.
WARNING: unknown option --ssl=0
Starting to parse redo log at lsn = 350607168072
Recovery parsing buffer extended to 4194304.
Recovery parsing buffer extended to 8388608.
Recovery parsing buffer extended to 16777216.
xtrabackup: Error writing file '<unopen fd>' (OS errno 32 - Broken pipe)
xb_stream_write_data() failed.
xtrabackup: Error: write to logfile failed
  1. One more question, after running backup and restore, I think DB02 still does not have grastate.dat, so after starting DB02, it would trigger SST again and can lead to empty the folder /var/lib/mysql again

You cannot use 1 command. You must setup socat to listen on db2 before you start the backup on db1. nc is a very old program. socat is the modern replacement.

db2:

cd /var/lib/mysql
socat - TCP-LISTEN:3306 | xbstream -vx

db1:

xtrabackup --backup --parallel 4 --galera-info --stream=xbstream | socat - TCP:db2:3306

After the backup stream finishes, you need to prepare the backup on db2, recreate the grastate.dat file using the GTID in xtrabackup_galera_info file.

Thank you, @matthewb I manually run SST, it works fine.

If running it manually works, then there is more to diagnose during the actual SST process.

After a long investigation, I can conclude that there is the network issue on our own system. Thank you @matthewb. Now after fixing network issue, all work well.