XtraBackup 2.4 Full Backup Taking >24 Hours on 10TB MySQL 5.7 Galera Cluster over NFS (HDD Storage)

Hello,

I am running a production MySQL 5.7 (Galera Cluster, 3 nodes) environment and facing extremely long backup times with XtraBackup (innobackupex 2.4.29). I would like to validate whether this behavior is expected given our architecture or if there are recommended optimizations.

Environment Details

  • MySQL Version: 5.7.44 (Galera)

  • XtraBackup Version: 2.4.29

  • Cluster Size: 3 nodes (wsrep_cluster_status = Primary)

  • Engine: InnoDB (file_per_table=1)

  • Datadir: /DB/mysql

  • Database Size: ~9.7 TB

  • OS: Linux (VMware Virtual Machines)

  • CPU: 24 vCPU

  • RAM: 78 GB (InnoDB Buffer Pool: 40 GB)

Storage Architecture

DB Node :

  • Datadir: Local LVM volume (/DB/mysql)

  • Disk Type: HDD (ROTA=1, virtual disks)

  • Size: ~11 TB

Backup Target:

  • Mounted via NFS v4.2 to /BACKUP

  • NFS Export: /DBBACKUP *(rw,sync,no_root_squash,insecure)

  • Filesystem: XFS on LVM (22 TB total, ~12 TB used)

  • Underlying disks: All HDD (multiple virtual disks, ROTA=1)

  • Network: 10 Gbps

Important: DB node and backup server are both virtual machines and may be on the same underlying storage pool (VMware datastore).

Current Backup Method

Backup is executed from one Galera node using:

innobackupex --user=root --password=XXX --no-timestamp /BACKUP/DBBackup/
innobackupex --apply-log --use-memory=4G /BACKUP/DBBackup/

Backups are written directly to the NFS mount.

Observed Behavior

  • Full backup duration: 24–30+ hours

  • Database is highly active (write-heavy Galera cluster)

  • NFS export is configured with sync

  • Disks on both DB and backup server are HDD (not SSD/NVMe)

Cluster status during backup:

  • wsrep_cluster_status = Primary

  • wsrep_local_state_comment = Synced

  • Flow control mostly OFF

  • One node handles heavy traffic (~280 connections), backup is taken from a lower-traffic node (~10–20 connections)

Constraints / Requirements

  • Production environment (cannot stop writes)

  • Must maintain Galera consistency

  • Prefer not to use aggressive settings that may cause replication lag

  • Need to ensure at least one valid full backup is always available (no destructive rotation)

  • Backup disk is remote NFS storage

Questions

  1. Is >24 hour backup time expected for ~10 TB dataset over NFS (sync) on HDD storage?

  2. Would switching NFS export from sync to async be considered safe/recommended for XtraBackup workloads?

  3. Is it better practice to:

    • Backup to local disk first, then rsync to NFS?

    • Or write directly to NFS in large environments?

  4. Are the following options recommended for large HDD + NFS setups?

    • --parallel

    • --rsync

    • --throttle

    • --compress / --stream

  5. For Galera specifically, is there any official guidance on:

    • Best node selection for backups

    • Avoiding cluster performance impact during 10TB+ full backups

Additional Concern

Because backups take longer than 24 hours, cron jobs can overlap and risk deleting previous valid backups before a new one is fully completed. We are redesigning the rotation logic to be non-destructive and atomic.
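For context, the rotation shape we are testing looks like this sketch (directory names and retention count are simplified placeholders, not our real layout; the actual innobackupex call replaces the comment). `flock` prevents overlapping cron runs, and the atomic `mv` guarantees a finished backup is never visible half-written:

```shell
# rotate_backup DIR — promote the freshest backup atomically and prune,
# keeping the two newest full_* copies. Overlapping cron runs are
# rejected via a non-blocking flock.
rotate_backup() {
    root="$1"
    exec 9>"$root/.rotate.lock"
    flock -n 9 || { echo "previous backup still running" >&2; return 1; }

    stamp=$(date +%F_%H%M%S)
    work="$root/.incoming_$stamp"          # hidden while in progress
    mkdir -p "$work"

    # innobackupex --no-timestamp ... "$work"   # real backup goes here

    # mv within one filesystem is atomic: a full_* directory is either
    # absent or complete, never partial; older copies stay untouched.
    mv "$work" "$root/full_$stamp"

    # Prune only AFTER the new backup exists; keep the two newest.
    ls -1d "$root"/full_* | sort | head -n -2 |
        while read -r old; do rm -rf "$old"; done
}
```

The key ordering is prune-after-promote: even if the backup itself fails, the previous valid copies are never deleted.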

Any best practices for very large (8–12 TB) Galera clusters using XtraBackup over NFS would be greatly appreciated.

By the way, they are currently on the same storage; I will move the backup target to separate storage in about a month.

Thank you.


Hello @aycelen,

production MySQL 5.7 (Galera Cluster, 3 nodes)

5.7 has been dead for many years. 8.0 will EOL in less than 60 days. You are missing out on many, many security fixes, performance improvement patches, etc.

XtraBackup (innobackupex 2.4.29)

This has also been dead for many years. Again, you’re missing out on new features in Xtrabackup 8.4 that improve backup processes.

both virtual machines and may be on the same underlying storage pool

That will hurt performance of everything. Not only are the PXC VMs fighting for disk IO, now your backup is also fighting for the same IO. And with HDDs, the IO is extremely low, and slow. Are you monitoring HDD performance anywhere?

  1. Is >24 hour backup time expected for ~10 TB dataset over NFS (sync) on HDD storage?

10 TB = 10,485,760 MB. Let’s say your 7200 RPM HDD writes at 120 MB/s, assuming absolutely nothing else is using that disk: 10,485,760 / 120 = 87,381 s ≈ 24.3 hrs. Now add all the VM overhead, network overhead, application traffic, etc., and you’ll see that a >24 hr backup time is absolutely expected.

Backup to local disk first, then rsync to NFS?

If you have the local disk space for this, this would be faster, but not by much considering the volume of data you have.

Any best practices for very large (8–12 TB) Galera clusters using XtraBackup over NFS would be greatly appreciated.

  • Stop using NFS. Instead, stream the backup directly to the storage server. This bypasses all the NFS network and disk overhead. Docs: Take a streaming backup - Percona XtraBackup. Look at the example “Send the backup to another server using netcat”.
  • --parallel, always. Use 4 or 8, depending on how much CPU you have available.
  • --compress, always. Modern versions of PXB use zstd, which is much better/faster than the older zlib.
  • With datasets > 2 TB, we recommend one of the following approaches to backups:
    1. Use snapshots. Disk/VM snapshots are cheap on disk (since they only store deltas) and should complete quickly.
    2. Switch to an incremental backup style. Example: every Sunday, take a full backup. On Monday, take an incremental using Sunday as the base; it only backs up what changed since Sunday. On Tuesday, take another incremental using Monday as the base, and so on. If only a few GB change per day, the incremental backups will go blazingly fast.
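The netcat streaming approach mentioned above can be sketched as follows (hostname, port, and paths are placeholders; see the linked docs for the canonical example):

```shell
# On the backup server: listen on a port and write the raw stream
# straight to local disk
nc -l -p 9999 > /data/backups/full_$(date +%F).xbstream

# On a Galera node: compress in parallel and stream over plain TCP,
# bypassing the NFS layer entirely
innobackupex --user=root --password=XXX \
  --parallel=4 --compress --compress-threads=4 \
  --stream=xbstream /tmp | nc -w 2 backup-server 9999
```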

Hi @aycelen,

As @matthewb described, your 24+ hour times are expected given the HDD/NFS path. I benchmarked PS 5.7.44 and XtraBackup 2.4.29 on a ~2 GB dataset (20 InnoDB tables, innodb_file_per_table=ON) to quantify his recommendations. The relative speedups scale to your 10 TB setup:

| Flags | Time | Size | vs Baseline |
| --- | --- | --- | --- |
| (default, no flags) | 25 s | 2583 MB | 1.0x |
| `--parallel=4` | 10 s | 2583 MB | 2.5x |
| `--compress --compress-threads=4` | 20 s | 63 MB | 1.2x |
| Combined + `--stream=xbstream` | 7 s | 62 MB | 3.5x |
| Incremental (after 1% data change) | 18 s | 42 MB | — |

The 97% compression ratio here is inflated by synthetic data. Expect 60-80% on production InnoDB data, which still means writing 2-4 TB instead of 10 TB over your NFS link. Since HDD write throughput is the bottleneck, that alone could cut your backup time in half.

The biggest win is streaming directly to the backup server, bypassing NFS entirely as matthewb recommended:

innobackupex --user=root --password=XXX \
  --parallel=4 --compress --compress-threads=4 \
  --stream=xbstream --galera-info /tmp | \
  ssh backup-server "cat > /BACKUP/$(date +%F).xbstream"

The --galera-info flag captures the GTID position, which you will need for cluster-aware restores. For decompression and prepare on the backup server, see the streaming backup docs.
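For reference, the extract-and-prepare steps on the backup server look roughly like this (paths and filename are illustrative; `--decompress` requires the qpress binary on PATH):

```shell
# Unpack the xbstream archive into a working directory
mkdir -p /BACKUP/restore && \
  xbstream -x -C /BACKUP/restore < /BACKUP/full.xbstream

# Decompress the .qp files produced by --compress
innobackupex --decompress /BACKUP/restore

# Prepare (apply the redo log) so the backup is restorable
innobackupex --apply-log --use-memory=4G /BACKUP/restore
```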

For the incremental strategy matthewb described: weekly full + daily incrementals. With 1% daily change on 10 TB, each incremental writes roughly 100 GB instead of 10 TB and finishes in minutes rather than hours.
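A minimal weekly-full / daily-incremental cycle with innobackupex 2.4 might look like this (directory names are illustrative):

```shell
# Sunday: full backup that acts as the base
innobackupex --user=root --password=XXX --no-timestamp /BACKUP/full

# Monday: incremental containing only pages changed since the full
innobackupex --user=root --password=XXX --no-timestamp \
  --incremental --incremental-basedir=/BACKUP/full /BACKUP/inc1

# Tuesday: incremental based on Monday's backup
innobackupex --user=root --password=XXX --no-timestamp \
  --incremental --incremental-basedir=/BACKUP/inc1 /BACKUP/inc2

# Restore: prepare the full with --redo-only, roll in each incremental
# in order (the last one without --redo-only), then a final prepare
innobackupex --apply-log --redo-only /BACKUP/full
innobackupex --apply-log --redo-only /BACKUP/full --incremental-dir=/BACKUP/inc1
innobackupex --apply-log /BACKUP/full --incremental-dir=/BACKUP/inc2
innobackupex --apply-log /BACKUP/full
```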

If you must keep using NFS temporarily, check your mount options. A small rsize/wsize (32 KB on older setups) adds severe per-request overhead on large sequential writes; raise both to 1 MB. Also note that the sync in your export is enforced server-side, so change it to async in /etc/exports (and re-export) in addition to the client mount. For a backup target this durability trade-off is usually acceptable: if the backup server crashes mid-write, you simply retake the backup.

mount -t nfs4 -o rsize=1048576,wsize=1048576,async,noatime backup-server:/DBBACKUP /BACKUP

Your node selection (low-traffic node) is correct. If you see wsrep_flow_control_paused rising during backups, --throttle can cap XtraBackup’s read I/O rate at the cost of longer backup times.
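If you do reach for it, --throttle caps the number of read/write I/O operation pairs XtraBackup issues per second; the value below is a starting guess to tune against your flow-control metrics, not a recommendation:

```shell
# Cap XtraBackup at ~40 I/O operation pairs per second on the source node
innobackupex --user=root --password=XXX --throttle=40 \
  --no-timestamp /BACKUP/DBBackup/
```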