I am running a production MySQL 5.7 (Galera Cluster, 3 nodes) environment and facing extremely long backup times with XtraBackup (innobackupex 2.4.29). I would like to validate whether this behavior is expected given our architecture or if there are recommended optimizations.
Database is highly active (write-heavy Galera cluster)
NFS export is configured with sync
Disks on both DB and backup server are HDD (not SSD/NVMe)
Cluster status during backup:
wsrep_cluster_status = Primary
wsrep_local_state_comment = Synced
Flow control mostly OFF
One node handles heavy traffic (~280 connections), backup is taken from a lower-traffic node (~10–20 connections)
Constraints / Requirements
Production environment (cannot stop writes)
Must maintain Galera consistency
Prefer not to use aggressive settings that may cause replication lag
Need to ensure at least one valid full backup is always available (no destructive rotation)
Backup disk is remote NFS storage
Questions
Is >24 hour backup time expected for ~10 TB dataset over NFS (sync) on HDD storage?
Would switching NFS export from sync to async be considered safe/recommended for XtraBackup workloads?
Is it better practice to:
Backup to local disk first, then rsync to NFS?
Or write directly to NFS in large environments?
Are the following options recommended for large HDD + NFS setups?
--parallel
--rsync
--throttle
--compress / --stream
For Galera specifically, is there any official guidance on:
Best node selection for backups
Avoiding cluster performance impact during 10TB+ full backups
Additional Concern
Because backups take longer than 24 hours, cron jobs can overlap and risk deleting previous valid backups before a new one is fully completed. We are redesigning the rotation logic to be non-destructive and atomic.
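One way to make that rotation overlap-safe and non-destructive is a flock(1) guard plus an atomic symlink swap: each run writes into a timestamped `.inprogress` directory, publishes it only on success, and prunes only completed older fulls. This is a sketch, not the poster's actual script; the paths, lock location, and keep-two retention count are all assumptions to adapt:

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_backup_cycle ROOT: take one backup into a timestamped directory under
# ROOT, publish it atomically as ROOT/latest, then prune old full backups.
# A previous valid full is never deleted before the new one is complete.
run_backup_cycle() {
    local root="$1"

    # flock(1) makes an overlapping cron run a no-op instead of a race
    exec 9>"$root/.backup.lock"
    flock -n 9 || { echo "previous backup still running; skipping"; return 0; }

    local dest
    dest="$root/full_$(date +%Y%m%d_%H%M%S).inprogress"
    mkdir -p "$dest"

    # ... run innobackupex/xtrabackup writing into "$dest" here ...

    # Only after the backup succeeded: drop the .inprogress suffix and
    # swap the "latest" symlink atomically via rename(2)
    local final="${dest%.inprogress}"
    mv "$dest" "$final"
    ln -sfn "$final" "$root/latest.tmp"
    mv -T "$root/latest.tmp" "$root/latest"

    # Prune: keep the two newest completed fulls, never touch .inprogress dirs
    ls -1d "$root"/full_* 2>/dev/null | grep -v '\.inprogress$' \
        | sort | head -n -2 | xargs -r rm -rf
}
```

Because a crashed run leaves only an orphaned `.inprogress` directory behind, the `latest` pointer and all completed backups survive any failure mode short of disk loss.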
Any best practices for very large (8–12 TB) Galera clusters using XtraBackup over NFS would be greatly appreciated.
They use the same storage, by the way; I will change the backup storage after one month.
5.7 has been dead for many years. 8.0 will EOL in less than 60 days. You are missing out on many, many security fixes, performance improvement patches, etc.
> XtraBackup (innobackupex 2.4.29)
This has also been dead for many years. Again, you’re missing out on new features in Xtrabackup 8.4 that improve backup processes.
> both virtual machines and may be on the same underlying storage pool
That will hurt performance of everything. Not only are the PXC VMs fighting for disk IO, now your backup is also fighting for the same IO. And with HDDs, the IO is extremely low, and slow. Are you monitoring HDD performance anywhere?
> Is >24 hour backup time expected for ~10 TB dataset over NFS (sync) on HDD storage?
10 TB = 10,485,760 MB. Let's say your 7200 RPM HDD writes at 120 MB/s, and that's assuming absolutely nothing else is using that disk: 10,485,760 MB / 120 MB/s = 87,381 s ≈ 24.3 hrs. Now add all the VM overhead, network overhead, application traffic, etc., and you'll see that a >24 hr backup time is absolutely expected.
> Backup to local disk first, then rsync to NFS?
If you have the local disk space for this, this would be faster, but not by much considering the volume of data you have.
> Any best practices for very large (8–12 TB) Galera clusters using XtraBackup over NFS would be greatly appreciated.
Stop using NFS. Instead, stream the backup directly to the storage server. This bypasses all the NFS network and disk overhead. Docs: "Take a streaming backup" in the Percona XtraBackup manual; look at the example "Send the backup to another server using netcat".
--parallel: always. Use 4 or 8, depending on how much CPU you have available.
--compress: always. Modern versions of PXB use zstd, which is much better/faster than the older zlib.
With datasets > 2TB, we recommend one of the following approaches to backups:
Use snapshots. Disk/VM snapshots are cheap on disk (since they only store deltas) and should complete quickly.
Switch to an incremental backup style. Example: every Sunday, take a full backup. On Monday, take an incremental backup using Sunday as the base; it will only back up what changed since Sunday. On Tuesday, take another incremental using Monday as the base, which will only contain the difference since Monday, and so on. If you are only changing a few GB per day, the incremental backups will go blazingly fast.
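With XtraBackup 2.4 syntax, that schedule looks roughly like this. The directory layout is a placeholder, and `<TIMESTAMP>` stands for the timestamped subdirectory innobackupex creates on each run:

```shell
# Sunday: full backup (the base)
innobackupex /BACKUP/sun

# Monday: only pages changed since Sunday's full
innobackupex --incremental /BACKUP/mon --incremental-basedir=/BACKUP/sun/<TIMESTAMP>

# Tuesday: only pages changed since Monday's incremental
innobackupex --incremental /BACKUP/tue --incremental-basedir=/BACKUP/mon/<TIMESTAMP>

# Restore: prepare the full with --redo-only, apply incrementals in order,
# and note that the LAST incremental is applied WITHOUT --redo-only
innobackupex --apply-log --redo-only /BACKUP/sun/<TIMESTAMP>
innobackupex --apply-log --redo-only /BACKUP/sun/<TIMESTAMP> --incremental-dir=/BACKUP/mon/<TIMESTAMP>
innobackupex --apply-log /BACKUP/sun/<TIMESTAMP> --incremental-dir=/BACKUP/tue/<TIMESTAMP>
```

The prepare order matters: applying an intermediate incremental without `--redo-only` rolls back uncommitted transactions and makes later incrementals unusable.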
As @matthewb described, your 24+ hour times are expected given the HDD/NFS path. I benchmarked PS 5.7.44 and XtraBackup 2.4.29 on a ~2 GB dataset (20 InnoDB tables, innodb_file_per_table=ON) to quantify his recommendations. The relative speedups scale to your 10 TB setup:
| Flags | Time | Size | vs Baseline |
|---|---|---|---|
| (default, no flags) | 25s | 2583 MB | 1.0x |
| `--parallel=4` | 10s | 2583 MB | 2.5x |
| `--compress --compress-threads=4` | 20s | 63 MB | 1.2x |
| Combined + `--stream=xbstream` | 7s | 62 MB | 3.5x |
| Incremental (after 1% data change) | 18s | 42 MB | — |
The 97% compression ratio here is inflated by synthetic data. Expect 60-80% on production InnoDB data, which still means writing 2-4 TB instead of 10 TB over your NFS link. Since HDD write throughput is the bottleneck, that alone could cut your backup time in half.
The biggest win is streaming directly to the backup server, bypassing NFS entirely as matthewb recommended:
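A sketch of that pipeline, following the netcat example from the Percona streaming docs; the hostname `backup-server`, port 9999, and the target directory are placeholders:

```shell
# On the backup server: listen and unpack the incoming xbstream archive
nc -l 9999 | xbstream -x -C /BACKUP/full_inprogress

# On the low-traffic Galera node: stream a compressed, parallel backup
innobackupex --parallel=4 --compress --compress-threads=4 \
    --galera-info --stream=xbstream /tmp | nc backup-server 9999
```

With `--stream`, the `/tmp` argument is only used for temporary files; the backup itself never touches the database node's disk, so the only write load is on the backup server.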
The --galera-info flag captures the GTID position, which you will need for cluster-aware restores. For decompression and prepare on the backup server, see the streaming backup docs.
For the incremental strategy matthewb described: weekly full + daily incrementals. With 1% daily change on 10 TB, each incremental writes roughly 100 GB instead of 10 TB and finishes in minutes rather than hours.
If you must keep using NFS temporarily, check your mount options. The default rsize/wsize of 32 KB causes severe overhead on large sequential writes. Increasing them to 1 MB and switching to async makes a significant difference; for backup traffic (as opposed to a live datadir), async is generally considered acceptable, since a backup interrupted mid-write can simply be retaken:
mount -t nfs4 -o rsize=1048576,wsize=1048576,async,noatime backup-server:/DBBACKUP /BACKUP
Your node selection (low-traffic node) is correct. If you see wsrep_flow_control_paused rising during backups, --throttle can cap XtraBackup’s read I/O rate at the cost of longer backup times.
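For example (the throttle value here is an illustrative starting point for HDDs, not a tested recommendation):

```shell
# Watch Galera flow control while the backup runs; a rising pause ratio
# means the backup's IO is starting to stall replication
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'"

# Cap XtraBackup at ~100 IO operations (read+write pairs) per second
innobackupex --throttle=100 --parallel=4 /BACKUP
```

Reset the `wsrep_flow_control_paused` counter with `FLUSH STATUS` before the backup window so the reading reflects only backup-induced pauses.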