Any way to realtime decompress xtrabackup streams?

Hi there!

We have a handful of relatively large (5-10TB) MySQL databases spread across multiple datacenters for redundancy. When we want to spin up a new replica node, we utilize xtrabackup to dump data from the primary, and send to the new replica for its initial sync.

Currently we are utilizing the --stream option to save time and avoid unnecessarily filling disk space on the primary node, but often run into concerns over how long it can take to stream several TB over the internet (when going between datacenters).

We’ve introduced compression to help with this, but currently we have to stream compressed data to the replica, then wait for that compressed data to be decompressed, before it can be prepared and synced. This is not a huge concern, but decompression can take a notable amount of time, which adds anxiety when running up against our transaction log retention limit.

So as the title asks, I’m curious if there is any way to decompress our stream in realtime on the replica, as it is received?

Right now our command for streaming to a new replica, as executed on the primary node, is as follows:

xtrabackup --backup --stream=xbstream --compress | ssh user@replica "cd ~/xtrabackup_backupfiles && xbstream -x"

However, as is, this leaves a compressed dataset in ~/xtrabackup_backupfiles on the replica when the stream concludes. Decompressing it takes additional time, during which the replica falls further behind the primary node, so it needs even longer to catch up before it is finally in sync.

I have tried a few variations of commands on the replica I thought might work, e.g.

... cd ~/xtrabackup_backupfiles && mbuffer -m 1G | zstd -d | xbstream -x

… which fails because zstd can’t directly parse the xbstream data (I assume because xbstream only compresses the data, and not the metadata). Or:

... cd ~/xtrabackup_backupfiles && xbstream -x -c - | mbuffer -m 1G | zstd -d

… which would theoretically parse the xbstream format first, potentially solving the previous issue, but -x and -c can’t be paired in xbstream.

Is there some magical combination anyone is aware of that would allow realtime decompression so my replica is ready to prepare as soon as the stream concludes?

xbstream neither compresses nor decompresses; it is just an I/O multiplexer. It reads from any number of file descriptors in parallel and writes them to a single output, which is how xtrabackup is able to parallelize backups over a single network connection. As the files are read in chunks, they can be compressed and sent over, but nothing on the receiving side does the opposite. PXB (Percona XtraBackup) can only send; it cannot receive.

I hope you are using parallel decompression on the receiving side: xtrabackup --decompress --parallel 16 --remove-original

You can extract directly; maybe you can write a watcher script that starts decompressing each new file as soon as it appears? socat TCP-LISTEN:3333 - | xbstream -vx ← this will reconstitute each file as soon as it is fully received. You could then decompress.
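A minimal sketch of such a watcher follows. It assumes the backup was compressed with zstd (so compressed files end in .zst), that inotifywait from inotify-tools is installed on the replica, and that xbstream closes each file only once it has been fully written. The function names are hypothetical, and files are decompressed one at a time rather than in parallel.

```shell
#!/bin/sh
# Hypothetical watcher sketch: decompress files as soon as xbstream finishes them.
# Assumptions: zstd-compressed backup (*.zst), inotify-tools installed,
# xbstream closes a file only after its last chunk has been written.

# Decompress one file in place if it carries the .zst suffix; ignore anything else.
decompress_if_zst() {
    case "$1" in
        *.zst) zstd -d -q --rm "$1" ;;
    esac
}

# Watch a directory tree and handle every file once its writer closes it.
# The decompressed output also triggers an event, but it lacks the .zst
# suffix and is therefore ignored.
watch_and_decompress() {
    inotifywait -m -r -q -e close_write --format '%w%f' "$1" |
        while IFS= read -r path; do
            decompress_if_zst "$path"
        done
}

# Usage on the replica, started before the stream begins:
#   watch_and_decompress ~/xtrabackup_backupfiles &
```

Once the stream has finished, a final xtrabackup --decompress pass would still be worth running to catch anything the watcher missed.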

Hi Brick,
I understand you want to avoid the decompression step on the destination. When streaming a backup, instead of using the --compress option, you can pipe the stream through any compression tool you prefer, so that both compression AND decompression happen on the fly. This is well demonstrated here:
Just a note: instead of pigz, I’d use zstd for an even better experience.
Hope this is what you are looking for?

Thanks matthewb and przemek for that feedback,

The solution provided by @przemek was exactly what I was looking for, although instead of socat I stuck to ssh for security and simplicity. I also used zstd instead of pigz, as recommended. I found zstd was only marginally better than pigz out of the gate, but I was able to tune it far better.

My final command is as follows:

sudo xtrabackup --backup --stream=xbstream | mbuffer -m 8G | zstd -9 -T10 | ssh user@replica "mbuffer -m 8G | zstd -d -T8 | xbstream -x -C ~/xtrabackup_backupfiles"

I found a compression level of -9 to be best for me, given my base internet speed between datacenters of around 100mb/s (lower levels were still limited by network speed), and the compute resources available (higher levels were limited by the data rate zstd could handle, and began impacting resource availability for MySQL).

For anyone looking to reuse this, I highly suggest trying other levels (zstd supports levels 1-19, plus 20-22 with --ultra); the ideal level for your circumstances will likely vary. One side benefit of using mbuffer is that it reports the average throughput once the stream concludes (or is killed), so I just repeatedly killed my backup after the first 10GB until I found the best settings for my situation.
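If killing and restarting the backup is not convenient, a rough offline alternative is to compress a saved sample at several levels and compare the results. This is only a sketch of that different approach, not the method used above: the sample file and the sweep_levels function are hypothetical, and it measures compression only, not the network.

```shell
#!/bin/sh
# Hypothetical level sweep: compress one sample file at several zstd levels
# and print the compressed size and elapsed wall-clock seconds for each.
# Usage: sweep_levels SAMPLE_FILE THREADS LEVEL [LEVEL...]
sweep_levels() {
    sample=$1
    threads=$2
    shift 2
    for level in "$@"; do
        start=$(date +%s)
        # -c writes to stdout so only the byte count is kept, not the output.
        bytes=$(zstd "-$level" "-T$threads" -c "$sample" | wc -c | tr -d ' ')
        end=$(date +%s)
        echo "level=$level bytes=$bytes seconds=$((end - start))"
    done
}

# Example (sample.xbstream is an assumed file, e.g. the first 10GB of a stream):
#   sweep_levels sample.xbstream 10 3 6 9 12
```

The level that finishes fastest while still shrinking the sample enough to keep the link busy is a reasonable starting point for a live run.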

I found the -T10 flag for 10 parallel compression threads to be ideal on my primary, which has 16 logical cores. It slightly edged out -T8, but anything higher seemed to reduce throughput again, presumably due to increased overhead. As with compression level, your mileage may vary.

All in all, I was able to achieve around 600mb/s equivalent - i.e. a 6:1 realtime compression ratio. Thanks for the help!
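To put those rates in perspective, here is a small back-of-the-envelope calculation. It assumes the figures are megabytes per second and uses decimal units (1 TB = 1,000,000 MB); if the link speed is actually in megabits, the times are 8x longer. The transfer_hours helper is hypothetical.

```shell
#!/bin/sh
# Estimate transfer time in hours for SIZE_TB terabytes at RATE_MB_S MB/s.
# Assumes decimal units: 1 TB = 1,000,000 MB.
transfer_hours() {
    awk -v tb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tb * 1000000 / rate / 3600 }'
}

transfer_hours 10 100   # 10TB at the raw link rate      -> 27.8
transfer_hours 10 600   # 10TB at the effective 6:1 rate -> 4.6
```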
