Innobackupex - xtrabackup_logfile got the error: Errcode: 116 - Stale file handle at the last step

Hi everyone,

We’re encountering an issue with our database backups, which can take several days due to the large size (10TB to 40TB). The backups are written to an NFS share. Unfortunately, at the very end of the process, when writing information to the xtrabackup_logfile, we’re getting a “Stale file handle” error (Errcode: 116).

This error renders the entire backup unusable, and retries haven’t been successful. We suspect the issue might be related to the xtrabackup_logfile’s file handle becoming outdated during the lengthy backup process.

Here’s our question:

  • Is there a way to handle the “stale file handle” error during xtrabackup backups? Could we potentially retry with a different approach instead of relying on the original file handle created at the beginning (which can be as old as 48 hours)?

If anyone has encountered this error or has suggestions on how to avoid it, your input would be greatly appreciated!

Hello @kart,
You are not the first to post on these forums about NFS-based backups hitting stale file handles. This seems to be a common issue with NFS-based backups.

I would recommend that you take advantage of multiple PXB streams to back up your database much faster than "several days".

With this approach, the files are held in memory and streamed as the data is written, rather than holding a file descriptor open for hours, which is what can go stale.
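A minimal sketch of the streaming approach, assuming ssh access from the database host to a backup host; the hostnames, paths, and parallelism values are placeholders, not anything from this thread:

```shell
# Stream the backup with xbstream over ssh instead of writing files
# directly to NFS. The backup never holds long-lived file handles on
# the remote filesystem; data is piped as it is produced.
# "backupuser", "backuphost", and "/backups" are hypothetical.
xtrabackup --backup --parallel=8 --stream=xbstream --target-dir=./ \
  | ssh backupuser@backuphost "xbstream -x -C /backups/$(date +%F)"
```

The `--parallel` copy threads speed up reading the datafiles, while `--stream=xbstream` serializes everything to stdout so the receiving side can unpack it with `xbstream -x` as it arrives.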

For a database of that size, the better practice is to use filesystem snapshots such as ZFS, btrfs, or EBS (i.e., cloud snapshots), as you can take the snapshot in under 10s and then transfer it to another system in the background.

Thank you @matthewb for the quick reply.

I will start exploring the multiple PXB Streams.

By the way, could you please elaborate on these FS snapshots? Can I restore from them the same way I do with innobackupex? Please advise, or point me to any materials to check. I will also google it, thank you.

Check which filesystems support snapshots. In session 1, run FLUSH TABLES WITH READ LOCK. Then in session 2, run the snapshot command (e.g., aws ec2 create-snapshot --volume-id <vol-id> for EBS). After that returns, exit session 1 to release the lock. Snapshot done in < 30s.
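The two-session sequence above can be sketched in a single script; this is a hypothetical example, with the hostname, credentials, and volume ID as placeholders. The key point is that FLUSH TABLES WITH READ LOCK is released the moment its session disconnects, so the lock session must stay connected until the snapshot command returns:

```shell
# Session 1: take the global read lock and hold the connection open
# in the background (SLEEP keeps the session, and thus the lock, alive).
mysql -h dbhost -u backup -p \
  --execute="FLUSH TABLES WITH READ LOCK; SELECT SLEEP(600);" &
LOCK_PID=$!

# Session 2: snapshot the data volume while writes are blocked.
# The volume ID is a placeholder.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "mysql consistent snapshot $(date +%F)"

# End session 1; disconnecting releases the lock automatically.
kill "$LOCK_PID"
```

The SLEEP(600) is just a generous upper bound; killing the client as soon as the snapshot returns is what actually releases the lock.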

To restore, create a new server, restore the snapshot onto it, and start MySQL.
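For the EBS flavor, the restore side could look something like the following sketch; every ID, device name, and path here is a placeholder, and actual device naming varies by instance type (e.g., nvme devices):

```shell
# Create a volume from the snapshot in the target instance's AZ,
# attach it, mount it as the MySQL datadir, and start the server.
# All IDs below are hypothetical placeholders.
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
  --instance-id i-0123456789abcdef0 --device /dev/sdf
mount /dev/sdf /var/lib/mysql   # device name may differ on the instance
systemctl start mysql
```

Because the snapshot was taken under FLUSH TABLES WITH READ LOCK, MySQL starts from a consistent datadir, with at most normal InnoDB crash recovery on startup.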