Xtrabackup backup failed

2024-07-22T04:29:54.992365+08:00 1 [Note] [MY-011825] [Xtrabackup] >> log scanned up to (1306803205971)
2024-07-22T04:29:55.993273+08:00 1 [Note] [MY-011825] [Xtrabackup] >> log scanned up to (1306803441664)
2024-07-22T04:29:56.994424+08:00 1 [ERROR] [MY-011825] [Xtrabackup] could not find redo log file with LSN 1306803441664
2024-07-22T04:29:56.994484+08:00 1 [ERROR] [MY-011825] [Xtrabackup] read_logfile() failed.
2024-07-22T04:29:57.769194+08:00 0 [ERROR] [MY-011825] [Xtrabackup] log copying failed.

version:
percona-xtrabackup-80 8.0.35-30-1.bookworm
mysql-community-server-core 8.0.37-1debian12

cli:
innobackupex --parallel 1 --no-timestamp --stream=xbstream …

mysql config:
innodb_redo_log_capacity 4194304000

Backups run daily. Not every backup fails; failures occur roughly once a month.

@s0e0c0

We noticed a similar issue tracked in Jira: [PXB-3023] - Percona JIRA

This seems like an edge case where the server's last write lands right at the end of the last block of a redo log file. There are no new writes after that, so the server does not create the next file. PXB consumes the last block of the file and then attempts to reopen the files to discover the new one, but it does not exist yet.
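
If it helps to confirm that boundary condition, a minimal sketch (connection options omitted) is to compare the server's current LSN from SHOW ENGINE INNODB STATUS with the END_LSN of the newest redo file:

# Sketch only: current LSN vs. redo log file boundaries
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep "Log sequence number"
mysql -e "SELECT FILE_NAME, START_LSN, END_LSN FROM performance_schema.innodb_redo_log_files;"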

Does this happen with full backups only, or with incremental backups as well?

When the backup fails, do you observe any DDL or any scheduled process/cron on the database that is unusual compared to the daily routine?

If feasible, you could try changing the backup time slot to see whether it reduces such occurrences.

There are features in PXB such as --register-redo-log-consumer and redo log archiving that you can try in order to have more redo log availability, if that is the issue here.
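
As a rough illustration only (the 'redoarch' label and the archive directory below are placeholders, and the directory must meet the permission requirements described in the MySQL manual), redo log archiving is configured on the server side, while the consumer flag is simply added to the backup command:

# Placeholder label/path for server-side redo log archiving
mysql -e "SET GLOBAL innodb_redo_log_archive_dirs = 'redoarch:/var/lib/mysql-redo-archive';"

# Backup invocation with the redo log consumer registered (other options as in your script)
xtrabackup --backup --register-redo-log-consumer --parallel=1 --stream=xbstream …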

How big is your overall database, and how many tables do you have in your DB?

  1. All backups are full backups. There are nearly a hundred separate MySQL instances backed up daily; not every one of them fails, and the ones that do fail don't fail every day;
  2. Backups run late at night with no DDL and no other crontab tasks running. We have observed dozens of failures across different MySQL instances so far;
  3. We will try adjusting the backup time;
  4. Will --register-redo-log-consumer lock the DB? Since that could have an impact on the business, we have skipped it for now. We will try "redo log archiving";
  5. Currently, the databases whose backups fail range in size from 70G to 150G, with 300 tables.

Backups taken when the business is closed never fail.
innobackupex --parallel 8 --no-timestamp --stream=xbstream …

@s0e0c0

Thanks for confirming the details.

  1. Will --register-redo-log-consumer lock the DB? Since that could have an impact on the business, we have skipped it for now. We will try "redo log archiving";

Yes, writes could be blocked when using --register-redo-log-consumer, which is why it is better to perform the backups on a backup or secondary node.

What kind of database topology do you have, exactly? Are you taking the backups on a dedicated backup node or on the main active nodes?

If you are taking backups on native async replicas, you can also try using --safe-slave-backup, which helps ensure a consistent backup by stopping the replication SQL thread.
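
For instance (sketch only; the trailing "…" stands for the rest of your existing options), the flag is simply appended to the same style of command you already run:

innobackupex --parallel 8 --no-timestamp --safe-slave-backup --stream=xbstream …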

Also, in a PXC/Galera-based environment you can reduce the backup's overhead and impact by enabling the wsrep_desync parameter, which also keeps the node from triggering flow control during the backup.

`SET GLOBAL wsrep_desync=1;`
Run the backup.
`SET GLOBAL wsrep_desync=0;`

Backups taken when the business is closed never fail.
innobackupex --parallel 8 --no-timestamp --stream=xbstream …

So this seems related to workload. Have you verified whether you are writing more data than the current redo log size can accommodate? In that case, you could also try increasing the existing redo log size and observe whether the issue resolves.

Let me share some best practices about setting the redo log size.

I see you are using innodb_redo_log_capacity, so you can also directly increase it if required.
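
For example (the 8 GiB value below is purely illustrative; size it for your own write volume), innodb_redo_log_capacity is dynamic in MySQL 8.0.30+, so it can be raised without a restart:

# Illustrative value only: raise redo capacity to 8 GiB and persist it across restarts
mysql -e "SET PERSIST innodb_redo_log_capacity = 8589934592;"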

Let's try applying the changes we discussed so far [different backup time, redo log consumer/archiving] or, if applicable, [--safe-slave-backup or wsrep_desync], and let us know if you see any positive sign.

By any chance, when your backup fails, are you able to capture any database information such as SHOW FULL PROCESSLIST\G or SHOW ENGINE INNODB STATUS\G, or any database/OS details that might help correlate the patterns?
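
If it is easier to automate, here is a rough wrapper sketch (the output path is a placeholder and the "…" stands for your existing options) that collects those outputs only when the backup exits non-zero:

# Sketch: collect diagnostics only on backup failure
if ! innobackupex --parallel 8 --no-timestamp --stream=xbstream … ; then
    mysql -e "SHOW FULL PROCESSLIST\G"     >  /tmp/backup_fail_diag.txt
    mysql -e "SHOW ENGINE INNODB STATUS\G" >> /tmp/backup_fail_diag.txt
    mysql -e "SELECT FILE_NAME, START_LSN, END_LSN FROM performance_schema.innodb_redo_log_files;" >> /tmp/backup_fail_diag.txt
fi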

  1. Source (one node) → replica (one node); backups only run on the source, for some reason;
  2. innodb_redo_log_capacity has been adjusted several times, from 256M up to the current 8G;
  3. Because not every backup fails, we only have one "SELECT FILE_NAME,START_LSN,END_LSN FROM performance_schema.innodb_redo_log_files;" result:

[ERROR] [MY-011825] [Xtrabackup] could not find redo log file with LSN 676510529536


SELECT FILE_NAME,START_LSN,END_LSN FROM performance_schema.innodb_redo_log_files;
FILE_NAME                      START_LSN      END_LSN
./#innodb_redo/#ib_redo3714    671410294784   xxxxxxxxxxx
…
./#innodb_redo/#ib_redo3732    676242096128   676510529536
./#innodb_redo/#ib_redo3733    676510529536   676778962944

  4. We have added a few other diagnostic outputs and will wait until the next backup fails to see what happened.

@s0e0c0

Thanks for your inputs!

Source (one node) → replica (one node); backups only run on the source, for some reason;

Okay, so this is a native async replication setup, correct?

Still, could you take such backups on a dedicated backup/replica node, along with the suggested changes?

SELECT FILE_NAME,START_LSN,END_LSN FROM performance_schema.innodb_redo_log_files;
FILE_NAME                      START_LSN      END_LSN
./#innodb_redo/#ib_redo3714    671410294784   xxxxxxxxxxx
…
./#innodb_redo/#ib_redo3732    676242096128   676510529536
./#innodb_redo/#ib_redo3733    676510529536   676778962944

mysql> select 268433408/1024/1024;
+---------------------+
| 268433408/1024/1024 |
+---------------------+
|        255.99804688 |
+---------------------+
1 row in set (0.00 sec)

The gap doesn't seem to be that big (676510529536 - 676242096128 = 268433408 bytes, roughly 256 MB per redo file), so it shouldn't be a problem.

Have you applied any of the previously recommended changes? Let's try shifting the backups over to the replica with those changes, and then let us know whether you are still facing issues.

Source (one node) → replica (another node); not a native async replication setup.

The backups have now been adjusted to run on the replica, and no anomalies have been found with the backups so far.

@s0e0c0

The backups have now been adjusted to run on the replica, and no anomalies have been found with the backups so far.

That's great. Please keep observing for a couple more days and let us know if you face any further problems!

Regards,
Anil