This seems like an edge case where the server's last write lands at the edge of the last block of a redo log file. There is no new write, so the server does not create the next file. PXB consumes the last block of the file and attempts to reopen the files to discover the new one, but it does not exist yet.
Does this happen with full backups only, or with incremental backups as well?
When the backup fails, did you observe any DDL or any scheduled process/cron job on the database that is outside the normal daily routine?
Maybe you can try changing the backup time slot, if feasible, to see whether it reduces such occurrences.
All backups are full backups. There are nearly a hundred separate MySQL instances that are backed up daily; not every one of them fails, and the ones that do fail don’t fail every day.
Backups run late at night, with no DDL running and no other crontab tasks. We have observed dozens of failures on different MySQL instances so far.
We will try adjusting the backup time.
Will --register-redo-log-consumer lock the DB? That could have an impact on the business, so we have ruled it out. We will try “redo log archiving”.
Currently, backups of the failed databases range in size from 70G to 150G, with 300 tables.
Backups taken when business is closed never fail.
innobackupex --parallel 8 --no-timestamp --stream=xbstream …
Will --register-redo-log-consumer lock the DB? That could have an impact on the business, so we have ruled it out. We will try “redo log archiving”.
Yes, writes could be blocked when using --register-redo-log-consumer, which is why it is better to perform the backups on a dedicated backup or secondary node.
Exactly what kind of database topology do you have? Are you taking the backups on a dedicated backup node or on the main active nodes?
If you are taking backups on native async replicas, you can also try --safe-slave-backup, which ensures a consistent backup by stopping the replica SQL thread.
Also, in a PXC/Galera-based environment, you can take the backup with less overhead and impact by enabling the wsrep_desync parameter, which also keeps the node from triggering flow control (FC).
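On the redo log archiving route mentioned above, here is a rough sketch of the server-side setup. The helper name `enable_redo_log_archiving`, the label, and the directory are all placeholders I made up for illustration. It assumes the archive directory must exist and must not be world-accessible, and that it is owned by the OS user running mysqld.

```shell
# Sketch only: enable_redo_log_archiving is a hypothetical helper.
# $1 = label (placeholder), $2 = archive directory (placeholder).
enable_redo_log_archiving() {
    label=$1
    dir=$2
    # The directory must exist and must not be world-accessible.
    # It should also be owned by the OS user that runs mysqld.
    mkdir -p "$dir" || return 1
    chmod 700 "$dir" || return 1
    # innodb_redo_log_archive_dirs takes 'label:directory' pairs.
    mysql -e "SET GLOBAL innodb_redo_log_archive_dirs = '$label:$dir';"
}

# Usage (placeholder values):
# enable_redo_log_archiving backup /path/to/redo-archive
```

SET GLOBAL does not survive a restart, so the same value would also need to go into the server configuration if it turns out to help.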
`SET GLOBAL wsrep_desync=1;`
Backup process.
`SET GLOBAL wsrep_desync=0;`
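The three steps above can be wrapped so the node is always put back in sync, even when the backup itself fails. This is a minimal sketch: `backup_with_desync` is a hypothetical helper name, and the backup command is whatever you already run.

```shell
# Sketch only: backup_with_desync is a hypothetical helper.
# Pass the real backup command and its options as arguments.
backup_with_desync() {
    # Step 1: desync the node before the backup starts.
    mysql -e "SET GLOBAL wsrep_desync=1;" || return 1
    # Step 2: run the backup command exactly as given.
    "$@"
    rc=$?
    # Step 3: always re-enable sync, even if the backup failed.
    mysql -e "SET GLOBAL wsrep_desync=0;"
    return "$rc"
}

# Usage (placeholder command):
# backup_with_desync your-backup-command --its-options
```

The function returns the backup's own exit code, so existing alerting on a failed backup keeps working.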
Backups taken when business is closed never fail.
innobackupex --parallel 8 --no-timestamp --stream=xbstream …
So this is workload-related. Have you verified whether the workload is writing more data than the current redo log size can accommodate? If so, you could also try increasing the redo log size and see whether the issue resolves.
Let me share some best practices for setting the redo log size.
I see you are using innodb_redo_log_capacity, so you can increase it directly if required.
Let's apply the changes we discussed so far [different backup time, redo log consumer/archiving] or, if applicable, [--safe-slave-backup or wsrep_desync], and let us know if you see any improvement.
By any chance, when your backup failed, were you able to capture any database information such as [SHOW FULL PROCESSLIST\G, SHOW ENGINE INNODB STATUS\G, etc.] or anything database/OS-related that might help correlate the patterns?
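Since innodb_redo_log_capacity (MySQL 8.0.30 and later) is dynamic, the change can be made online. A minimal sketch follows. The helper name `set_redo_log_capacity_gib` is hypothetical, and the 2 GiB in the usage line is only an illustrative value, not a sizing recommendation.

```shell
# Sketch only: set_redo_log_capacity_gib is a hypothetical helper.
# $1 = desired capacity in GiB (illustrative; size it for your workload).
set_redo_log_capacity_gib() {
    gib=$1
    # The variable is set in bytes, so convert GiB to bytes first.
    bytes=$((gib * 1024 * 1024 * 1024))
    mysql -e "SET GLOBAL innodb_redo_log_capacity = $bytes;"
}

# Usage (illustrative value):
# set_redo_log_capacity_gib 2
```

SET GLOBAL is lost on restart, so once a value works, put it in the server configuration as well.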
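If nothing was captured so far, something like the following could be called from the backup wrapper whenever the backup exits non-zero. It is only a sketch: `capture_diagnostics` and the output directory are hypothetical, and it assumes the `mysql` client can connect without extra options.

```shell
# Sketch only: capture_diagnostics is a hypothetical helper.
# $1 = directory to write the snapshots into (placeholder).
capture_diagnostics() {
    out_dir=$1
    # Timestamp the files so repeated failures do not overwrite each other.
    stamp=$(date +%Y%m%dT%H%M%S)
    mkdir -p "$out_dir" || return 1
    mysql -e 'SHOW FULL PROCESSLIST\G' > "$out_dir/processlist_$stamp.txt"
    mysql -e 'SHOW ENGINE INNODB STATUS\G' > "$out_dir/innodb_status_$stamp.txt"
}

# Usage (placeholder command and path):
# your-backup-command || capture_diagnostics /path/to/diagnostics
```

Capturing right at failure time matters here, because the failures are intermittent and the state is gone by the next morning.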
mysql> select 268433408/1024/1024;
+---------------------+
| 268433408/1024/1024 |
+---------------------+
|        255.99804688 |
+---------------------+
1 row in set (0.00 sec)
The gap doesn’t seem to be that big, so it shouldn’t be a problem.
Have you applied any of the previously recommended changes? Let’s try moving the backup to the replica with those changes, and then let us know whether you are still facing issues.
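To put a number on that gap, here is the same arithmetic in bytes. Note the assumption: the reference value of 256 MiB is my guess at what the figure is being compared against, since it is not stated here.

```shell
# Assumption: the reference value is 256 MiB (not stated in this thread).
reference_bytes=$((256 * 1024 * 1024))
# Observed value, taken from the query above.
observed_bytes=268433408
gap_bytes=$((reference_bytes - observed_bytes))
echo "gap: $gap_bytes bytes"
```

Under that assumption the difference is 2048 bytes, which is consistent with calling it small.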