Binlog gaps and slow restores

Description:

There is no information on why binlog gaps might occur, or on how to debug or examine them.

Additional Information:

We run PXC 8.0.42 with operator 1.18.0, 3 nodes on EKS, and we see that most of our daily full backups are marked with the “BinlogGap” annotation. This is not ideal. The backups go to S3.

The PITR logcollector pod is doing its thing, and most of the time, after an upload cycle, this is the tail of the log (each time with different logs, of course):

xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078318 (78073 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6096370-6096600
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078319 (63929 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6096601-6096789
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078320 (78073 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6096790-6097020
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078321 (71001 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6097021-6097230
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078322 (71001 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 no cache entry for binlog.078323
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 binlog.078323 (70957 bytes) [E:No]: 146109a7-04bf-11f0-a88e-d3a485b81481:6097441-6097650
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 updating binlog cache
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 last uploaded GTID set: 146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 checking binlog.078323 (146109a7-04bf-11f0-a88e-d3a485b81481:6097441-6097650) against last uploaded set
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6097441-6097650 in binlog.078323 or vice versa
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 checking binlog.078322 (146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440) against last uploaded set
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440 is equal to 146109a7-04bf-11f0-a88e-d3a485b81481:6097231-6097440 in binlog.078322
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 last uploaded binlog: binlog.078322
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:26 starting to process binlog with name binlog.078323
xxx-pitr-67b8f9dd85-9wmwt pitr 2025/12/03 13:59:27 successfully wrote binlog file binlog.078323 to storage with name binlog_1764769761_c872c26bb0cc9434aa2736184c1289f8

This looks good, in my opinion.
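To check this more systematically than by eye, a small script can scan the collector output for consecutive binlogs whose GTID ranges do not line up. This is only a sketch: the line format is taken from the excerpt above, the function names `parse_binlog_lines` and `find_discontinuities` are made up for illustration, and it assumes each line carries a single server UUID with a single interval (multi-interval GTID sets are skipped, not handled).

```python
import re

# Matches collector lines of the form seen in the excerpt, e.g.
#   binlog.078318 (78073 bytes) [E:No]: <uuid>:6096370-6096600
# Assumption: one UUID and one interval per line; other lines are ignored.
LINE_RE = re.compile(
    r"(?P<name>binlog\.\d+) \(\d+ bytes\) \[E:\w+\]: "
    r"(?P<uuid>[0-9a-f-]{36}):(?P<start>\d+)(?:-(?P<end>\d+))?$"
)


def parse_binlog_lines(lines):
    """Return (binlog name, uuid, first GTID, last GTID) for each matching line."""
    entries = []
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue
        start = int(m["start"])
        end = int(m["end"]) if m["end"] else start
        entries.append((m["name"], m["uuid"], start, end))
    return entries


def find_discontinuities(entries):
    """Report pairs of consecutive binlogs where the next range does not
    start exactly one past the previous range's end (per UUID)."""
    problems = []
    last_seen = {}  # uuid -> (binlog name, last GTID number)
    for name, uuid, start, end in entries:
        prev = last_seen.get(uuid)
        if prev is not None and start != prev[1] + 1:
            problems.append((prev[0], name, uuid, prev[1], start))
        last_seen[uuid] = (name, end)
    return problems
```

Feeding it the saved output of the pitr container (for example, the lines of a log file read with `open(path).read().splitlines()`) and printing the result of `find_discontinuities(parse_binlog_lines(lines))` gives a list of binlog pairs to look at more closely. An empty list for the excerpt above would match the impression that this particular cycle is fine.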

Yet we still get gaps in the binlogs. There’s no way to identify when this happens, and I could find little to no documentation on where I should look, or on how to make the process more verbose to debug this further. The documentation is very sparse on the PITR process and on possible tuning knobs for it. The errors we sometimes get from the PITR pod turn up nothing in an internet search; there are simply no matching strings.

I want to try to correlate this with pod restarts or other events in the cluster.
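One rough way to line things up, once there are timestamps for suspicious collector log lines and timestamps for cluster events, is to pair them within a time window. The sketch below assumes both lists have already been collected by other means; the helper names `parse_ts` and `events_near`, the five-minute window, and the event labels in the usage note are all hypothetical. Only the timestamp format is taken from the collector log.

```python
from datetime import datetime, timedelta


def parse_ts(text):
    """Parse a timestamp in the collector log format, e.g. '2025/12/03 13:59:26'."""
    return datetime.strptime(text, "%Y/%m/%d %H:%M:%S")


def events_near(gap_times, events, window=timedelta(minutes=5)):
    """For each gap timestamp, list the labels of events within `window` of it.

    gap_times: iterable of datetime
    events:    iterable of (datetime, label) tuples
    """
    matches = {}
    for gap in gap_times:
        matches[gap] = [
            label for when, label in events if abs(when - gap) <= window
        ]
    return matches
```

For example, `events_near([parse_ts("2025/12/03 13:59:26")], [(parse_ts("2025/12/03 13:57:00"), "pod restart")])` would associate that (made-up) restart with the log timestamp. Both sides need to be in the same time zone for the comparison to mean anything.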

Also, a PITR restore takes a lot of time, while the amount of data it has to chew through is rather limited (this is a staging setup; it gets some bursts of usage when a release is prepped/tested, but it’s mostly idle). Restoring the full backup goes fast: it puts back 100G of data in about 8 minutes and prepares the database. However, the pod doing the “pitr recover” shows 20% usage and 80% idle, and the same goes for the cluster at that time. It seems to be waiting for something, but again, this process is very hard to debug due to the lack of documentation. Looking at the activity on the MySQL server, there’s not much happening, just a commit now and then.

Can someone point me to the documentation on the “pitr” binary in the job that performs the point-in-time recovery (after the full restore), or to anything that might help in identifying why we have slow restores and gaps, or just point me to the source code of the “pitr” binary…