PITR service pod gets stuck when the binlog is large and cannot continue

While testing the PITR service, I found that it becomes very slow when the binlogs are large.
I had set timeBetweenUploads to 1800s, and I can see there are two binlog files (1G and 800M).
The log is here:

2022/05/27 02:45:43 run binlog collector
2022/05/27 02:45:45 Reading binlogs from pxc with hostname= mysql8-pxc-test-pxc-d-pxc-0.mysql8-pxc-test-pxc-d-pxc.rds.svc.cluster.local
2022/05/27 02:58:18 Starting to process binlog with name binlog.000004
2022/05/27 02:58:34 Successfully written binlog file binlog.000004 to s3 with name binlog_1653617848_7d6cff4163b46a2ed0d451a6398a6e71
2022/05/27 02:58:34 Starting to process binlog with name binlog.000005
2022/05/27 02:58:46 Successfully written binlog file binlog.000005 to s3 with name binlog_1653618192_32b772b956f7edd3023316e7a34aeb12

I also found that while the PITR service is busy syncing binlogs, my sysbench read/write tests fail.

In theory it would be better to reduce timeBetweenUploads, but that leads to too many binlog files. How should I weigh these trade-offs?


Hello @cdmikechen,

what would be the problem with too many binlogs?

Also, I’m curious how you ended up with binlogs that big. What is the size of the database?


@Sergey_Pronin
I just used sysbench to test the Percona operator (10,000,000 rows per test case). Please look at my log output carefully: there is a very long gap between the "Reading binlogs" event and the start of the first binlog task, and I don't know what the service is doing in between.
If a binlog is big, the PITR service pod only shows Reading binlogs from pxc with hostname= xxxx with no subsequent log output, which leaves me confused about whether the service has stopped working.

Although I am not a MySQL expert, in my understanding generating a binlog every 30 seconds by default is a bad design choice. Especially in OLTP scenarios, it produces many files in the same directory, which makes working with that folder very inconvenient.
Can you explain the benefit of generating a binlog every 30 seconds, please?


Hello @Sergey_Pronin

I came across this topic because I faced the same issue, and it is not trivial.
In my case timeBetweenUploads is set to 60s and max_binlog_size in the database is set to 100MB:

  spec:
    backup:
      image: okd-registry.okd:5000/percona-xtradb-cluster-operator:1.13.0-pxc8.0-backup-pxb8.0.32
      pitr:
        enabled: true
        resources: {}
        storageName: bcp-s3
        timeBetweenUploads: 60

It is possible we had some big transaction that pumped the binlog well past those 100MB.
Changing the binlog max size or the upload frequency is not a problem; the real problem is that this caused the PITR pod to hang. When I realized the last PITR log entry was some weeks old, I deleted the PITR pod. It was recreated and collected only the binlogs still available on the PXC pods at that moment, causing a binlog gap.

Recently, some of our PITR restore and recovery tests started complaining about that binlog gap.

What would be a solution, or at least a temporary workaround, for this problem?
I was not able to find any kind of liveness probe for the binlog collector in the PITR pod. Could that be a possible solution?
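For context, what I have in mind is a staleness check along these lines. This is only an illustrative sketch, not an operator feature: it parses the collector's log output (format taken from the log lines earlier in this thread) and reports whether the newest entry is older than a threshold. One could feed it the pod's logs from a watchdog and recreate the pod when it reports stuck; the function names and threshold are my own choices.

```python
# Hedged watchdog sketch (NOT an official operator feature): decide
# whether the collector looks stuck by checking the age of its newest
# log line. Timestamp format matches the collector output above,
# e.g. "2022/05/27 02:45:43 run binlog collector".
from datetime import datetime, timedelta

def last_log_age(log_text: str, now: datetime) -> timedelta:
    """Return the time elapsed since the newest timestamped log line."""
    latest = None
    for line in log_text.splitlines():
        try:
            # The timestamp is the first 19 characters of each line.
            ts = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if latest is None or ts > latest:
            latest = ts
    if latest is None:
        raise ValueError("no timestamped lines found")
    return now - latest

def looks_stuck(log_text: str, now: datetime,
                threshold: timedelta = timedelta(hours=1)) -> bool:
    """True if the collector has not logged anything within threshold."""
    return last_log_age(log_text, now) > threshold
```

A cron job could pipe `kubectl logs` of the PITR pod into this and delete the pod when `looks_stuck` returns True, letting the operator recreate it before the gap grows too wide.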

And another question: how does the binlog collector behave when the upload interval is set to 60s and one big transaction runs for, say, 120s, writing a massive amount of data into the active binlog? Does it skip the upload until that binlog is closed, or does it try to close the binlog and upload it to S3?

Thanks in advance

We’re also having this issue. With extra debug logging we see it gets stuck while streaming:

2024/11/19 18:06:47 Processing binlog 38/200: binlog.318770, size: 1073753561
2024/11/19 18:06:47 Starting manageBinlog for binlog.318770
2024/11/19 18:06:47 Retrieved first timestamp for binlog.318770
2024/11/19 18:06:47 Stream command: /usr/bin/mysqlbinlog -R -P 33062 --raw -hperconamysql-pxc-db-pxc-2.perconamysql-pxc-db-pxc.mynamespace.svc.cluster.local -uxtrabackup binlog.318770
2024/11/19 18:06:47 Started mysqlbinlog command for binlog.318770
<infinite-hang>
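To tell a genuinely stalled stream from one that is merely slow, one option is to rerun the same stream command by hand under a hard deadline. A minimal sketch of such a deadline runner (the mysqlbinlog invocation in the usage note is the one from the log above and needs network access to the cluster; the helper name is my own):

```python
# Sketch: run a command with a hard deadline, so a hung stream fails
# loudly instead of blocking forever. On timeout, subprocess.run kills
# the child and raises TimeoutExpired.
import subprocess

def run_with_deadline(cmd, seconds):
    """Run cmd; return (returncode, timed_out). returncode is None on timeout."""
    try:
        proc = subprocess.run(cmd, timeout=seconds, capture_output=True)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        return None, True
```

Usage against the command from the debug log, e.g. `run_with_deadline(["/usr/bin/mysqlbinlog", "-R", "--raw", "-P", "33062", "-h", "perconamysql-pxc-db-pxc-2.perconamysql-pxc-db-pxc.mynamespace.svc.cluster.local", "-u", "xtrabackup", "binlog.318770"], 600)`: if it times out even with a generous deadline while smaller binlogs transfer fine, the stream itself is stalling rather than just being slow.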

Has anyone solved this issue?