Can an XtraBackup process crash a node other than the one it's backing up?

Lars_Erik_Dangvard_J · May 13, 2024, 11:12am

PXC has been running for some time and backups have run every night with no issues. But the last backup crashed the PXC: 2 nodes of a 3-node PXC 8.0 cluster crashed during XtraBackup (xbcloud).

We are running xbcloud with --parallel=10 in this command:

xtrabackup --backup --host=NODE3 --port=3306 --user=BACKUPUSER --password=PASSWORD --stream=xbstream --read-buffer-size=200000000 --extra-lsndir=/checkpoint --target-dir=/checkpoint | xbcloud put --storage=s3 \
    --s3-endpoint='ENDPOINT' \
    --s3-access-key='ACCESSKEY' \
    --s3-secret-key='SECRETKEY' \
    --s3-bucket='BASE' \
    --parallel=10 \
    BASE

NODE3 is the node where the backup runs. PXC on NODE3 has 14 GB RAM assigned (small databases), and the separate backup process with --parallel=10 consumes more memory than the server (16 GB) has left over, so mysqld was OOM-killed. This makes sense.

But NODE1 also crashed with an OOM kill.

Can an XtraBackup process crash a node other than the one it’s backing up?

matthewb · May 13, 2024, 6:09pm

--read-buffer-size=200000000 ← That’s nuts. Why do you have this? I’ve never seen this needed before even on 1+TB systems.

You have parallel=10 on the xbcloud, but not on the xtrabackup. You need parallel on both.
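As a rough sketch of what "parallel on both" could look like, here is the original command wrapped in a shell function with --parallel added on the xtrabackup side as well. The function name run_backup and the value 4 are assumptions for illustration, not values from this thread. The uppercase placeholders are the same ones used in the original post, and --read-buffer-size is left out so the default applies.

```shell
# Hypothetical wrapper; nothing runs until run_backup is called.
run_backup() {
    # --parallel on xtrabackup sets the number of threads copying data files.
    # The value 4 is an assumption; tune it to the memory available on the node.
    xtrabackup --backup --host=NODE3 --port=3306 \
        --user=BACKUPUSER --password=PASSWORD \
        --stream=xbstream --parallel=4 \
        --extra-lsndir=/checkpoint --target-dir=/checkpoint \
    | xbcloud put --storage=s3 \
        --s3-endpoint='ENDPOINT' \
        --s3-access-key='ACCESSKEY' \
        --s3-secret-key='SECRETKEY' \
        --s3-bucket='BASE' \
        --parallel=10 \
        BASE
}
```

Calling run_backup from the nightly job would then start the backup. Whether this changes memory use on the node depends on the workload, so it is worth watching the first run.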

Lars_Erik_Dangvard_J · May 13, 2024, 6:31pm

Streaming to object storage is faster with a high value. With the default --read-buffer-size the stream is very slow and does not speed up (or might it, with --parallel on xtrabackup too?).

But can a high value or a missing parameter on xtrabackup cause another node to crash as well, with an OOM kill, during a period of very low PXC load?

matthewb · May 13, 2024, 7:34pm

I’ve never seen/heard of that happening. Do you have PMM gathering metrics on the hosts to see if the increase in memory correlates to the start/duration of the backup?
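If no metrics are available, the kernel log on each node can at least give the time of the OOM kill, which can then be compared with the backup start time. A minimal sketch, assuming the kernel log is readable through dmesg or journalctl on the hosts; the helper name oom_events and the date in the usage comment are made up for illustration:

```shell
# Keep only kernel log lines that record an OOM kill.
# Typical lines contain "invoked oom-killer" or "Out of memory: Killed process".
oom_events() {
    grep -iE 'oom-killer|out of memory'
}

# Usage on each node (not run automatically here):
#   dmesg -T | oom_events
#   journalctl -k --since "2024-05-12" | oom_events
```

If the timestamps on NODE1 and NODE3 both fall inside the backup window, that points to a correlation, although it does not show which process caused the memory growth.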

Lars_Erik_Dangvard_J · May 14, 2024, 8:08am

We use netdata to collect metrics, and there’s a small spike in queries compared to almost none right before xtrabackup starts.

We’ll try adding more memory to avoid OOM kills, but I’m still wondering whether, and how, the xtrabackup process could have caused an OOM kill on the other node, which wasn’t directly involved in the backup.

After restoring, PXC memory usage is only around 7 GB RAM on each node.

Some posts state that cgroup-based resource management of PXC nodes can cause OOM kills, since PXC might not respect the allocated resources or work well with them. So we’ll probably try disabling cgroups too (if anyone can confirm this, please do).
