PXC has been running for some time, and nightly backups have run with no issues. But the last backup crashed the cluster: 2 of the 3 nodes in our PXC 8.0 cluster crashed during an XtraBackup (xbcloud) run.
We run xbcloud with --parallel=10 in this command:
xtrabackup --backup --host=NODE3 --port=3306 --user=BACKUPUSER --password=PASSWORD --stream=xbstream --read-buffer-size=200000000 --extra-lsndir=/checkpoint --target-dir=/checkpoint | xbcloud put --storage=s3 \
--s3-endpoint='ENDPOINT' \
--s3-access-key='ACCESSKEY' \
--s3-secret-key='SECRETKEY' \
--s3-bucket='BASE' \
--parallel=10 \
BASE
NODE3 is the node the backup runs against. PXC on NODE3 has 14 GB RAM assigned (small databases), and the separate backup process with --parallel=10 consumed more memory than the server (16 GB) had left, so mysqld was OOM-killed. This makes sense.
But NODE1 also crashed with an OOM kill.
Can an XtraBackup process crash a node other than the one it is backing up?
--read-buffer-size=200000000 ← That’s nuts. Why do you have this? I’ve never seen this needed before, even on 1+ TB systems.
You have --parallel=10 on xbcloud, but not on xtrabackup. You need --parallel on both.
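For illustration, here is a rough sketch of the original pipeline with --parallel set on both sides. THREADS=4 is an assumed starting value, not something taken from this thread, and --read-buffer-size is simply left at its default here. The placeholders (NODE3, BACKUPUSER, ENDPOINT and so on) are the ones from the first post.

```shell
#!/bin/sh
# Sketch only: the pipeline from the first post with --parallel on both sides.
# THREADS=4 is an assumed starting point, not a value given in this thread.
THREADS=4

run_backup() {
  xtrabackup --backup --host=NODE3 --port=3306 \
    --user=BACKUPUSER --password=PASSWORD \
    --stream=xbstream --parallel="$THREADS" \
    --extra-lsndir=/checkpoint --target-dir=/checkpoint \
  | xbcloud put --storage=s3 \
    --s3-endpoint='ENDPOINT' \
    --s3-access-key='ACCESSKEY' \
    --s3-secret-key='SECRETKEY' \
    --s3-bucket='BASE' \
    --parallel="$THREADS" \
    BASE
}

# Call run_backup from cron or a wrapper script once the values are filled in.
```

Starting low and raising THREADS while watching memory on the backup node is probably safer than starting at 10 on a host with little headroom.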
Streaming to object storage is faster with a high value. With the default --read-buffer-size the stream is very slow and does not speed up (or would it with --parallel on xtrabackup too?).
But can a high value or a missing parameter on xtrabackup also cause another node to crash with an OOM kill during a period of very low PXC load?
I’ve never seen/heard of that happening. Do you have PMM gathering metrics on the hosts to see if the increase in memory correlates to the start/duration of the backup?
We use netdata to collect metrics, and it shows a small spike in queries, up from almost none, right before xtrabackup starts.
We’ll try adding more memory to avoid OOM kills, but I’m still wondering whether, and how, the xtrabackup process contributed to the OOM kill on the other node, which xtrabackup did not touch directly.
After restoring, PXC memory usage is only around 7 GB RAM on each node.
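One way to check the timing on NODE1 against the backup window is to pull the OOM-killer messages out of the kernel log on each node. This is a minimal sketch; oom_kills is a made-up helper name, and it assumes the kernel logs the usual "Killed process <pid> (<name>)" line.

```shell
#!/bin/sh
# Filter kernel log text on stdin down to OOM-killer victim lines.
# Keeps the whole line so the timestamp and anon-rss figures stay visible.
oom_kills() {
  grep -E 'Killed process [0-9]+ \('
}

# Usage on each node (may need root):
#   dmesg -T | oom_kills
#   journalctl -k --since yesterday | oom_kills
```

If the NODE1 timestamp falls well outside the backup run, the two kills are probably unrelated; if it lines up, the memory figures in those lines show how large mysqld was when it was killed.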
Some posts claim that cgroup management of PXC nodes can cause OOM kills, since PXC might not respect the resources allocated to it. So we’ll probably try disabling cgroups too (if anyone can confirm this, please do).
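Before disabling anything, it may be worth checking whether mysqld actually runs under a cgroup memory limit and whether that cgroup has recorded OOM kills. The sketch below assumes cgroup v2 mounted at /sys/fs/cgroup and a process named mysqld; both function names are made up for this example.

```shell
#!/bin/sh
# Extract the cgroup v2 path from a /proc/<pid>/cgroup file.
# On cgroup v2 that file holds one line such as "0::/system.slice/mysql.service".
cgroup_v2_path() {
  sed -n 's/^0:://p' "$1"
}

# Print the memory limit and OOM counters for the cgroup mysqld runs in.
# Assumes cgroup v2 at /sys/fs/cgroup and that mysqld is not in the root cgroup.
mysqld_cgroup_report() {
  pid=$(pidof -s mysqld) || return 1
  cg=$(cgroup_v2_path "/proc/$pid/cgroup")
  echo "cgroup: $cg"
  echo "memory.max: $(cat "/sys/fs/cgroup$cg/memory.max")"
  grep -E '^oom(_kill)? ' "/sys/fs/cgroup$cg/memory.events"
}

# Usage on each node: mysqld_cgroup_report
```

A memory.max of "max" means no limit is set on that cgroup, in which case an OOM kill would have come from the host running out of memory rather than from a cgroup limit.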