Hi,
It is no secret that the whole cluster becomes frozen if any node runs out of disk space.
This behavior is described here: [PXC-1871] LP #1525300: Whole cluster freezes if one node goes full - Percona JIRA
There was also a mention that this behavior is expected, so the issue “will not be fixed”.
But why is this behavior “expected”? If one node fails, it means the transaction can be applied on all nodes except the problematic one. This should trigger CEV voting or something similar, IMO.
Can anyone advise whether there is a workaround for this situation, or whether a fix is planned? Thanks.
Hi @Oleksandr_Bezpiatov,
Just because a node is out of disk space does not mean it is a failed node. That node can still serve SELECT queries, and if data is removed, the freed space can be used for new writes. Since a node without free space is still a valid member for answering queries, it must ack any new writes. Since it cannot ack writes, the cluster cannot move forward and stalls, but the node can still respond to normal heartbeats, which tells the other nodes that it is OK and is a functioning member.
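When the stall is caused by a node that cannot keep up, the healthy nodes usually show it in the Galera flow-control counters. Here is a minimal sketch of how you might watch those counters, assuming Python with the pymysql driver; the host and monitoring credentials are placeholders:

```python
# Minimal sketch: poll Galera flow-control status to spot a stalled cluster.
# Assumes pymysql is installed and the monitoring user may run SHOW STATUS;
# host, user and password below are placeholders.
import pymysql

def wsrep_status(host="127.0.0.1", user="monitor", password="secret"):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_flow_control_paused', 'wsrep_local_recv_queue')"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()

if __name__ == "__main__":
    # wsrep_flow_control_paused near 1.0 means the cluster spends almost all
    # of its time paused, i.e. some node cannot keep up with the write load.
    print(wsrep_status())
```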
The best workaround is to be proactive and use PMM to monitor free disk space so you are alerted before the issue arises.
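If PMM alerting is not in place yet, even a tiny scheduled check can give that early warning. A rough illustration only (not a Percona recommendation; the data directory path and the 10% threshold are assumptions):

```python
# Minimal sketch: warn when the MySQL data directory filesystem is nearly full.
# DATADIR and MIN_FREE_RATIO are illustrative; hook the warning into whatever
# alerting you already use.
import shutil
import sys

DATADIR = "/var/lib/mysql"   # assumed data directory
MIN_FREE_RATIO = 0.10        # alert below 10% free space

usage = shutil.disk_usage(DATADIR)
free_ratio = usage.free / usage.total
if free_ratio < MIN_FREE_RATIO:
    print(f"WARNING: only {free_ratio:.1%} free on {DATADIR}", file=sys.stderr)
    sys.exit(1)
```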
Yep, we are monitoring free disk space proactively, but issues can still occur and disk space can drop dramatically at any time (that is our case).
IMO, PMM + alerting is not that closely related to high availability (which is what Percona XtraDB Cluster was made for), since high availability should follow the rule of minimizing manual actions taken by humans and maximizing automation.
Stalling a read-write cluster because one node is out of disk space is a typical case where the cluster should act proactively to detect and resolve the problematic node. This is still just my opinion, but on a live, highly loaded system such issues can occur even if we have dedicated monitoring SRE people (we do). In that case, to unblock the stalled cluster we need to shut down mysql on the problematic node manually, and this also takes time to happen.
What I did on this is create a cluster-wide DB and a single table. Every minute or two, a local script connects via localhost or 127.0.0.1 and writes to this table; the table has a timestamp and a node name field. The script times the transaction, and if the insert takes more than XXX seconds it force-kills the mysql process (yes, I am aware this is generally not a good thing to do, and it will more than likely require an SST to recover). I run this on all of my systems except the very last server in my “failover” chain (you can run it on all the systems, but if I am down to one node I would rather still have SOME reads work). This general process seems to work pretty well and helps in other cases where a cluster stall happens for some weird reason. It is one of my “last defense” items that is pretty drastic, but it works well as long as you understand what you are doing.
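In case it helps anyone, below is a rough Python sketch of that kind of heartbeat watchdog. It is not the actual script from the post above; the table, credentials, timeout and kill command are all assumptions you would have to adapt, and (as noted) killing mysqld this way will most likely force an SST:

```python
# Rough sketch of the heartbeat watchdog described above. Assumptions:
# pymysql driver, a pre-created table watchdog.heartbeat(node VARCHAR, ts TIMESTAMP),
# and permission to kill mysqld. Run it from cron every minute or two.
import socket
import subprocess
import time

import pymysql

TIMEOUT_SECONDS = 30  # illustrative threshold; tune for your workload

def heartbeat_ok():
    """Insert a heartbeat row via localhost and report whether it finished in time."""
    start = time.monotonic()
    try:
        conn = pymysql.connect(host="127.0.0.1", user="watchdog",
                               password="secret", database="watchdog",
                               connect_timeout=TIMEOUT_SECONDS,
                               read_timeout=TIMEOUT_SECONDS,
                               write_timeout=TIMEOUT_SECONDS)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO heartbeat (node, ts) VALUES (%s, NOW())",
                    (socket.gethostname(),),
                )
            conn.commit()  # on a stalled cluster this hangs until read_timeout fires
        finally:
            conn.close()
    except Exception:
        return False
    return (time.monotonic() - start) <= TIMEOUT_SECONDS

if __name__ == "__main__":
    if not heartbeat_ok():
        # Drastic last-resort step from the post above: force-kill the local
        # mysqld so the rest of the cluster can move forward. Expect an SST later.
        subprocess.run(["pkill", "-9", "mysqld"])
```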