It is no secret that the whole cluster freezes if any node runs out of disk space.
This behavior is described here: [PXC-1871] LP #1525300: Whole cluster freezes if one node goes full - Percona JIRA
It was also mentioned there that the behavior is expected, so the issue “will not be fixed”.
But why is this behaviour “expected”? If one node fails, it means that the transaction can be applied on all nodes except the problematic one. This should run into CEV voting or something, IMO.
Can anyone advise whether there is a workaround for this situation, or whether a fix is planned? Thanks.
Just because a node is out of disk space does not mean it is a failed node. That node can still serve SELECT queries, and if data is removed, the freed space can be used for new writes. Since a node without free space is still a valid member that can answer queries, it must ack any new writes. Because it cannot ack writes, the cluster can’t move forward and stalls, yet the node still responds to normal heartbeats, which tells the other nodes that it is OK and a functioning member.
The best workaround is to be proactive and use PMM to monitor free disk space, so that you are alerted before the issue arises.
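For anyone who wants to see what such a check boils down to outside of PMM, here is a minimal Python sketch. It is not how PMM implements its alerting. The function names and the 20% / 10% thresholds are made up for illustration, and the data directory path in the note below is an assumption about a default install.

```python
import shutil


def classify(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to an alert level (thresholds are illustrative)."""
    if free_pct < crit_pct:
        return "critical"
    if free_pct < warn_pct:
        return "warning"
    return "ok"


def disk_status(path, warn_pct=20.0, crit_pct=10.0):
    """Return (alert_level, free_pct) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_pct = 100.0 * usage.free / usage.total
    return classify(free_pct, warn_pct, crit_pct), free_pct
```

Calling `disk_status("/var/lib/mysql")` from a cron job or an exporter would give a level to alert on, assuming that path is where the data directory lives on your nodes.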
Yep, we are monitoring free disk space proactively, but issues can still occur, and free disk space can drop dramatically at any time (that is our case).
IMO, PMM plus alerting is not that closely related to High Availability (which is what Percona Cluster was made for), since High Availability should follow the rule of minimizing manual actions taken by humans and maximizing automation.
A stalled RW cluster because one node is out of disk space is a typical case where the cluster should act proactively to detect and resolve the problematic node. That is still just my opinion, but on a live, highly loaded system such issues can still occur, even with dedicated SRE people doing the monitoring (we have them). In this case, to unblock the stalled cluster we need to shut down mysql on the problematic node manually, and that also takes some time.
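Until something like this exists in the cluster itself, the manual shutdown step could in principle be scripted on each node. Below is a minimal Python sketch of that idea, not a PXC feature. The service name in `STOP_CMD`, the function names, and the byte threshold are all assumptions that would need adjusting for a real setup.

```python
import shutil
import subprocess

# Assumption: the node runs as a systemd service called "mysql".
STOP_CMD = ["systemctl", "stop", "mysql"]


def should_evict(free_bytes, min_free_bytes):
    """True when free space has dropped below the safety margin."""
    return free_bytes < min_free_bytes


def watchdog_once(datadir, min_free_bytes, free_bytes_fn=None, stop_fn=None):
    """Check free space once; stop the local node if it is too low.

    `free_bytes_fn` and `stop_fn` are injectable so the logic can be
    exercised without touching a real disk or service.
    Returns True if the stop action was triggered.
    """
    if free_bytes_fn is None:
        free_bytes_fn = lambda p: shutil.disk_usage(p).free
    if stop_fn is None:
        stop_fn = lambda: subprocess.run(STOP_CMD, check=True)

    if should_evict(free_bytes_fn(datadir), min_free_bytes):
        stop_fn()
        return True
    return False
```

Run periodically (cron or a systemd timer), this takes the node out before it hits zero free space and stalls everyone else. The threshold has to leave enough headroom for a clean shutdown, and it is worth making sure the remaining nodes still form a quorum before letting anything like this act on its own.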