Why does a disk full on one node kill the whole cluster?

Hello

I have a cluster with 3 nodes (same hardware).
Each night I take a backup: I desync one node, run mysqldump on it, then resync it.
Last night the backup filled all the disk space because one database had grown too much.
After that, the cluster stopped working for 4 hours…
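
For context, the backup job is roughly this (simplified sketch in Python; user, password and paths are placeholders, error handling trimmed):

#!/usr/bin/env python3
# Simplified sketch of the nightly backup job (credentials/paths are placeholders).
import subprocess

def run_sql(statement):
    # Run a statement on the local node through the mysql CLI.
    subprocess.run(["mysql", "-u", "backup_user", "-pSECRET", "-e", statement], check=True)

# 1. Take the node out of sync so the dump does not stall the rest of the cluster.
run_sql("SET GLOBAL wsrep_desync = ON;")

# 2. Dump all databases to the backup volume.
with open("/backup/nightly.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "-u", "backup_user", "-pSECRET",
         "--all-databases", "--single-transaction"],
        stdout=out, check=True)

# 3. Put the node back in sync with the group.
run_sql("SET GLOBAL wsrep_desync = OFF;")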

Log from node 1 and node 2:

2024-10-06T01:01:43.829656Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) desyncs itself from group
2024-10-06T01:29:35.506610Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) resyncs itself to group.
2024-10-06T01:31:37.180598Z 216392627 [Warning] [MY-000000] [Server] Too many connections
2024-10-06T01:31:37.187622Z 216392628 [Warning] [MY-000000] [Server] Too many connections
2024-10-06T01:31:37.195104Z 216392629 [Warning] [MY-000000] [Server] Too many connections

Log from node 3 (backup):

2024-10-06T01:13:18.643593Z 8272890 [Warning] [MY-000000] [WSREP] Percona-XtraDB-Cluster doesn't recommend use of LOCK TABLE/FLUSH TABLE <table> WITH READ LOCK/FOR EXPORT with pxc_strict_mode = PERMISSIVE
2024-10-06T01:13:22.697674Z 8272901 [Warning] [MY-000000] [WSREP] Percona-XtraDB-Cluster doesn't recommend use of LOCK TABLE/FLUSH TABLE <table> WITH READ LOCK/FOR EXPORT with pxc_strict_mode = PERMISSIVE
2024-10-06T01:34:55.747960Z 13 [ERROR] [MY-000035] [Server] Disk is full writing './mysql-bin.007955' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
2024-10-06T01:35:00.399507Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) resyncs itself to group.
2024-10-06T01:35:00.400030Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 1035338093)
2024-10-06T01:35:00.400827Z 0 [Note] [MY-000000] [Galera] Processing event queue:...  0.0% (     0/149333 events) complete.
2024-10-06T01:44:55.887140Z 13 [ERROR] [MY-000035] [Server] Disk is full writing './mysql-bin.007955' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
[...]
2024-10-06T03:14:57.133072Z 13 [ERROR] [MY-000035] [Server] Disk is full writing './mysql-bin.007955' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
2024-10-06T06:20:27.911543Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.36-28.1).
2024-10-06T06:20:27.911682Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN

When I restarted node 3 after deleting some files, everything went back to normal: node 1 & node 2 came back online, and node 3 did an SST to resync.

I’ll modify my script so it does not resync node 3 if the disk is full after the backup, but why doesn’t Galera / Percona handle such a simple case?
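
Something like this, probably (sketch only; the 10 GiB threshold and the data directory path are placeholders):

import shutil
import subprocess

# Threshold is an example; pick whatever margin the binlogs need on this volume.
MIN_FREE_BYTES = 10 * 1024**3  # 10 GiB

usage = shutil.disk_usage("/var/lib/mysql")
if usage.free >= MIN_FREE_BYTES:
    # Safe to rejoin the group.
    subprocess.run(["mysql", "-u", "backup_user", "-pSECRET",
                    "-e", "SET GLOBAL wsrep_desync = OFF;"], check=True)
else:
    # Stay desynced and alert instead of letting the node wedge the cluster.
    print(f"Only {usage.free} bytes free on /var/lib/mysql; skipping resync.")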

Thanks

Yathus

Just found this topic: Cluster freezes if one node’s disk is full

So it seems this is expected behavior…

Hi @Yathus,
Yep, this is expected behavior. Just because a node is out of disk space does not mean the node must abort: that node can still answer read-only queries. If you think that is strange, check out the ‘innodb_read_only’ parameter, which lets you run MySQL from a CD-ROM (100% no disk writes, same situation as a full disk).

Be proactive and properly monitor your disk space with Percona Monitoring and Management.

@Yathus What I did for this is create a cluster-wide database with a single table. Every minute or two, a script on each node connects via localhost or 127.0.0.1 and writes to that table; it has a timestamp and a node-name field. The script times the transaction, and if the insert takes more than XXX seconds it force-kills the local mysql process (yes, I am aware this is generally not a good thing to do, and it will more than likely require an SST to recover). I run this on all of my systems except the very last server in my “failover” chain (you can run it on all the systems, but if I am down to one node I would rather still have SOME reads work). This general process seems to work pretty well and also helps in other cases where a cluster stall happens for some weird reason. It is one of my “last defense” items: pretty drastic, but it works well as long as you understand what you are doing.
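
Roughly along these lines (simplified sketch; the credentials, heartbeat table, and timeout value are placeholders for what I actually use):

#!/usr/bin/env python3
# Watchdog sketch: time a heartbeat INSERT on the local node and kill mysqld if
# it stalls. Assumed schema: monitor.heartbeat (node_name VARCHAR(64), ts DATETIME).
import socket
import subprocess
import time

MAX_SECONDS = 30  # example threshold

insert = (
    "INSERT INTO monitor.heartbeat (node_name, ts) "
    f"VALUES ('{socket.gethostname()}', NOW());"
)

start = time.monotonic()
try:
    subprocess.run(
        ["mysql", "-h", "127.0.0.1", "-u", "monitor_user", "-pSECRET", "-e", insert],
        check=True, timeout=MAX_SECONDS + 10)
    elapsed = time.monotonic() - start
except subprocess.TimeoutExpired:
    elapsed = MAX_SECONDS + 10

if elapsed > MAX_SECONDS:
    # Drastic by design: a stalled write usually means the whole cluster is
    # blocked on this node, so force-kill and accept the SST on restart.
    subprocess.run(["pkill", "-9", "mysqld"])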