Why does a full disk on one node kill the cluster?

Hello

I have a cluster with 3 nodes (same hardware).
Each night I take a backup: I desync one node, run mysqldump, and resync it.
Last night, the backup filled all the disk space because one database had grown too much.
After that, the cluster stopped working for 4 hours…

Log from node1 and node2:

2024-10-06T01:01:43.829656Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) desyncs itself from group
2024-10-06T01:29:35.506610Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) resyncs itself to group.
2024-10-06T01:31:37.180598Z 216392627 [Warning] [MY-000000] [Server] Too many connections
2024-10-06T01:31:37.187622Z 216392628 [Warning] [MY-000000] [Server] Too many connections
2024-10-06T01:31:37.195104Z 216392629 [Warning] [MY-000000] [Server] Too many connections

Log from node 3 (backup):

2024-10-06T01:13:18.643593Z 8272890 [Warning] [MY-000000] [WSREP] Percona-XtraDB-Cluster doesn't recommend use of LOCK TABLE/FLUSH TABLE <table> WITH READ LOCK/FOR EXPORT with pxc_strict_mode = PERMISSIVE
2024-10-06T01:13:22.697674Z 8272901 [Warning] [MY-000000] [WSREP] Percona-XtraDB-Cluster doesn't recommend use of LOCK TABLE/FLUSH TABLE <table> WITH READ LOCK/FOR EXPORT with pxc_strict_mode = PERMISSIVE
2024-10-06T01:34:55.747960Z 13 [ERROR] [MY-000035] [Server] Disk is full writing './mysql-bin.007955' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
2024-10-06T01:35:00.399507Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc7-hqn) resyncs itself to group.
2024-10-06T01:35:00.400030Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 1035338093)
2024-10-06T01:35:00.400827Z 0 [Note] [MY-000000] [Galera] Processing event queue:...  0.0% (     0/149333 events) complete.
2024-10-06T01:44:55.887140Z 13 [ERROR] [MY-000035] [Server] Disk is full writing './mysql-bin.007955' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
[...]
2024-10-06T03:14:57.133072Z 13 [ERROR] [MY-000035
2024-10-06T06:20:27.911543Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.36-28.1).
2024-10-06T06:20:27.911682Z 0 [Note] [MY-000000] [WSREP] Received shutdown signal. Will sleep for 10 secs before initiating shutdown. pxc_maint_mode switched to SHUTDOWN

When I just rebooted “node 3” after deleting some files, everything went fine again: node1 and node2 came back online, and node3 did an SST to resync.

I’ll modify my script to not resync node 3 if the disk is full after the backup, but why don’t Galera / Percona handle this simple case?

Thanks

Yathus

Just found this topic: Cluster freezes if one node's disk is full

If it’s expected…

Hi @Yathus,
Yep, this is expected behavior. Just because a node is out of disk space does not mean the node must abort. That node can still answer read-only queries. If you think that is strange, check out the ‘innodb_read_only’ parameter, which lets you run MySQL from a CD-ROM (100% no disk writes; same as disk full).

Be proactive and properly monitor your disk space with Percona Monitoring and Management.
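
If you do change the backup script the way you describe, the disk check could look roughly like the Python sketch below. The function name `should_resync` and the 10% threshold are assumptions made up for illustration. They are not part of your setup or of any Percona tooling, so treat this as a starting point rather than a tested procedure.

```python
import shutil


def should_resync(datadir: str, min_free_fraction: float = 0.10) -> bool:
    """Return True when the filesystem holding datadir has enough free space.

    min_free_fraction is a hypothetical threshold (10% by default); tune it
    to the amount of binary logs and queued events the node has to absorb.
    """
    usage = shutil.disk_usage(datadir)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat that as "not safe".
        return False
    return usage.free / usage.total >= min_free_fraction


# Demo on the current directory; in a real script, pass the node's datadir.
if should_resync("."):
    print("enough free space: resync the node")
else:
    print("disk almost full: keep the node desynced and raise an alert")
```

In the backup script, call this after mysqldump finishes and only resync (for example by setting `wsrep_desync` back to OFF, assuming that is how your script desyncs the node) when it returns True. The same check run before the dump starts could stop the backup from filling the disk in the first place.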