We have a three-node Galera cluster running on VMs, with the data filesystem on ExtremeIO storage, which suffered a failure this morning. The triggering event appears to have been a storage-level error that prevented node 3 from creating a new binlog file; in response, mysqld declared that it was ceasing all logging. Some time afterward, nodes 1 and 2 simultaneously failed to commit a set of updates, declared themselves inconsistent, and shut down, whereupon node 3 lost quorum and declared itself non-primary.
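To make the last step of that sequence concrete, here is a minimal sketch of a strict-majority quorum check. It assumes equal node weights and treats both departures as non-graceful; Galera's real calculation is weighted and handles graceful departures differently, so the function name `has_quorum` and the numbers below are purely illustrative.

```python
def has_quorum(reachable: int, last_primary_size: int) -> bool:
    """Simplified strict-majority rule (assumption: all nodes have equal weight).

    A component stays primary only if it can reach more than half of the
    members of the last known primary component.
    """
    return 2 * reachable > last_primary_size


# Hypothetical timeline for a three-node cluster.
cluster_size = 3
all_nodes_up = has_quorum(3, cluster_size)     # nodes 1, 2 and 3 reachable
only_node3_left = has_quorum(1, cluster_size)  # nodes 1 and 2 have shut down

print("all nodes up, primary:", all_nodes_up)
print("only node 3 left, primary:", only_node3_left)
```

Under this simplified rule, a lone survivor out of three cannot form a majority, which matches node 3 going non-primary once nodes 1 and 2 were gone.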
As we all know, Galera uses ROW-format replication data. At what level does Galera obtain that data, and at what level does logging get shut off in response to a storage-level failure like the one described here? Would mysqld disabling all logging cause Galera replication from node 3 to fail? Our working theory at present is that nodes 1 and 2 failed because they attempted to update rows which had been written on node 3 but never replicated to nodes 1 and 2, because the binary logging failure on node 3 also disabled outgoing Galera replication from node 3. Does this hypothesis make sense?