Cluster failure following a filesystem-level error

unixronin (Contributor)
We have a three-node cluster, on VMs using ExtremeIO storage for the data filesystem, which suffered a failure this morning. The event that triggered the failure appears to have been a storage-level error which caused node 3 to fail to create a new binlog file, in response to which mysqld declared that it was ceasing all logging. Some time afterward, nodes 1 and 2 experienced simultaneous failures to commit a set of updates, declared themselves inconsistent, and shut down, whereupon node 3 lost quorum and declared itself non-primary.

Galera uses row-based replication data, as we all know. At what level does Galera obtain that data, and at what level does logging get shut off in response to a storage-level failure like the one described here? Would mysqld disabling all logging cause Galera replication from node 3 to fail? Our working theory at present is that nodes 1 and 2 failed because they attempted to update rows which had been written by node 3 but never replicated to nodes 1 and 2, because the binary logging failure on node 3 also disabled outgoing Galera replication from node 3. Does this hypothesis make sense?

Comments

  • przemek (Percona Support Engineer, Percona Staff)
    First of all, binary logs are not required in Galera, although they are useful, for example, for backups that support point-in-time recovery (PITR). Either way, binary log files are not used for replication in a Galera cluster.
    If nodes 1 and 2 failed because they could not write to disk, then it is normal that they had to abort. Even standalone MySQL with InnoDB will not work when the filesystem is in read-only mode, unless it has been specifically prepared for that case beforehand.
    Because two of the three nodes failed uncleanly, the remaining node, even if its hardware was healthy, had to stop accepting queries once it lost quorum. If this third node was still OK, you could force it to become primary again by bootstrapping it manually (this can be done online).

    The idea of high availability with Galera (PXC) is that each node should run on independent hardware, so a single storage-level failure should NOT affect a majority of the nodes at the same time.
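
To make the quorum reasoning above concrete, here is a minimal Python sketch of the majority rule. It is a simplified model, not Galera's actual implementation: it assumes every node has equal weight and that the failed nodes left uncleanly, so the survivor is still measured against the previous three-node component. The function name `has_quorum` is hypothetical and chosen only for illustration.

```python
def has_quorum(reachable_nodes: int, previous_component_size: int) -> bool:
    """Simplified majority check (assumption: all nodes have equal weight).

    A group of nodes stays primary only if it holds a strict majority
    of the last known primary component.
    """
    if previous_component_size <= 0:
        raise ValueError("previous_component_size must be positive")
    if reachable_nodes < 0:
        raise ValueError("reachable_nodes must not be negative")
    return 2 * reachable_nodes > previous_component_size


if __name__ == "__main__":
    # Scenario from this thread: nodes 1 and 2 stop uncleanly, node 3 is alone.
    print("node 3 alone, previous size 3:", has_quorum(1, 3))
    # Had only one node failed, the remaining two would have kept quorum.
    print("two nodes left, previous size 3:", has_quorum(2, 3))
```

On a real cluster, the `wsrep_cluster_status` status variable shows whether a node considers itself Primary or non-Primary. The online bootstrap mentioned above is usually done with `SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';`, run only on a node whose data you trust.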

MySQL, InnoDB, MariaDB and MongoDB are trademarks of their respective owners.
Copyright ©2005 - 2020 Percona LLC. All rights reserved.