MySQL cluster failed after a catastrophic power failure and the backup generator failed to feed the datacenter

First of all, I’m a newbie to Linux, MySQL, Percona XtraDB Cluster, and everything related; I come from the pure Windows Servers & Networking field. So please bear with my long (and possibly silly) questions here, I’m trying hard to build up Linux and MySQL skills now. :stuck_out_tongue:

Allow me to get to the point: the person who set up the MySQL cluster left months ago, and the cluster kept running without any issue since then. Unfortunately, last week the main power went down unexpectedly while a storm was hitting the city (our HQ campus), our UPS could not hold all the load for that long, and even worse, the power generator also broke down… and bam! All infrastructure equipment shut down as if we had pulled out all the plugs.

We have 4 nodes (yes, 4 nodes) and one cluster interface server (int) running CentOS 7 + Percona XtraDB Cluster + HAProxy. The two nodes (n1 & n2) sitting on our HQ campus went down, while the other two (n3 & n4) at the branch campus stayed up. When power was restored at HQ, all three servers (n1, n2, and int) refused to auto power on from the VMware vCenter setup. I didn’t notice that until our automation script sent an alert saying it could not perform its job because it could not reach the DB through the interface server on port 3306.

I’ve tried many methods found on the Percona forum as well as on Experts Exchange, Stack Overflow, and so on, with no luck. The cluster seems completely dead and I am completely frustrated, with no idea what to do, so I registered on this forum to seek help from the experts. Please kindly help me.

Any additional information needed, please do let me know.

Appreciated any support/advice you can provide.

That’s an unpleasant story.

To provide any response I would need to see the log files from all nodes, and also to know exactly what you are doing and what errors you see.

Hello Vadimtk,

Thank you for reading my post and replying.

However, I already found a way to rebuild and repair the cluster:

  1. Search for the highest seqno (Galera sequence number) on all the nodes (see the sketch after this list).
  2. Power off all nodes except the one with the highest seqno.
  3. On that node, execute: systemctl start mysql@bootstrap.service
  4. Wait... and cross your fingers, until the MySQL service on the bootstrap node is up.
  5. Power on the rest of the nodes one at a time and keep an eye on the cluster sync status, for example by running clustercheck and cat /var/lib/mysql/grastate.dat to check the saved Galera state information.

*** My steps above may be wrong or not exactly what a Linux expert would do when investigating the same issue, but this is what I found and it worked in my case. :slight_smile: ***
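For anyone who lands on this thread later, here is a rough sketch of how steps 1 and 3 could be scripted. This is only an illustration, not my actual run: the host names n1–n4, passwordless SSH access, and the default /var/lib/mysql data directory are all assumptions, so adjust them to your own environment.

```bash
#!/usr/bin/env bash
# Sketch: compare the Galera seqno recorded on each node to decide which one
# to bootstrap. Assumes passwordless SSH to hypothetical hosts n1..n4 and the
# default data directory /var/lib/mysql.

NODES="n1 n2 n3 n4"

for node in $NODES; do
    # grastate.dat stores the last committed seqno; -1 usually means an unclean shutdown.
    seqno=$(ssh "$node" "grep '^seqno:' /var/lib/mysql/grastate.dat | awk '{print \$2}'")
    echo "$node seqno=$seqno"
done

# On the node that reports the highest seqno, bootstrap the cluster:
#   systemctl start mysql@bootstrap.service
# Then start MySQL on the remaining nodes one at a time and watch them sync:
#   systemctl start mysql
#   clustercheck
#   cat /var/lib/mysql/grastate.dat
```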

Thank you

Teerayuth

I would not power off the nodes; just shutting down mysqld is enough. Otherwise your steps are how I would do it.
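In other words, on each node that is not the bootstrap node, something like this is enough (assuming the same mysql systemd unit name used by the bootstrap service above):

```bash
# Sketch only: stop just the MySQL service on a non-bootstrap node
# instead of powering the whole machine off.
systemctl stop mysql
```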

Hello Vadimtk,

Thank you so much for your kind advice; I will remember that and follow the proper procedure next time.

Have a nice day!

Regards,

Teerayuth