Auto-recovery of a 3-VM Percona XtraDB Cluster after abrupt reboots

Hi Team,

I have a 3-node Percona XtraDB Cluster (MySQL 8.x), with each DB VM running on a different host. When I abruptly reboot two nodes (i.e., reboot 2 nodes in parallel), the cluster breaks and the database becomes unavailable. On checking, the mysql service on those two nodes is in a failed state.

Below is the my.cnf that is present on all the nodes.

root@cp-db1:/var/lib# cat  /etc/mysql/mysql.conf.d/mysqld.cnf
[client]
socket=/var/run/mysqld/mysqld.sock

[mysqld]
ssl-cert=/etc/mysql/ssl/tls.pem
ssl-key=/etc/mysql/ssl/key.pem
datadir=/var/lib/mysql
socket=/var/run/mysqld/mysqld.sock
log-error=/var/log/mysql/error.log
pid-file=/var/run/mysqld/mysqld.pid
bind-address=0.0.0.0
port=3306
binlog_expire_logs_seconds=604800
pxc-encrypt-cluster-traffic=OFF
wsrep_provider=/usr/lib/galera4/libgalera_smm.so

wsrep_cluster_address=gcomm://<ip1>,<ip2>,<ip3>
binlog_format=ROW
#skip-name-resolve
innodb_autoinc_lock_mode=2
wsrep_node_name=cp-db1
wsrep_node_address=<ip1>
wsrep_cluster_name=morpheus
default_storage_engine=InnoDB
wsrep_sync_wait=2
wsrep_provider_options="cert.optimistic_pa=NO"
wsrep_certification_rules="OPTIMIZED"
pxc_strict_mode=PERMISSIVE
wsrep_sst_method=xtrabackup-v2
default_time_zone="+00:00"
max_connections=3001
sql_generate_invisible_primary_key=ON
ssl_fips_mode=ON

The only way I can bring the DB cluster back up is by bootstrapping the DB nodes. Is there a way to implement auto-recovery by setting a configuration in my.cnf? Is auto-recovery recommended for a multi-master replication DB? Can it cause data corruption?

I'd really appreciate your quick recommendations on this.

Yes, that is 100% expected behavior. You removed 2 of the 3 nodes from the cluster, leaving only 1 node, which is less than the majority (more than 50%) needed for quorum.
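The arithmetic behind that: Galera keeps a Primary component only while a strict majority of the last known cluster size is still connected. A minimal sketch of that majority rule (my own illustration, not Percona tooling):

```shell
# Quorum arithmetic for a 3-node Galera/PXC cluster.
cluster_size=3
failed=2
remaining=$((cluster_size - failed))
majority=$((cluster_size / 2 + 1))   # strict majority: 2 of 3

if [ "$remaining" -ge "$majority" ]; then
  state=Primary
else
  state=non-Primary                  # what you observe after rebooting 2 nodes
fi
echo "remaining=$remaining majority=$majority state=$state"
```

With 2 of 3 nodes gone, the single survivor cannot reach the majority of 2, so it drops to non-Primary and stops serving queries.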

Your cluster is now in a non-Primary state and needs to be bootstrapped. Correct, there is no way to auto-recover from this.

Basically, don’t reboot the majority of your nodes at the same time. If you need to, perform a graceful MySQL shutdown first; a gracefully stopped node is removed from the quorum calculation.
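For the bootstrap itself, the usual approach is to find the most advanced node by inspecting grastate.dat and start only that node in bootstrap mode. A sketch with a sample grastate.dat embedded for illustration (on a real node you would read /var/lib/mysql/grastate.dat; the uuid/seqno values below are fabricated):

```shell
# Sketch: pick the bootstrap candidate from grastate.dat.
# Sample file embedded for illustration only; on a real node read
# /var/lib/mysql/grastate.dat instead.
cat > /tmp/grastate.sample <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6e837d42-aaaa-11ee-8888-000000000000
seqno:   1234
safe_to_bootstrap: 1
EOF

safe=$(awk '/^safe_to_bootstrap:/ {print $2}' /tmp/grastate.sample)
seqno=$(awk '/^seqno:/ {print $2}' /tmp/grastate.sample)
echo "seqno=$seqno safe_to_bootstrap=$safe"

if [ "$safe" = "1" ]; then
  # Run only on this one node; the others then rejoin with a normal
  # `systemctl start mysql` (IST/SST as needed).
  echo "bootstrap here: systemctl start mysql@bootstrap.service"
fi
```

Compare seqno across all three nodes and bootstrap the one with the highest value (set its safe_to_bootstrap to 1 first if no node has it, which happens after a hard crash of everything).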

Thank you so much for the quick response. I agree that with 2 of the 3 nodes down, the majority is lost and hence the cluster breaks. However, the situation we hit was not user-triggered but system-triggered: the system rebooted both nodes on its own. During planned maintenance we do bring down the services gracefully.

I was looking for options or automated ways to recover after sudden reboots (like the operator does for pods; see build/pxc-entrypoint.sh in the percona/percona-xtradb-cluster-operator repository on GitHub at commit fc46e369c9cc1bfca4552fdc6b204f9b5b243227). I will also keep exploring ways to do that in the meantime.
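For context, scripts like that entrypoint typically work by running `mysqld --wsrep-recover` on each node and comparing the recovered positions; the node with the highest seqno is the one to bootstrap. A minimal sketch of just the parsing step, using a hard-coded sample log line (a real script would grep this "Recovered position" line out of the error log after the recovery run):

```shell
# Sketch: extract uuid:seqno from the "Recovered position" line that
# `mysqld --wsrep-recover` writes to the error log.
# The line below is a fabricated sample for illustration.
line='WSREP: Recovered position: 6e837d42-aaaa-11ee-8888-000000000000:1234'

seqno=${line##*:}                     # text after the last colon
pos=${line#*Recovered position: }     # "uuid:seqno"
uuid=${pos%:*}                        # strip the ":seqno" suffix

echo "uuid=$uuid seqno=$seqno"
# An automated recovery would collect seqno from every node and
# bootstrap the one with the highest value.
```

Automating the final bootstrap step is riskier outside Kubernetes: the operator can coordinate all pods centrally, whereas a per-node script that bootstraps on its own can split-brain the cluster if two nodes decide to bootstrap independently.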