PXC 8 auto-restart after graceful shutdown

I’ve been experimenting with a pretty stock XtraDB Cluster 8.0.30 install on Ubuntu 22.04, and have had a 3-node cluster up and running without much problem.

I’m interested in how much hand-holding PXC requires if things go bad in the server room, so I tried gracefully shutting down all three nodes and then bringing them back up.

I was hoping that once they could all talk to each other again the cluster would become functional, but all three just sit there. Here’s the /var/log/mysql/error.log output from the first node:

2023-01-31T03:08:36.587890Z 0 [Note] [MY-000000] [WSREP] Starting replication
2023-01-31T03:08:36.587908Z 0 [Note] [MY-000000] [Galera] Connecting with bootstrap option: 0
2023-01-31T03:08:36.587925Z 0 [Note] [MY-000000] [Galera] Setting GCS initial position to 0fff4c84-a109-11ed-a165-0ec17c0a2f1e:11
2023-01-31T03:08:36.587985Z 0 [Note] [MY-000000] [Galera] protonet asio version 0
2023-01-31T03:08:36.594760Z 0 [Note] [MY-000000] [Galera] Using CRC-32C for message checksums.
2023-01-31T03:08:36.594800Z 0 [Note] [MY-000000] [Galera] backend: asio
2023-01-31T03:08:36.594885Z 0 [Note] [MY-000000] [Galera] gcomm thread scheduling priority set to other:0
2023-01-31T03:08:36.594990Z 0 [Note] [MY-000000] [Galera] Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2023-01-31T03:08:36.595009Z 0 [Note] [MY-000000] [Galera] Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2023-01-31T03:08:36.595167Z 0 [Note] [MY-000000] [Galera] GMCast version 0
2023-01-31T03:08:36.595334Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2023-01-31T03:08:36.595353Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2023-01-31T03:08:36.595604Z 0 [Note] [MY-000000] [Galera] EVS version 1
2023-01-31T03:08:36.595688Z 0 [Note] [MY-000000] [Galera] gcomm: connecting to group 'pxc-cluster', peer '10.66.0.111:,10.66.0.112:,10.66.0.113:'
2023-01-31T03:08:36.621682Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8bf1e9c6-bb17 ssl://10.66.0.113:4567
2023-01-31T03:08:36.624408Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8a096083-bd7d ssl://10.66.0.112:4567
2023-01-31T03:08:36.624587Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2023-01-31T03:08:36.624688Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address ssl://10.66.0.111:4567
2023-01-31T03:08:36.631984Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8bf1e9c6-bb17 ssl://10.66.0.113:4567
2023-01-31T03:08:37.098090Z 0 [Note] [MY-000000] [Galera] EVS version upgrade 0 -> 1
2023-01-31T03:08:37.098177Z 0 [Note] [MY-000000] [Galera] declaring 8a096083-bd7d at ssl://10.66.0.112:4567 stable
2023-01-31T03:08:37.098195Z 0 [Note] [MY-000000] [Galera] declaring 8bf1e9c6-bb17 at ssl://10.66.0.113:4567 stable
2023-01-31T03:08:37.098231Z 0 [Note] [MY-000000] [Galera] PC protocol upgrade 0 -> 1
2023-01-31T03:08:37.099285Z 0 [Warning] [MY-000000] [Galera] no nodes coming from prim view, prim not possible
2023-01-31T03:08:37.099332Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,8a096083-bd7d,3)
memb {
	8a096083-bd7d,0
	8bf1e9c6-bb17,0
	8da19da6-a9d8,0
	}
joined {
	}
left {
	}
partitioned {
	}
)
2023-01-31T03:08:40.097352Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') turning message relay requesting off

The other two nodes have basically identical log messages, just differing in IP addresses.

The first node has this in /var/lib/mysql/grastate.dat:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   11
safe_to_bootstrap: 1

While #2 says:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   10
safe_to_bootstrap: 0

and #3 says:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   9
safe_to_bootstrap: 0
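Comparing these, the first node has both the highest seqno and safe_to_bootstrap: 1, which is why it’s the obvious bootstrap candidate. If you gather copies of each node’s grastate.dat in one place, picking the most advanced node can even be scripted — a rough sketch (the file names below are made up for illustration):

```shell
# Compare seqno across copies of each node's grastate.dat and report
# the most advanced one; after a graceful full-cluster shutdown this
# should also be the node carrying safe_to_bootstrap: 1.
best="" best_seqno=-1
for f in node1-grastate.dat node2-grastate.dat node3-grastate.dat; do
    s=$(awk '/^seqno:/ {print $2}' "$f")
    if [ "$s" -gt "$best_seqno" ]; then
        best_seqno=$s
        best=$f
    fi
done
echo "bootstrap candidate: $best (seqno $best_seqno)"
```

Note this comparison only helps after graceful shutdowns; after a crash, seqno is typically -1 on every node and you’d need to recover the real position (e.g. with mysqld’s --wsrep-recover) instead.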

So it seems like the cluster can be bootstrapped from the first node, and indeed if I manually run on that node:

systemctl stop mysql
systemctl start mysql@bootstrap

then they all become happy, and I see a cluster size of 3 with all nodes as “Primary”.

My question is: is there anything that needs to be set up or enabled so that this can be automatic? That is, if a node has “safe_to_bootstrap: 1” when the OS boots, can it actually start “mysql@bootstrap” rather than just “mysql”?

I noticed in systemd that mysql@bootstrap is listed as:

Loaded: loaded (/lib/systemd/system/mysql@.service; disabled; vendor preset: enabled)

I thought it was weird that it was disabled even though the vendor preset is enabled. I tried enabling it so both it and “mysql” started at boot, but that didn’t really help.

Thanks for any suggestions.


Hi @barryp welcome to the Percona forums!

PXC is usually operated as an always-running system. Normally, when you shut down a 3-node cluster one node at a time, the last remaining node maintains PRIMARY status and can continue to serve queries. If you then shut down that last server, you do need to start that instance back up in bootstrap mode. Bootstrapping means the instance assumes there are no other cluster members to join and forms a new cluster.

If you want to automate this scenario, remember that the node with safe_to_bootstrap: 1 must be the first to start, with systemctl start mysql@bootstrap; the other nodes then get the regular systemctl start mysql.

Thanks @Michael_Coburn! I took a stab at automating this and it seems to be working. I’ll share what I did on my Ubuntu 22.04 setup in case someone wants to use it or weigh in:

On each node (yay Ansible!) I added this script as /usr/local/sbin/choose-mysql-service.sh

#!/bin/bash
#
# Decide which MySQL systemd unit to start at boot: if grastate.dat
# marks this node as safe to bootstrap, start mysql@bootstrap so it
# forms the new cluster; otherwise fall back to the regular mysql unit.

GRASTATE="/var/lib/mysql/grastate.dat"

service="mysql"

if [ -f "$GRASTATE" ] && grep --quiet "^safe_to_bootstrap: 1" "$GRASTATE"; then
    service="mysql@bootstrap"
fi

echo "Starting $service"
systemctl start "$service"

Then I added a oneshot systemd unit to execute at boot time, as /etc/systemd/system/choose-mysql-service.service:

[Unit]
Description=Choose MySQL service
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/choose-mysql-service.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Then I disabled the default mysql service and enabled my new unit with:

systemctl daemon-reload
systemctl disable mysql
systemctl enable choose-mysql-service

So now when the OS boots, instead of blindly trying to start mysql, it looks at grastate.dat: if it has safe_to_bootstrap: 1 it starts mysql@bootstrap, and otherwise it falls back to starting the default mysql service.
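The selection logic can also be dry-run without touching a live cluster, by pointing it at a temporary copy of grastate.dat instead of the real datadir — a quick sanity-check sketch:

```shell
# Dry-run the boot script's decision against a temporary file rather
# than the real /var/lib/mysql/grastate.dat, so no running cluster
# (or root access) is needed to verify the branch it would take.
tmp=$(mktemp)
printf 'safe_to_bootstrap: 1\n' > "$tmp"

service="mysql"
if [ -f "$tmp" ] && grep --quiet '^safe_to_bootstrap: 1' "$tmp"; then
    service="mysql@bootstrap"
fi
echo "$service"   # prints: mysql@bootstrap

rm -f "$tmp"
```

Flipping the flag to `safe_to_bootstrap: 0` (or deleting the file) should make it print plain `mysql` instead.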


very cool! oneshot is the best 🙂