PXC 8 auto-restart after graceful shutdown

I’ve been experimenting with a pretty stock XtraDB Cluster 8.0.30 install on Ubuntu 22.04, and have had a 3-node cluster up and running without much problem.

I’m interested in how much hand-holding PXC requires if things go bad in the server room, so I tried gracefully shutting down all three nodes and then bringing them back up.

I was hoping that once they could all talk to each other again the cluster would become functional, but all three just sit there. Here’s the /var/log/mysql/error.log output from the first node:

2023-01-31T03:08:36.587890Z 0 [Note] [MY-000000] [WSREP] Starting replication
2023-01-31T03:08:36.587908Z 0 [Note] [MY-000000] [Galera] Connecting with bootstrap option: 0
2023-01-31T03:08:36.587925Z 0 [Note] [MY-000000] [Galera] Setting GCS initial position to 0fff4c84-a109-11ed-a165-0ec17c0a2f1e:11
2023-01-31T03:08:36.587985Z 0 [Note] [MY-000000] [Galera] protonet asio version 0
2023-01-31T03:08:36.594760Z 0 [Note] [MY-000000] [Galera] Using CRC-32C for message checksums.
2023-01-31T03:08:36.594800Z 0 [Note] [MY-000000] [Galera] backend: asio
2023-01-31T03:08:36.594885Z 0 [Note] [MY-000000] [Galera] gcomm thread scheduling priority set to other:0
2023-01-31T03:08:36.594990Z 0 [Note] [MY-000000] [Galera] Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2023-01-31T03:08:36.595009Z 0 [Note] [MY-000000] [Galera] Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2023-01-31T03:08:36.595167Z 0 [Note] [MY-000000] [Galera] GMCast version 0
2023-01-31T03:08:36.595334Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2023-01-31T03:08:36.595353Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2023-01-31T03:08:36.595604Z 0 [Note] [MY-000000] [Galera] EVS version 1
2023-01-31T03:08:36.595688Z 0 [Note] [MY-000000] [Galera] gcomm: connecting to group 'pxc-cluster', peer '10.66.0.111:,10.66.0.112:,10.66.0.113:'
2023-01-31T03:08:36.621682Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8bf1e9c6-bb17 ssl://10.66.0.113:4567
2023-01-31T03:08:36.624408Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8a096083-bd7d ssl://10.66.0.112:4567
2023-01-31T03:08:36.624587Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2023-01-31T03:08:36.624688Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address ssl://10.66.0.111:4567
2023-01-31T03:08:36.631984Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') connection established to 8bf1e9c6-bb17 ssl://10.66.0.113:4567
2023-01-31T03:08:37.098090Z 0 [Note] [MY-000000] [Galera] EVS version upgrade 0 -> 1
2023-01-31T03:08:37.098177Z 0 [Note] [MY-000000] [Galera] declaring 8a096083-bd7d at ssl://10.66.0.112:4567 stable
2023-01-31T03:08:37.098195Z 0 [Note] [MY-000000] [Galera] declaring 8bf1e9c6-bb17 at ssl://10.66.0.113:4567 stable
2023-01-31T03:08:37.098231Z 0 [Note] [MY-000000] [Galera] PC protocol upgrade 0 -> 1
2023-01-31T03:08:37.099285Z 0 [Warning] [MY-000000] [Galera] no nodes coming from prim view, prim not possible
2023-01-31T03:08:37.099332Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,8a096083-bd7d,3)
memb {
	8a096083-bd7d,0
	8bf1e9c6-bb17,0
	8da19da6-a9d8,0
	}
joined {
	}
left {
	}
partitioned {
	}
)
2023-01-31T03:08:40.097352Z 0 [Note] [MY-000000] [Galera] (8da19da6-a9d8, 'ssl://0.0.0.0:4567') turning message relay requesting off

The other two nodes have basically identical log messages, just differing in IP addresses.

The first node has this in /var/lib/mysql/grastate.dat:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   11
safe_to_bootstrap: 1

While #2 says:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   10
safe_to_bootstrap: 0

and #3 says:

# GALERA saved state
version: 2.1
uuid:    0fff4c84-a109-11ed-a165-0ec17c0a2f1e
seqno:   9
safe_to_bootstrap: 0
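Comparing these, the first node has both the highest seqno and safe_to_bootstrap: 1, which is why it’s the obvious bootstrap candidate. If you gather copies of each node’s grastate.dat in one place, picking the most advanced node can even be scripted — a rough sketch (the file names below are made up for illustration):

```shell
# Compare seqno across copies of each node's grastate.dat and report
# the most advanced one; after a graceful full-cluster shutdown this
# should also be the node carrying safe_to_bootstrap: 1.
best="" best_seqno=-1
for f in node1-grastate.dat node2-grastate.dat node3-grastate.dat; do
    s=$(awk '/^seqno:/ {print $2}' "$f")
    if [ "$s" -gt "$best_seqno" ]; then
        best_seqno=$s
        best=$f
    fi
done
echo "bootstrap candidate: $best (seqno $best_seqno)"
```

Note this comparison only helps after graceful shutdowns; after a crash, seqno is typically -1 on every node and you’d need to recover the real position (e.g. with mysqld’s --wsrep-recover) instead.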

So it seems like the cluster can be bootstrapped from the first node, and indeed if I manually run on that node:

systemctl stop mysql
systemctl start mysql@bootstrap

then they all become happy, and I see a cluster size of 3 with all nodes as “Primary”.

My question is: is there anything that needs to be set up or enabled so that this can be automatic? That is, if a node has “safe_to_bootstrap: 1” when the OS boots, can it actually start “mysql@bootstrap” rather than just “mysql”?

I noticed in systemd that mysql@bootstrap is listed as:

Loaded: loaded (/lib/systemd/system/mysql@.service; disabled; vendor preset: enabled)

I thought it was weird that it was disabled even though the vendor preset is enabled. I tried enabling it so both it and “mysql” started at boot, but that didn’t really help.

Thanks for any suggestions.


Hi @barryp welcome to the Percona forums!

PXC is usually operated as an always-running system. Normally, when you shut down a 3-node cluster one node at a time, the last remaining node maintains PRIMARY status and can continue to serve queries. If you then shut down that last server, you do need to start that instance back up in bootstrap mode. Bootstrapping means the instance assumes there are no other cluster members to join and forms a new cluster.

If you want to automate this scenario, remember that the node with safe_to_bootstrap: 1 must be the first to start, with systemctl start mysql@bootstrap; the other nodes then get the regular systemctl start mysql.

Thanks @Michael_Coburn! I took a stab at automating this and it seems to be working. I’ll share what I did on my Ubuntu 22.04 setup in case someone wants to use it or weigh in:

On each node (yay Ansible!) I added this script as /usr/local/sbin/choose-mysql-service.sh

#!/bin/bash
#
# Decide which MySQL systemd unit to start at boot: if grastate.dat
# marks this node as safe to bootstrap, start mysql@bootstrap so it
# forms the new cluster; otherwise fall back to the regular mysql unit.

GRASTATE="/var/lib/mysql/grastate.dat"

service="mysql"

if [ -f "$GRASTATE" ] && grep --quiet "^safe_to_bootstrap: 1" "$GRASTATE"; then
    service="mysql@bootstrap"
fi

echo "Starting $service"
systemctl start "$service"

Then I added a oneshot systemd unit to execute at boot time, as /etc/systemd/system/choose-mysql-service.service:

[Unit]
Description=Choose MySQL service
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/choose-mysql-service.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Then I disabled the default mysql service and enabled my new unit with:

systemctl daemon-reload
systemctl disable mysql
systemctl enable choose-mysql-service

So now when the OS boots, instead of blindly trying to start mysql, it looks at grastate.dat: if it has safe_to_bootstrap: 1 it starts mysql@bootstrap, and otherwise it falls back to starting the default mysql service.
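The selection logic can also be dry-run without touching a live cluster, by pointing it at a temporary copy of grastate.dat instead of the real datadir — a quick sanity-check sketch:

```shell
# Dry-run the boot script's decision against a temporary file rather
# than the real /var/lib/mysql/grastate.dat, so no running cluster
# (or root access) is needed to verify the branch it would take.
tmp=$(mktemp)
printf 'safe_to_bootstrap: 1\n' > "$tmp"

service="mysql"
if [ -f "$tmp" ] && grep --quiet '^safe_to_bootstrap: 1' "$tmp"; then
    service="mysql@bootstrap"
fi
echo "$service"   # prints: mysql@bootstrap

rm -f "$tmp"
```

Flipping the flag to `safe_to_bootstrap: 0` (or deleting the file) should make it print plain `mysql` instead.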


very cool! oneshot is the best 🙂