Hello, I am trying to deploy MySQL 8 using the Percona XtraDB Cluster Kubernetes Operator (v1.7.0), but I cannot get it to run properly. After I kubectl apply my cluster-cr.yaml, the first database pod (cluster1-pxc-0) spins up, and with cat /var/lib/mysql/grastate.dat inside the container I can see that safe_to_bootstrap is set to 1. After the first pod is successfully deployed, the second pod (cluster1-pxc-1) spins up and the safe_to_bootstrap value of cluster1-pxc-0 changes to 0. After the second pod is ready, a third pod spins up, and I end up with 3 nodes with safe_to_bootstrap: 0, which to my understanding means the cluster is not functional, as there is no replication between the nodes.
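For reference, this is roughly how I check the file (namespace and pod name are from my setup; I am assuming the PXC container is called pxc):
kubectl -n percona-database exec cluster1-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat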
I have gone through this several times with fresh nodes and volumes, and I always get the same result.
This also shows in the backups, which fail in the same way as described in this thread. In fact, I can also see that all 3 nodes are in the "Donor/Desynced" state.
The grastate.dat file does NOT represent the current status of the cluster, so stop examining this file. It is not updated while the cluster is running. The safe_to_bootstrap flag is set to 0 on startup of any node to prevent accidental bootstraps should this node die and restart. The flag is set to 1 only on the last node to shut down cleanly.
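For illustration, cat /var/lib/mysql/grastate.dat on a running node typically shows something like this (values are placeholders and will differ in your setup):
# GALERA saved state
version: 2.1
uuid: <cluster UUID>
seqno: -1
safe_to_bootstrap: 0
After a clean shutdown of the whole cluster, the last node to stop will have safe_to_bootstrap: 1 and a non-negative seqno.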
So, putting grastate.dat aside, the backup pods are still failing, logging the following:
+ peer-list -on-start=/usr/bin/get-pxc-state -service=cluster1-pxc
2021/04/06 16:20:19 Peer finder enter
2021/04/06 16:20:19 Determined Domain to be percona-database.svc.cluster.local
2021/04/06 16:20:19 Peer list updated
was [ ]
now [cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local]
2021/04/06 16:20:19 execing: /usr/bin/get-pxc-state with stdin: cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local
2021/04/06 16:20:19
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
2021/04/06 16:20:20 Peer finder exiting
[ERROR] Cannot find node for backup
+ echo '[ERROR] Cannot find node for backup'
+ exit 1
If I understand the thread that I linked in my original post correctly, the failure is caused by all DB instances being in the "Desynced" state (which I mistakenly thought was caused by grastate.dat). Unfortunately, the thread gives no clue how to proceed. This behaviour is reproducible for my specific setup, i.e. I can delete the percona-cluster-cr from my cluster, rebuild the nodes, redeploy the CR, and it will return to this state.
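For completeness, the redeploy cycle I use looks roughly like this (file name from my setup; the volumes are wiped when I rebuild the nodes):
kubectl -n percona-database delete -f cluster-cr.yaml
# rebuild the nodes / recreate the volumes
kubectl -n percona-database apply -f cluster-cr.yaml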
Unfortunately, I'm not well versed in K8s. It looks like there's something missing with your secrets file? Did you create the right user and deploy it beforehand?
The "xtrabackup" entry is in my secrets file, and I can also find the secret mounted at /etc/mysql/mysql-users-secret/xtrabackup in each of the database containers.
Following a suspicion, I thought that maybe the secrets were unreadable for the backup container, but they are mounted rw-r--r--, so any user should be able to read them.
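This is roughly how I verified it (secret name as in my setup; the base64 decode just confirms the key is populated):
kubectl -n percona-database get secret my-cluster-secrets -o jsonpath='{.data.xtrabackup}' | base64 -d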
After upgrading from 1.7.0 to 1.8.0, my problem persists:
All backup pods fail with the following logs:
now [cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local]
2021/04/28 00:01:43 execing: /usr/bin/get-pxc-state with stdin: cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local
2021/04/28 00:01:43
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
2021/04/28 00:01:44 Peer finder exiting
Now I am not sure where exactly the backup pod looks for the password (i.e. which exact location the logs are referring to), but I can see that the backup pod gets the xtrabackup credentials as an environment variable, not as a file (excerpt from kubectl describe pod):
Environment:
BACKUP_DIR: /backup
PXC_SERVICE: cluster1-pxc
PXC_PASS: <set to the key 'xtrabackup' in secret 'my-cluster-secrets'> Optional: false
ACCESS_KEY_ID: <set to the key 'AWS_ACCESS_KEY_ID' in secret 's3-backup-credentials'> Optional: false
SECRET_ACCESS_KEY: <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 's3-backup-credentials'> Optional: false
DEFAULT_REGION: dbl
ENDPOINT: https://REDACTED
S3_BUCKET: REDACTED
S3_BUCKET_PATH: cluster1-2021-04-28-00:00:34-full
Mounts:
/etc/mysql/ssl from ssl (rw)
/etc/mysql/ssl-internal from ssl-internal (rw)
/etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
/var/run/secrets/kubernetes.io/serviceaccount from percona-xtradb-cluster-operator-token-qzm5p (ro)
So it makes sense to me that the file /etc/mysql/mysql-users-secret/xtrabackup cannot be found in that pod.
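To double-check, I can compare the two directly in the backup pod (the pod name below is a placeholder for whatever kubectl get pods shows for the backup job):
kubectl -n percona-database exec <backup-pod> -- printenv PXC_PASS
kubectl -n percona-database exec <backup-pod> -- ls /etc/mysql/mysql-users-secret/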
When I kubectl describe the pxc pod, I see the xtrabackup password both set in the environment (from the internal-cluster1 secret) and mounted as a file (from the mysql-users-secret-file volume):
Environment:
PXC_SERVICE: cluster1-pxc-unready
MONITOR_HOST: %
MYSQL_ROOT_PASSWORD: <set to the key 'root' in secret 'internal-cluster1'> Optional: false
XTRABACKUP_PASSWORD: <set to the key 'xtrabackup' in secret 'internal-cluster1'> Optional: false
MONITOR_PASSWORD: <set to the key 'monitor' in secret 'internal-cluster1'> Optional: false
LOG_DATA_DIR: /var/lib/mysql
IS_LOGCOLLECTOR: yes
OPERATOR_ADMIN_PASSWORD: <set to the key 'operator' in secret 'internal-cluster1'> Optional: false
Mounts:
/etc/my.cnf.d from auto-config (rw)
/etc/mysql/mysql-users-secret from mysql-users-secret-file (rw)
/etc/mysql/ssl from ssl (rw)
/etc/mysql/ssl-internal from ssl-internal (rw)
/etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
/etc/percona-xtradb-cluster.conf.d from config (rw)
/tmp from tmp (rw)
/var/lib/mysql from datadir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from percona-xtradb-cluster-operator-workload-token-rh5g6 (ro)
When I look at the /etc/mysql/mysql-users-secret directory in the pxc pod, I see the xtrabackup symlink and the mounted secret file, which is readable by all users.
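This is what I used to look at it (again assuming the container is named pxc):
kubectl -n percona-database exec cluster1-pxc-0 -c pxc -- ls -la /etc/mysql/mysql-users-secret/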
I fail to understand the nature of the error message that occurs in the backup pods. If someone could advise on its interpretation, that would be great.
I have the same issues on a freshly deployed 1.8 cluster:
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster2-pxc-2.cluster2-pxc.pxc.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
2021/05/02 11:59:56 Peer finder exiting
+ echo '[ERROR] Cannot find node for backup'
+ exit 1
[ERROR] Cannot find node for backup
@Jonathan_Dietrich As you can see from the log, you have three PXC nodes and two of them are in the "Donor" state (you have backups in the running state). A donor cannot be chosen for backup. The first node, cluster1-pxc-0, is in the "Synced" state, but it is the first "primary" pod which accepts the writes, so it also can't be used for backup.
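You can also check this directly on each node, for example by running something like this inside the pxc container (adjust user and credentials to your setup):
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"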
Hello Slava, thank you for your further explanation.
I have a daily backup schedule, but there should be no backups running in parallel, as all the failing pods run consecutively and don't take more than 24 h to fail.
What I can gladly say is that since last Saturday the backup pods have stopped failing, so I actually have working backups now. Sadly, I am not sure which change led to this. I know that last week I rechecked all parts of my S3 setup and realized that the bucket I wanted to back up to did not exist. But after I created it, the backups still failed for at least one more day. Now they work, which is great!
@darxmac Maybe you have the same problem? Please check that the S3 bucket exists and that the S3 credentials are correct; this might be the solution for you as well.
If the nonexistent bucket was in fact the root cause of this issue, maybe a future version could add a log entry that hints at this?
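In case it helps anyone, I now verify the bucket up front with something like this (endpoint and bucket redacted, same as above):
aws s3 ls s3://<bucket> --endpoint-url https://<endpoint>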
Actually, I double-checked my S3 setup and I did have a wrong endpoint URL (I am using S3-compatible Linode storage), and after a while it started working. The failed attempts did leave the cluster in a state where two nodes were listed as Desynced. Is that normal?
@darxmac Let's describe the situation. Say we have three attempts at a backup: the operator tries to make the backup several times if the first attempt fails. It can happen that the first and second attempts fail due to e.g. a wrong endpoint URL (and in the logs of those first backups you can find the real root of the issue). Then, when the operator tries to perform the backup for the third time, the nodes can still be in the "Desynced" state. Usually, after a failed backup, a node changes its state back to "Synced" very quickly. If you want to understand exactly how much time it takes in your case, you need to analyze all pxc and backup logs. But even if two out of three nodes are in the "Desynced" state, the cluster continues to handle traffic, because one node will never be used for backups.
@Jonathan_Dietrich According to your description, we have two different issues with the backups. The first issue was an incorrect S3 bucket configuration; if you check the logs of the first and second backup attempts, you can find this information there. The second issue is the "Desynced" state of the nodes (all the logs you provided were about this one, which is why you could not find anything about the S3 setup being wrong), and it can also be fine if the interval between attempts is not too big, but a node needs some time to become "Synced" again.
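To find the logs of the earlier attempts, you can list the backup objects and then look at the pod of each failed backup job, for example (the resource short name may differ depending on the operator version):
kubectl -n percona-database get pxc-backup
kubectl -n percona-database logs <pod-of-the-failed-backup-job>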
Thank you again for your clarification! Yes, I could not find any logs that said anything about the S3 configuration, but the reason for that might be that I did not look at the logs of all the failed backup pods.
Still, I don't really know why the nodes went into the "Desynced" state and why they came back to "Synced" after I fixed my S3 config. Is there a correlation, or is this just a coincidence?
If you set an incorrect bucket name, the behaviour will be the following: the operator tries to make the backup, then xbcloud tries to put it into the bucket, and you get a message like "xbcloud: Failed to create bucket. Error message: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to". During this time, one of the pxc nodes is in the "Donor/Desynced" state. If you have e.g. two backups in progress (in parallel), you will have two nodes in "Donor/Desynced".