Hello, I am trying to deploy MySQL 8 using the Percona XtraDB Cluster Kubernetes Operator (v1.7.0), but I cannot get it to run properly. After I kubectl apply my cluster-cr.yaml, the first database pod (cluster1-pxc-0) spins up, and with cat /var/lib/mysql/grastate.dat inside the container I can see that safe_to_bootstrap is set to 1. After the first pod is successfully deployed, the second pod (cluster1-pxc-1) spins up and the safe_to_bootstrap value of cluster1-pxc-0 changes to 0. After the second pod is ready, a third pod spins up, and I end up with 3 nodes with safe_to_bootstrap: 0, which to my understanding means the cluster is not functional, as there is no replication between the nodes.
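For reference, this is roughly how I check the file (namespace and pod name are from my setup; I am assuming the PXC container is called pxc):
kubectl -n percona-database exec cluster1-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat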
I have gone through this several times with fresh nodes and volumes, and I always get the same result.
This also shows in the backups, which fail in the same way as described in this thread. In fact, I can also see that all 3 nodes are in the "Donor/Desynced" state.
The grastate.dat file does NOT represent the current status of the cluster, so stop examining this file. It is not updated while the cluster is running. The safe_to_bootstrap flag is set to 0 on startup of any node to prevent accidental bootstraps should this node die and restart. The flag is set to 1 only on the last node to shut down cleanly.
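For illustration, cat /var/lib/mysql/grastate.dat on a running node typically shows something like this (values are placeholders and will differ in your setup):
# GALERA saved state
version: 2.1
uuid: <cluster UUID>
seqno: -1
safe_to_bootstrap: 0
After a clean shutdown of the whole cluster, the last node to stop will have safe_to_bootstrap: 1 and a non-negative seqno.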
So, putting grastate.dat aside, the backup pods are still failing, logging the following:
+ peer-list -on-start=/usr/bin/get-pxc-state -service=cluster1-pxc
2021/04/06 16:20:19 Peer finder enter
2021/04/06 16:20:19 Determined Domain to be percona-database.svc.cluster.local
2021/04/06 16:20:19 Peer list updated
was [ ]
now [cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local]
2021/04/06 16:20:19 execing: /usr/bin/get-pxc-state with stdin: cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local
2021/04/06 16:20:19
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary
2021/04/06 16:20:20 Peer finder exiting
[ERROR] Cannot find node for backup
+ echo '[ERROR] Cannot find node for backup'
+ exit 1
If I understand the thread that I linked in my original post correctly, the failure is caused by all DB instances being in the "Desynced" state (which I mistakenly thought was caused by grastate.dat). Unfortunately, the thread gives no clue how to proceed. This behaviour is reproducible for my specific setup, i.e. I can delete the percona-cluster-cr from my cluster, rebuild the nodes, redeploy the CR, and it will return to this state.
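For completeness, the redeploy cycle I use looks roughly like this (file name from my setup; the volumes are wiped when I rebuild the nodes):
kubectl -n percona-database delete -f cluster-cr.yaml
# rebuild the nodes / recreate the volumes
kubectl -n percona-database apply -f cluster-cr.yaml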
Unfortunately, I'm not well versed in K8s. It looks like there's something missing with your secrets file? Did you create the right user and deploy it beforehand?
The "xtrabackup" entry is in my secrets file, and I can also find the secret mounted at /etc/mysql/mysql-users-secret/xtrabackup in each of the database containers.
Following a suspicion, I thought that maybe the secrets were unreadable for the backup container, but they are mounted rw-r--r--, so any user should be able to read them.
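This is roughly how I verified it (secret name as in my setup; the base64 decode just confirms the key is populated):
kubectl -n percona-database get secret my-cluster-secrets -o jsonpath='{.data.xtrabackup}' | base64 -d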
After upgrading from 1.7.0 to 1.8.0, my problem persists:
All backup pods fail with the following logs:
now [cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local]
2021/04/28 00:01:43 execing: /usr/bin/get-pxc-state with stdin: cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local
cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local
2021/04/28 00:01:43
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-0.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Synced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-1.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster1-pxc-2.cluster1-pxc.percona-database.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
2021/04/28 00:01:44 Peer finder exiting
Now I am not sure where exactly the backup pod looks for the password (i.e. which exact location the logs are referring to), but I can see that the backup pod gets the xtrabackup credentials as an environment variable, not as a file (excerpt from kubectl describe pod):
Environment:
BACKUP_DIR: /backup
PXC_SERVICE: cluster1-pxc
PXC_PASS: <set to the key 'xtrabackup' in secret 'my-cluster-secrets'> Optional: false
ACCESS_KEY_ID: <set to the key 'AWS_ACCESS_KEY_ID' in secret 's3-backup-credentials'> Optional: false
SECRET_ACCESS_KEY: <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 's3-backup-credentials'> Optional: false
DEFAULT_REGION: dbl
ENDPOINT: https://REDACTED
S3_BUCKET: REDACTED
S3_BUCKET_PATH: cluster1-2021-04-28-00:00:34-full
Mounts:
/etc/mysql/ssl from ssl (rw)
/etc/mysql/ssl-internal from ssl-internal (rw)
/etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
/var/run/secrets/kubernetes.io/serviceaccount from percona-xtradb-cluster-operator-token-qzm5p (ro)
So it makes sense to me that the file /etc/mysql/mysql-users-secret/xtrabackup cannot be found in that pod.
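To double-check, I can compare the two directly in the backup pod (the pod name below is a placeholder for whatever kubectl get pods shows for the backup job):
kubectl -n percona-database exec <backup-pod> -- printenv PXC_PASS
kubectl -n percona-database exec <backup-pod> -- ls /etc/mysql/mysql-users-secret/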
When I kubectl describe the pxc pod, I see the xtrabackup password both set in the environment (from the internal-cluster1 secret) and mounted as a file (from the mysql-users-secret-file volume):
Environment:
PXC_SERVICE: cluster1-pxc-unready
MONITOR_HOST: %
MYSQL_ROOT_PASSWORD: <set to the key 'root' in secret 'internal-cluster1'> Optional: false
XTRABACKUP_PASSWORD: <set to the key 'xtrabackup' in secret 'internal-cluster1'> Optional: false
MONITOR_PASSWORD: <set to the key 'monitor' in secret 'internal-cluster1'> Optional: false
LOG_DATA_DIR: /var/lib/mysql
IS_LOGCOLLECTOR: yes
OPERATOR_ADMIN_PASSWORD: <set to the key 'operator' in secret 'internal-cluster1'> Optional: false
Mounts:
/etc/my.cnf.d from auto-config (rw)
/etc/mysql/mysql-users-secret from mysql-users-secret-file (rw)
/etc/mysql/ssl from ssl (rw)
/etc/mysql/ssl-internal from ssl-internal (rw)
/etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
/etc/percona-xtradb-cluster.conf.d from config (rw)
/tmp from tmp (rw)
/var/lib/mysql from datadir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from percona-xtradb-cluster-operator-workload-token-rh5g6 (ro)
When I look at the /etc/mysql/mysql-users-secret directory in the pxc pod, I see the xtrabackup symlink and the mounted secret file, which is readable by all users.
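This is what I used to look at it (again assuming the container is named pxc):
kubectl -n percona-database exec cluster1-pxc-0 -c pxc -- ls -la /etc/mysql/mysql-users-secret/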
I fail to understand the nature of the error message that occurs in the backup pods. If someone could advise on its interpretation, that would be great.
I have the same issues on a freshly deployed 1.8 cluster:
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
cat: /etc/mysql/mysql-users-secret/xtrabackup: No such file or directory
node:cluster2-pxc-2.cluster2-pxc.pxc.svc.cluster.local:wsrep_ready:ON:wsrep_connected:ON:wsrep_local_state_comment:Donor/Desynced:wsrep_cluster_status:Primary:wsrep_cluster_size:3
2021/05/02 11:59:56 Peer finder exiting
+ echo '[ERROR] Cannot find node for backup'
+ exit 1
[ERROR] Cannot find node for backup
@Jonathan_Dietrich As you can see from the log, you have three PXC nodes and two of them are in the "Donor" state (you have backups in the running state). A donor cannot be chosen for backup. The first node, cluster1-pxc-0, is in the "Synced" state, but it is the first "primary" pod which accepts the writes, so it also can't be used for backup.
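You can also check this directly on each node, for example by running something like this inside the pxc container (adjust user and credentials to your setup):
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"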
Hello Slava, thank you for your further explanation.
I have a daily backup schedule, but there should be no backups running in parallel, as all the failing pods run consecutively and don't take more than 24 h to fail.
What I can gladly say is that since last Saturday the backup pods have stopped failing, so I actually have working backups now. Sadly, I am not sure which change led to this. I know that last week I rechecked all parts of my S3 setup and realized that the bucket I wanted to back up to did not exist. But after I created it, the backups still failed for at least one more day. Now they work, which is great!
@darxmac Maybe you have the same problem? Please check that the S3 bucket exists and that the S3 credentials are correct; this might be the solution for you as well.
If the nonexistent bucket was in fact the root cause of this issue, maybe a future version could add a log entry that hints at this?
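In case it helps anyone, I now verify the bucket up front with something like this (endpoint and bucket redacted, same as above):
aws s3 ls s3://<bucket> --endpoint-url https://<endpoint>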
Actually, I double-checked my S3 setup and I did have a wrong endpoint URL (I am using S3-compatible Linode storage), and after a while it started working. The failed attempts did leave the cluster in a state where two nodes were listed as Desynced. Is that normal?
@darxmac Let's describe the situation. Say we have three attempts at a backup: the operator tries to make the backup several times if the first attempt fails. It can happen that the first and second attempts fail due to e.g. a wrong endpoint URL (and in the logs of those first backups you can find the real root of the issue). Then, when the operator tries to perform the backup for the third time, the nodes can still be in the "Desynced" state. Usually, after a failed backup, a node changes its state back to "Synced" very quickly. If you want to understand exactly how much time it takes in your case, you need to analyze all pxc and backup logs. But even if two out of three nodes are in the "Desynced" state, the cluster continues to handle traffic, because one node will never be used for backups.
@Jonathan_Dietrich According to your description, we have two different issues with the backups. The first issue was an incorrect S3 bucket configuration; if you check the logs of the first and second backup attempts, you can find this information there. The second issue is the "Desynced" state of the nodes (all the logs you provided were about this one, which is why you could not find anything about the S3 setup being wrong), and it can also be fine if the interval between attempts is not too big, but a node needs some time to become "Synced" again.
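To find the logs of the earlier attempts, you can list the backup objects and then look at the pod of each failed backup job, for example (the resource short name may differ depending on the operator version):
kubectl -n percona-database get pxc-backup
kubectl -n percona-database logs <pod-of-the-failed-backup-job>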
Thank you again for your clarification! Yes, I could not find any logs that said anything about the S3 configuration, but the reason for that might be that I did not look at the logs of all the failed backup pods.
Still, I don't really know why the nodes went into the "Desynced" state and why they came back to "Synced" after I fixed my S3 config. Is there a correlation, or is this just a coincidence?
If you set an incorrect bucket name, the behaviour will be the following: the operator tries to make the backup, then xbcloud tries to put it into the bucket, and you get a message like "xbcloud: Failed to create bucket. Error message: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to". During this time, one of the pxc nodes is in the "Donor/Desynced" state. If you have e.g. two backups in progress (in parallel), you will have two nodes in "Donor/Desynced".