Unable to Restore PXC - shutdown pods: exceeded wait limit

Hi,

I tried to restore my PXC with a backup. The cluster I want to restore is the one from which the backup was made.

The version of the running Percona XtraDB Cluster Operator is 1.11.0.

The PXC was installed with a Helm chart depending on:

dependencies:
- name: pxc-db 
  version: 1.11.5
  repository: https://percona.github.io/percona-helm-charts

with the following values:

pxc-db:
  finalizers:
    - delete-pxc-pods-in-order
    - delete-proxysql-pvc
    - delete-pxc-pvc
  fullnameOverride: testcluster
  pxc:
    expose:
      enabled: false
    persistence:
      enabled: true
      size: 10Gi
      storageClass: hcloud-volumes
    disableTLS: false
  resources:
    limits:
      memory: 1G
      cpu: 600m
  backup:
    enabled: true
    pitr:
      enabled: true
      storageName: devscr-s3-pitr
      timeBetweenUploads: 60
    storages:
      devscr-s3:
        type: s3
        s3:
          credentialsSecret: s3-backup-creds
          region: ''
          bucket: percona
          endpointUrl: https://$DOMAIN:443
      devscr-s3-pitr:
        type: s3
        s3:
          credentialsSecret: s3-backup-creds
          region: ''
          bucket: percona-pitr
          endpointUrl: https://$DOMAIN:443
  haproxy:
    enabled: true
    size: 2

The backup was created with the following CR:

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: manual-backup-20221114-1345
spec:
  pxcCluster: testcluster
  storageName: devscr-s3

The backup was created without any problems and is also listed in the output of kubectl get pxc-backup:

NAME                                        CLUSTER       STORAGE     DESTINATION                                         STATUS      COMPLETED   AGE
manual-backup-20221110-1550                 testcluster   devscr-s3   s3://percona/testcluster-2022-11-10-14:50:01-full   Succeeded   4d1h        4d1h
manual-backup-20221114-1345                 testcluster   devscr-s3   s3://percona/testcluster-2022-11-14-12:44:36-full   Succeeded   4h3m        4h3m

My next step was triggering the restore with the following:

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: rollback-cluster-to-specific-timestamp
spec:
  pxcCluster: testcluster
  backupName: manual-backup-20221114-1345 

So, I was referencing the backup created some minutes ago.
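
With PITR enabled, a restore can also target a specific point in time rather than just a full backup. A sketch with an illustrative timestamp (the spec.pitr fields follow the operator's restore examples; adjust names to your setup):

```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: rollback-cluster-to-specific-timestamp
spec:
  pxcCluster: testcluster
  backupName: manual-backup-20221114-1345
  pitr:
    type: date                      # restore up to the given date/time
    date: "2022-11-14 13:00:00"     # illustrative timestamp
    backupSource:
      storageName: devscr-s3-pitr   # the binlog (PITR) storage defined above
```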

The object was created successfully. But from here on there are at least two kinds of problems:
1st: Sometimes I can spot the same behavior as reported in Unable to Restore PXC: the restore process hangs and no status is reported - but after killing the operator pod the status changed to ‘stopping cluster’…

2nd: After the status changed to ‘stopping cluster’, nearly nothing happened. There is no message in the operator log providing any hints about the problem. The only information I got is the status in the PXCrestore object:

status:
  comments: 'stop cluster testcluster: shutdown pods: exceeded wait limit'
  state: Failed

I added the operator log, just for information.
percona_mysql_op.log (9.7 KB)

My questions:
What can I do to get more information for debugging?
Can I enable debug log in operator pod?
Is there a mistake in my configuration?
What can cause this behavior?
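
For completeness, the only place I found the failure reason so far is the restore object's status (via kubectl get pxc-restore -o yaml, or kubectl describe pxc-restore). A minimal sketch of pulling it out of a dumped object, using the status from above:

```shell
# Sample of what `kubectl get pxc-restore <name> -o yaml` returns under
# .status for the failing restore in this thread
cat > restore.yaml <<'EOF'
status:
  comments: 'stop cluster testcluster: shutdown pods: exceeded wait limit'
  state: Failed
EOF

# Extract the state and the failure comment
grep -E '^[[:space:]]*(state|comments):' restore.yaml
```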

Thanks in advance for any answer providing additional information.

kind regards

fgo


hey @fgo ,

Just for clarity - I cannot reproduce it.

Can you please share anything specific about your setup? Storage type, size, etc?

Also, can you please show the output of the following when this happens?

kubectl get pxc-restore -o yaml
kubectl get pods


@Sergey_Pronin

Seconding this. Almost the same happens to me.
I am running the PXC cluster on an OpenStack Kubernetes cluster, version 1.23.14.
The nodes use Ubuntu Focal 20.04 (2022-12-14) images.
PXC-DB was installed using the official Helm chart with the following custom values:

pxc:
  clusterSecretName: percona-secrets
  persistence:
    size: 30Gi

haproxy:
  serviceAnnotations:
    service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
  serviceType: "LoadBalancer"

backup:
  storages:
    s3:
      type: s3
      s3:
        credentialsSecret: db-backup-s3-credentials
        endpointUrl: https://s3.fes.cloud.syseleven.net
        region: fes
        bucket: db-backups-dev
  schedule:
  - keep: 7
    name: daily-backup-s3
    schedule: 0 0 * * *
    storageName: s3
  - keep: 3
    name: daily-backup-filesystem
    schedule: 0 0 * * *
    storageName: fs-pvc

The backup is successfully created in S3 storage, but cannot be restored.

Outputs for:
kubectl get pxc-restore -o yaml -n database

apiVersion: v1
items:
- apiVersion: pxc.percona.com/v1
  kind: PerconaXtraDBClusterRestore
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"pxc.percona.com/v1","kind":"PerconaXtraDBClusterRestore","metadata":{"annotations":{},"name":"restore-after-testing","namespace":"database"},"spec":{"backupName":"manual-backup-test","pxcCluster":"cluster-db-pxc-db"}}
    creationTimestamp: "2022-12-22T12:00:50Z"
    generation: 1
    name: restore-after-testing
    namespace: database
    resourceVersion: "4313144"
    uid: 8c74eb71-aba4-4b9e-8d52-dc4b4080ab15
  spec:
    backupName: manual-backup-test
    pxcCluster: cluster-db-pxc-db
  status:
    comments: 'stop cluster cluster-db-pxc-db: shutdown pvc: exceeded wait limit'
    state: Failed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

kubectl get pxc-backup manual-backup-test -o yaml -n database

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pxc.percona.com/v1","kind":"PerconaXtraDBClusterBackup","metadata":{"annotations":{},"name":"manual-backup-test","namespace":"database"},"spec":{"pxcCluster":"cluster-db-pxc-db","storageName":"s3"}}
  creationTimestamp: "2022-12-22T11:58:16Z"
  generation: 1
  name: manual-backup-test
  namespace: database
  resourceVersion: "4310006"
  uid: 5e4628dc-8d08-468f-9bbf-eb68907e7154
spec:
  pxcCluster: cluster-db-pxc-db
  storageName: s3
status:
  completed: "2022-12-22T11:58:54Z"
  destination: s3://db-backups-dev/cluster-db-pxc-db-2022-12-22-11:58:16-full
  image: percona/percona-xtradb-cluster-operator:1.12.0-pxc8.0-backup
  s3:
    bucket: db-backups-dev
    credentialsSecret: db-backup-s3-credentials
    endpointUrl: https://s3.fes.cloud.syseleven.net
    region: fes
  sslInternalSecretName: cluster-db-pxc-db-ssl-internal
  sslSecretName: cluster-db-pxc-db-ssl
  state: Succeeded
  storage_type: s3
  storageName: s3
  vaultSecretName: cluster-db-pxc-db-vault

kubectl get pods -n database

NAME                            READY   STATUS      RESTARTS   AGE
cluster-db-pxc-db-haproxy-0     2/2     Running     0          25m
cluster-db-pxc-db-haproxy-1     2/2     Running     0          24m
cluster-db-pxc-db-haproxy-2     2/2     Running     0          23m
cluster-db-pxc-db-pxc-0         3/3     Running     0          25m
cluster-db-pxc-db-pxc-1         3/3     Running     0          24m
cluster-db-pxc-db-pxc-2         3/3     Running     0          23m
pxc-operator-575bcbcbd5-m8mqr   1/1     Running     0          128m
xb-manual-backup-test-bw49p     0/1     Completed   0          39m

Keep in mind that during the restore process all pxc and haproxy pods were deleted, since the process seemingly completed the ‘Stopping Cluster’ step. While the restore process was stuck at ‘Stopping Cluster’, the cluster itself had the status ‘Paused’.
I’d also like to know if there is a way to gather logs of the restore process during the ‘Stopping Cluster’ step, since the Job for the restore step isn’t created yet.

Is it possible that the Persistent Volume Claims hinder the complete shutdown of the cluster? Some automatic daily backups were already created on fs-pvc.

Thank you.


@Philipp_Malkmus I spent some time today playing with it and I was able to reproduce it. I believe the problem is with persistence.

pxc:
  clusterSecretName: percona-secrets
  persistence:
    size: 30Gi

This configuration does not enable persistence. If you don’t set enabled to true, PXC will use emptyDir. So I assume you don’t have persistence (PVCs). Enable it like this:

pxc:
  clusterSecretName: percona-secrets
  persistence:
    enabled: true
    size: 30Gi

The right way would of course be to use persistence. But if you for some reason really want emptyDir (I’m really not sure you want it for production), I created the following bug: [K8SPXC-1184] Restore fails for cluster with emptyDir - Percona JIRA

Using emptyDir for testing is a valid use case; I’m not sure restore failing with emptyDir is a critical bug, though.
Will using PVCs work for you?
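
A quick way to confirm which mode a cluster ended up in is to list its PVCs (kubectl get pvc -n <namespace>): with persistence enabled the operator creates one datadir claim per pxc pod; with emptyDir there are none. A sketch against a hypothetical listing (claim names follow the operator's datadir-<cluster>-pxc-N pattern):

```shell
# Hypothetical `kubectl get pvc` output for a cluster WITH persistence
# enabled; with emptyDir there would be no datadir-* entries at all
cat > pvc-list.txt <<'EOF'
NAME                              STATUS   VOLUME     CAPACITY   ACCESS MODES
datadir-cluster-db-pxc-db-pxc-0   Bound    pvc-aaaa   30Gi       RWO
datadir-cluster-db-pxc-db-pxc-1   Bound    pvc-bbbb   30Gi       RWO
datadir-cluster-db-pxc-db-pxc-2   Bound    pvc-cccc   30Gi       RWO
EOF

# One datadir claim per pxc pod means persistence is on
grep -c '^datadir-' pvc-list.txt    # prints 3 here
```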

@fgo - just to be clear, it is not the same issue as you have.


You’re right, I forgot to enable persistence.
Thanks a lot, that fixed my problems.
