PerconaXtraDBClusterRestore fails on restart cluster: exceeded wait limit

Description:

Hello Percona Community. We are currently facing the following issue: the restore process fails after 19 minutes with status "Failed" and the comment "restart cluster: exceeded wait limit". Is there any setting responsible for this wait limit? I did not find anything in the docs or on GitHub.
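
For reference, the Failed status and the comment quoted above are what I read off the restore object itself; a minimal sketch using the resources from my setup (the pxc-restore shortname comes from the operator's CRDs):

kubectl get pxc-restore test-restore -n percona-db-cluster -o yaml
# the failure reason is recorded in the object's status (the comments field)
kubectl describe pxc-restore test-restore -n percona-db-cluster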

Steps to Reproduce:

We have a k8s cluster with 2 dedicated nodes for the Percona operator and the XtraDB cluster. The Percona operator is deployed from the percona pxc-operator Helm chart on artifacthub.io, version 1.13.0. The XtraDB cluster is deployed from the percona pxc-db Helm chart on artifacthub.io, version 1.13.0 as well.

Version:

Operator helm chart version - 1.13.0
XtraDB cluster helm chart version - 1.13.0
Operator configuration:

kind: HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
metadata:
  name: pxc-operator
  namespace: flux-system
spec:
  install:
    createNamespace: true
  releaseName: pxc-operator
  targetNamespace: percona-db-cluster
  chart:
    spec:
      chart: pxc-operator
      version: 1.13.0
      sourceRef:
        kind: HelmRepository
        name: percona
        namespace: flux-system
  interval: 1h0m0s
  values:
    tolerations:
      - key: "node.kubernetes.io/role"
        operator: "Equal"
        value: "mysql"
        effect: "NoSchedule"
    nodeSelector:
      node.kubernetes.io/server-usage: xtradb

The XtraDB cluster configuration is as follows:

kind: HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
metadata:
  name: pxc-db-cluster
  namespace: flux-system
spec:
  install:
    createNamespace: true
  releaseName: pxc-db-cluster
  targetNamespace: percona-db-cluster
  chart:
    spec:
      chart: pxc-db
      version: 1.13.0
      sourceRef:
        kind: HelmRepository
        name: percona
        namespace: flux-system
  interval: 1h0m0s
  values:
    backup:
      backoffLimit: 1
      schedule:
        - name: "test-local-backup"
          schedule: "0 0 * * *"
          keep: 1
          storageName: db-backup-pvc
      storages:
        db-backup-pvc:
          type: filesystem
          volume:
            persistentVolumeClaim:
              storageClassName: hcloud-volumes
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 30Gi
          containerSecurityContext:
            privileged: true
          podSecurityContext:
            fsGroup: 1001
            supplementalGroups: [1001, 1002, 1003]
    pmm:
      enabled: true
      serverHost: monitoring-service
    allowUnsafeConfigurations: true
    pxc:
      clusterSecretName: pxc-cluster-secrets
      configuration: |
        [mysqld]
        default-authentication-plugin=mysql_native_password
        [sst]
        xbstream-opts=--decompress
        [xtrabackup]
        compress=lz4
      tolerations:
        - key: "node.kubernetes.io/role"
          operator: "Equal"
          value: "mysql"
          effect: "NoSchedule"
      nodeSelector:
        node.kubernetes.io/server-usage: xtradb
      persistence:
        size: 250Gi
      size: 2
    haproxy:
      size: 2
      tolerations:
        - key: "node.kubernetes.io/role"
          operator: "Equal"
          value: "mysql"
          effect: "NoSchedule"
      nodeSelector:
        node.kubernetes.io/server-usage: xtradb

PerconaXtraDBClusterRestore config:

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: test-restore
  namespace: percona-db-cluster
spec:
  pxcCluster: pxc-db-cluster
  backupName: cron-pxc-db-cluster-db-backup-pvc-2023823900-1h2f7
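
For completeness, this is roughly how the restore is applied and watched; a sketch assuming the manifest above is saved as test-restore.yaml (a hypothetical file name) and using the pxc-backup and pxc-restore shortnames from the operator's CRDs:

# confirm the scheduled backup referenced in backupName exists
kubectl get pxc-backup -n percona-db-cluster

# apply the restore and watch its status transitions
kubectl apply -f test-restore.yaml
kubectl get pxc-restore test-restore -n percona-db-cluster -w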

Logs:

Operator logs when the error happens:
2023-08-23T15:03:32.627Z ERROR Reconciler error {"controller": "pxcrestore-controller", "namespace": "percona-db-cluster", "name": "test-restore", "reconcileID": "f959da10-8e17-4ac6-8d79-8484a0f58334", "error": "restart cluster: exceeded wait limit", "errorVerbose": "exceeded wait limit\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).startCluster\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:380\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nrestart cluster\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:241\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226

Expected Result:

NAME           CLUSTER          STATUS      COMPLETED     AGE
test-restore   pxc-db-cluster   Succeeded   <smth_here>   <smth_here>

Actual Result:

NAME           CLUSTER          STATUS             COMPLETED   AGE
test-restore   pxc-db-cluster   Stopping Cluster               12s
test-restore   pxc-db-cluster   Restoring                      32s
test-restore   pxc-db-cluster   Starting Cluster               8m57s
test-restore   pxc-db-cluster   Failed                         19m

Additional Information:

The main thing is that the cluster data is restored successfully and the pods come back without any errors.
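
For what it's worth, this is roughly how I checked that the cluster came back after the restore was marked Failed (plain kubectl; the pxc shortname comes from the operator's CRDs, and pxc-db-cluster is the cluster name used in the restore spec above):

# the cluster object should report a ready state again
kubectl get pxc pxc-db-cluster -n percona-db-cluster
# all PXC and HAProxy pods should be Running and Ready
kubectl get pods -n percona-db-cluster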

I am also seeing the same issue here

Running kubectl get pxc-restore shows the restore stuck in the "Stopping Cluster" status until it eventually times out:

kubectl get pxc-restore
NAME       CLUSTER      STATUS             COMPLETED   AGE
restore1   test-mysql   Stopping Cluster               21m

Eventually it times out and the restore goes into the Failed status.
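
To see whether the pods actually terminate while the restore sits in "Stopping Cluster", I watch the pods in a second terminal; a plain kubectl sketch where <cluster-namespace> is a placeholder for the namespace the cluster runs in:

# watch pod terminations while the restore reports "Stopping Cluster"
kubectl get pods -n <cluster-namespace> -w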

I have also put the operator into DEBUG logLevel mode, but there is no real extra data in the logs.

If you look at the describe output for the pxc-restore, it only gives a little detail:

  comments: 'stop cluster test-mysql: shutdown pods: exceeded wait limit'
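
For anyone else digging into this, the operator logs usually carry more context than the comments field on the restore object. A rough sketch; <operator-deployment> and <operator-namespace> are placeholders, so check the actual names first:

# find the operator deployment, then tail its logs around the restore window
kubectl get deploy -n <operator-namespace>
kubectl logs deploy/<operator-deployment> -n <operator-namespace> --since=1h | grep -i restore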

Would love some assistance in resolving this.
Thank you

@Stefan_Kolesnikowicz just to confirm:

  1. Is it the same version of the operator as @Dmytro_O shared?
  2. The restore actually finishes and the cluster is healthy (as @Dmytro_O states), but the status of the restore object is Failed anyway, right?

Hi @Sergey_Pronin. Please clarify regarding my initial post: is there any misconfiguration on my side, or any kind of setting to increase the wait limit?