Description:
Hello Percona Community. We are currently facing the following issue: a restore fails after 19 minutes with status "Failed" and the comment restart cluster: exceeded wait limit. Is there any setting responsible for this wait limit? We could not find anything in the docs or on GitHub.
Steps to Reproduce:
We have a k8s cluster with 2 dedicated nodes for the Percona operator and the XtraDB cluster. The Percona operator is deployed from the percona pxc-operator Helm chart (version 1.13.0) from artifacthub.io, and the XtraDB cluster is deployed from the percona pxc-db Helm chart (version 1.13.0) as well.
Version:
Operator Helm chart version - 1.13.0
XtraDB cluster Helm chart version - 1.13.0
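Both HelmReleases reference a Flux HelmRepository source named percona. For completeness, a minimal sketch of that source is below; the URL is our assumption of the standard Percona charts repository:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: percona
  namespace: flux-system
spec:
  interval: 1h0m0s
  # Assumed URL of the Percona Helm charts repository
  url: https://percona.github.io/percona-helm-charts/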
Operator configuration:
kind: HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
metadata:
  name: pxc-operator
  namespace: flux-system
spec:
  install:
    createNamespace: true
  releaseName: pxc-operator
  targetNamespace: percona-db-cluster
  chart:
    spec:
      chart: pxc-operator
      version: 1.13.0
      sourceRef:
        kind: HelmRepository
        name: percona
        namespace: flux-system
  interval: 1h0m0s
  values:
    tolerations:
      - key: "node.kubernetes.io/role"
        operator: "Equal"
        value: "mysql"
        effect: "NoSchedule"
    nodeSelector:
      node.kubernetes.io/server-usage: xtradb
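To confirm the tolerations and nodeSelector take effect, we check that the operator pod lands on one of the dedicated nodes:
kubectl get pods -n percona-db-cluster -o wide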
The XtraDB cluster configuration is as follows:
kind: HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
metadata:
  name: pxc-db-cluster
  namespace: flux-system
spec:
  install:
    createNamespace: true
  releaseName: pxc-db-cluster
  targetNamespace: percona-db-cluster
  chart:
    spec:
      chart: pxc-db
      version: 1.13.0
      sourceRef:
        kind: HelmRepository
        name: percona
        namespace: flux-system
  interval: 1h0m0s
  values:
    backup:
      backoffLimit: 1
      schedule:
        - name: "test-local-backup"
          schedule: "0 0 * * *"
          keep: 1
          storageName: db-backup-pvc
      storages:
        db-backup-pvc:
          type: filesystem
          volume:
            persistentVolumeClaim:
              storageClassName: hcloud-volumes
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 30Gi
          containerSecurityContext:
            privileged: true
          podSecurityContext:
            fsGroup: 1001
            supplementalGroups: [1001, 1002, 1003]
    pmm:
      enabled: true
      serverHost: monitoring-service
    allowUnsafeConfigurations: true
    pxc:
      clusterSecretName: pxc-cluster-secrets
      configuration: |
        [mysqld]
        default-authentication-plugin=mysql_native_password
        [sst]
        xbstream-opts=--decompress
        [xtrabackup]
        compress=lz4
      tolerations:
        - key: "node.kubernetes.io/role"
          operator: "Equal"
          value: "mysql"
          effect: "NoSchedule"
      nodeSelector:
        node.kubernetes.io/server-usage: xtradb
      persistence:
        size: 250Gi
      size: 2
    haproxy:
      size: 2
      tolerations:
        - key: "node.kubernetes.io/role"
          operator: "Equal"
          value: "mysql"
          effect: "NoSchedule"
      nodeSelector:
        node.kubernetes.io/server-usage: xtradb
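Before restoring, we list the available backup objects via the CRD short name (pxc-backup resolves to perconaxtradbclusterbackups):
kubectl get pxc-backup -n percona-db-cluster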
PerconaXtraDBClusterRestore config:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: test-restore
  namespace: percona-db-cluster
spec:
  pxcCluster: pxc-db-cluster
  backupName: cron-pxc-db-cluster-db-backup-pvc-2023823900-1h2f7
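For reference, this is how we apply and watch the restore (assuming the manifest above is saved as test-restore.yaml):
kubectl apply -f test-restore.yaml
kubectl get pxc-restore -n percona-db-cluster --watch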
Logs:
Operator logs when the error happens:
2023-08-23T15:03:32.627Z ERROR Reconciler error {"controller": "pxcrestore-controller", "namespace": "percona-db-cluster", "name": "test-restore", "reconcileID": "f959da10-8e17-4ac6-8d79-8484a0f58334", "error": "restart cluster: exceeded wait limit", "errorVerbose": "exceeded wait limit\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).startCluster\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:380\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:239\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nrestart cluster\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore.(*ReconcilePerconaXtraDBClusterRestore).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxcrestore/controller.go:241\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/controller/controller.go:226
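The stack trace points at startCluster in pkg/controller/pxcrestore/controller.go, so the wait limit looks like a constant in the operator source rather than a documented CR field. One way to check, assuming releases are tagged vX.Y.Z:
git clone --branch v1.13.0 https://github.com/percona/percona-xtradb-cluster-operator
grep -rn "exceeded wait limit" percona-xtradb-cluster-operator/pkg/controller/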
Expected Result:
NAME           CLUSTER          STATUS      COMPLETED     AGE
test-restore   pxc-db-cluster   Succeeded   <smth_here>   <smth_here>
Actual Result:
NAME           CLUSTER          STATUS             COMPLETED   AGE
test-restore   pxc-db-cluster   Stopping Cluster               12s
test-restore   pxc-db-cluster   Restoring                      32s
test-restore   pxc-db-cluster   Starting Cluster               8m57s
test-restore   pxc-db-cluster   Failed                         19m
Additional Information:
The main thing is that the cluster data is restored successfully and the pods come back without any errors; only the restore resource itself ends up in the Failed state.
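To double-check that state, we verify the cluster CR and its pods after the failed restore:
kubectl get pxc -n percona-db-cluster
kubectl get pods -n percona-db-cluster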