Multi-zone setup failure if one zone is not available

Hello

GKE: We set up our node pool such that it has nodes in zones a, b and c. When the MongoDB pods first start, the disks are allocated in the respective zones → PVCs in zones a, b and c.

But if, for example, zone a is not available and a restart is performed, the whole cluster will not start anymore.

It seems the operator needs all pods to be ready in order to fully start.

Is this intended, or is it a bug?

Regards, John

Hi John, that should not happen as far as I can tell. Can you explain exactly which steps you are following, share your configuration YAML, and tell us what you see in the operator pod logs?

Hello Ivan,

Give me some time - but in the meantime: zone a was not available and the pod for zone a tried to start. Since its disk is located in zone a and was therefore missing, the pod just got stuck, and the other pods did not have a chance to start up at all.
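
For context, here is a minimal sketch (all names are hypothetical) of the node-affinity constraint a zonal PD-backed PV typically carries; this is what pins the pod to zone a:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example-rs0-0   # hypothetical PV name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/my-project/zones/zone-a/disks/pvc-example   # hypothetical zonal disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.gke.io/zone
              operator: In
              values:
                - zone-a   # the pod can only be scheduled on nodes in this zone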

Regards, John

Hi @jamoser

When StatefulSets are first created, pods are started in order. So pod-1 requires pod-0 to be up and running, with all containers ready, before it can start.

If a StatefulSet is updated and requires a rolling restart, the same rule applies but in reverse order: pod-2 is restarted first, and pod-1 will require pod-2 to be ready.

If you use updateStrategy: SmartUpdate in your cr.yaml, the same rule applies, but the order is determined by the pod’s role in the replica set: secondaries are updated first and the primary last.
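
For reference, a minimal sketch of where that is set in the CR (only the updateStrategy field matters here; the rest is illustrative):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
spec:
  updateStrategy: SmartUpdate   # restart order follows replset roles: secondaries first, primary last
  replsets:
    - name: rs0
      size: 3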

A rollout can get stuck because a pod is crashing or pending. In this case a human needs to check the reason for the crash or scheduling problem and fix it. Otherwise a faulty config would be applied to all pods and the whole database cluster would become unavailable.
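
If a rollout does get stuck, something along these lines (standard kubectl; pod and label names follow the operator's usual <cluster>-<replset>-<ordinal> pattern, so adjust to your deployment) usually shows whether a pod is Pending or crashing and why:

kubectl get pods -l app.kubernetes.io/instance=my-cluster-name -o wide
kubectl describe pod my-cluster-name-rs0-0                 # check Events for scheduling or volume-attach errors
kubectl logs my-cluster-name-rs0-0 -c mongod --previous    # logs of the last crashed container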

Hi @Ege_Gunes

Let's say pod 0 → zone a, pod 1 → zone b, pod 2 → zone c.

If zone a is gone, that means disk 0 is also gone, so pod 0 will not be able to start up. Therefore the whole cluster is not able to start up.

I am not sure, but with the setup below you get some redundancy, yet it is not at all transparent which zones are covered.

parameters:
  type: pd-balanced
  replication-type: regional-pd ← where are the disks located?
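
One way to check this for a provisioned volume (the disk name and region below are hypothetical) is to look up the underlying disk from the PV and describe it; for a regional PD, replicaZones lists the two zones the disk is replicated across:

kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'
gcloud compute disks describe pvc-1234abcd --region=europe-west1 --format='value(replicaZones)'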

Bottom line - as far as I understand - with the current setup it is difficult to have a bulletproof redundant system.

[edit: StatefulSet … consider this:

podManagementPolicy: Parallel

… this should allow the StatefulSet to start up even if one or more pods cannot; at least then you could resize the replicas and get quorum again.
]
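
For illustration, this is the plain Kubernetes setting being referred to (a generic StatefulSet sketch with hypothetical names, not something taken from the operator's own manifests):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-rs0
spec:
  serviceName: example-rs0
  replicas: 3
  podManagementPolicy: Parallel   # start/terminate pods in parallel instead of strict ordinal order
  selector:
    matchLabels:
      app: example-mongo
  template:
    metadata:
      labels:
        app: example-mongo
    spec:
      containers:
        - name: mongod
          image: percona/percona-server-mongodb:7.0.14-8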

Hi @jamoser, I have tested the following deployment.

  1. I have created a GKE k8s cluster:
❯ kubectl get nodes -L topology.kubernetes.io/zone
NAME                                       STATUS   ROLES    AGE   VERSION               ZONE
gke-slava-pxc-default-pool-0ea949fc-7jfz   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-0ea949fc-hm8m   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-0ea949fc-wrh8   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-5f9e79c2-dchk   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-5f9e79c2-lnmc   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-5f9e79c2-s8lc   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-6b2df405-g1s0   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
gke-slava-pxc-default-pool-6b2df405-l0vq   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
gke-slava-pxc-default-pool-6b2df405-m78v   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
  2. Created the StorageClass (SC):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-fast
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd  # or pd-balanced for lower cost
  replication-type: regional-pd  # KEY: Enables cross-zone replication
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.gke.io/zone
        values:
          - europe-west1-b
          - europe-west1-c
          - europe-west1-d
  3. Deployed the operator and cluster using the following CR:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
spec:
  crVersion: "1.21.0"
  image: percona/percona-server-mongodb:7.0.14-8
  
  replsets:
    - name: rs0
      size: 3
      
      # ZONE FAILOVER CONFIGURATION
      affinity:
        advanced:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100  # High priority for zone spreading
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app.kubernetes.io/name: percona-server-mongodb
                      app.kubernetes.io/instance: my-cluster-name
                      app.kubernetes.io/replset: rs0
                  topologyKey: topology.kubernetes.io/zone
      
      # USE REGIONAL STORAGE
      volumeSpec:
        persistentVolumeClaim:
          storageClassName: regional-fast
          resources:
            requests:
              storage: 10Gi
      
      # MAINTAIN AVAILABILITY DURING UPDATES
      podDisruptionBudget:
        maxUnavailable: 1
      
      # RESOURCES
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 1Gi
  4. Removed one zone from the node pool:
gcloud container node-pools update default-pool --cluster=*** --region=europe-west1 --node-locations=europe-west1-b,europe-west1-c

And as I can see, the pod was successfully rescheduled on a different node.
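
For completeness, the rescheduling can be verified with standard kubectl, e.g.:

kubectl get pods -l app.kubernetes.io/instance=my-cluster-name -o wide
kubectl get nodes -L topology.kubernetes.io/zone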

Do you have any problems with this scenario?

Hello @Slava_Sarzhan

Yes - this works because you basically have 3 disks for each pod, so for 3 replicas you have to pay for 9 disks. But isn't the idea that you have 3 pods, each with 1 disk (each in a different zone), and since likely only 1 zone fails, the others should still be able to start?

Regards, John