Hello
GKE: We set up our node pool such that it has nodes in zones a, b, and c. When the MongoDB pods first start, the disks are allocated in the respective zones → PVCs in zones a, b, and c.
But if, for example, zone a is not available and a restart is performed, the whole cluster will not start anymore.
It seems the operator needs all pods to be ready in order to fully start.
Is this intended or a bug?
Regards, John
Hi John, that should not happen as far as I can tell. Can you explain exactly the steps you are following, share your configuration YAML, and tell us what you see in the operator pod logs?
Hello Ivan,
Give me some time - but in the meantime: zone a was not available and the pod for zone a tried to start. Since its disk is located in zone a and was missing, the pod just got stuck. The other pods did not get a chance to start up at all.
Regards, John
Hi @jamoser
When StatefulSets are first created, pods are started in order, so pod-1 requires pod-0 to be up and running with all containers ready before it can start.
If a StatefulSet is updated and requires a rolling restart, the same rule applies but in reverse order: pod-2 is restarted first, and pod-1 requires pod-2 to be ready.
If you use updateStrategy: SmartUpdate in your cr.yaml, the same rule applies but the order is determined by the pod's role in the replica set: secondaries are updated first and the primary last.
A rollout can get stuck because a pod is crashing or pending. In that case a human needs to check the reason for the crash or scheduling problem and fix it; otherwise a faulty config would be applied to all pods and the whole database cluster would become unavailable.
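For reference, a minimal sketch of where that is set in cr.yaml (only the relevant fields are shown; the cluster name is illustrative):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
spec:
  updateStrategy: SmartUpdate  # secondaries are restarted first, the primary last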
Hi @Ege_Gunes
Let's say pod 0 → zone a, pod 1 → zone b, pod 2 → zone c.
If zone a is gone, that also means disk 0 is gone, so pod 0 will not be able to start up. Therefore the whole cluster is not able to start up.
I am not sure, but with the setup below you do get some redundancy; it is just not at all transparent which zones are covered.
parameters:
  type: pd-balanced
  replication-type: regional-pd  # ← where are the disks located?
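As far as I can see, the only way to find out is to query each provisioned disk after the fact (disk name and region below are illustrative):

gcloud compute disks describe <pvc-disk-name> --region=europe-west1 --format="value(replicaZones)"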
Bottom line - as far as I understand - with the current setup it is difficult to build a bulletproof redundant system.
[edit: statefulset … consider this:
  podManagementPolicy: Parallel
… this should allow the StatefulSet to start up even if one or more pods cannot; at least then you could resize the replicas and regain quorum. A sketch follows below.
]
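For illustration, this is the plain Kubernetes field I mean (the operator owns the StatefulSets it creates, so treat this as a sketch of the k8s mechanism, not a supported operator option; names are made up):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-rs0               # hypothetical name, for illustration only
spec:
  serviceName: example-rs0
  replicas: 3
  podManagementPolicy: Parallel   # pods start in parallel, no ordinal ordering
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: mongod
        image: percona/percona-server-mongodb:7.0.14-8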
Hi @jamoser, I have tested the following deployment.
- I created a GKE k8s cluster:
❯ kubectl get nodes -L topology.kubernetes.io/zone
NAME                                       STATUS   ROLES    AGE   VERSION               ZONE
gke-slava-pxc-default-pool-0ea949fc-7jfz   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-0ea949fc-hm8m   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-0ea949fc-wrh8   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-c
gke-slava-pxc-default-pool-5f9e79c2-dchk   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-5f9e79c2-lnmc   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-5f9e79c2-s8lc   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-d
gke-slava-pxc-default-pool-6b2df405-g1s0   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
gke-slava-pxc-default-pool-6b2df405-l0vq   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
gke-slava-pxc-default-pool-6b2df405-m78v   Ready    <none>   18m   v1.33.5-gke.1125000   europe-west1-b
- Created SC:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-fast
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd                   # or pd-balanced for lower cost
  replication-type: regional-pd  # KEY: enables cross-zone replication
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
allowedTopologies:
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - europe-west1-b
    - europe-west1-c
    - europe-west1-d
- Deployed operator and cluster using the following CR:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
spec:
  crVersion: "1.21.0"
  image: percona/percona-server-mongodb:7.0.14-8
  replsets:
  - name: rs0
    size: 3
    # ZONE FAILOVER CONFIGURATION
    affinity:
      advanced:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100  # high priority for zone spreading
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: percona-server-mongodb
                  app.kubernetes.io/instance: my-cluster-name
                  app.kubernetes.io/replset: rs0
              topologyKey: topology.kubernetes.io/zone
    # USE REGIONAL STORAGE
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: regional-fast
        resources:
          requests:
            storage: 10Gi
    # MAINTAIN AVAILABILITY DURING UPDATES
    podDisruptionBudget:
      maxUnavailable: 1
    # RESOURCES
    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
- Removed one zone from the node pool:
gcloud container node-pools update default-pool --cluster=*** --region=europe-west1 --node-locations=europe-west1-b,europe-west1-c
And as I can see, the pod was successfully rescheduled on a different node.
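In case you want to verify the placement yourself, the pods' nodes (and hence zones) can be checked like this, using the instance label from the anti-affinity rule above:

kubectl get pods -o wide -l app.kubernetes.io/instance=my-cluster-name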
Do you have any problems with this scenario?
Hello @Slava_Sarzhan
Yes - this works because every disk is replicated: with regional-pd, GCP keeps a copy of each disk in two zones, so for 3 replicas you effectively pay for 6 disks. But isn't the idea to have 3 pods, each with 1 disk (each in a different zone), and since most likely only 1 zone fails, the others should still be able to start?
Regards, John