GKE 1.26, Operator 1.14: after cluster deletion and restore (delete/apply), volumes cannot be attached

Hello

I'm not sure, but could it be that there is a problem with GKE 1.26 and Operator 1.14?

→ After a delete/apply the pods did not start, with these events:

Events:
  Type     Reason       Age                  From                                   Message
  ----     ------       ----                 ----                                   -------
  Normal   Scheduled    2m56s                gke.io/optimize-utilization-scheduler  Successfully assigned performance-mongodb-clusters/luz-mongodb00-cluster-rs-0 to gke-performance-default-pool-b4cd7cbc-lm6j
  Warning  FailedMount  53s                  kubelet                                Unable to attach or mount volumes: unmounted volumes=[mongod-data], unattached volumes=[ssl config users-secret-file ssl-internal kube-api-access-fg9wc mongodb00-cluster-mongodb-keyfile mongod-data bin mongodb-cluster-encryption-key]: timed out waiting for the condition
  Warning  FailedMount  48s (x9 over 2m56s)  kubelet                                MountVolume.MountDevice failed for volume "pvc-55608254-6b41-43e1-a591-9b19484bdf63" : rpc error: code = Aborted desc = An operation with the given Volume ID projects/UNSPECIFIED/zones/europe-west6-a/disks/gke-performance--pvc-55608254-6b41-43e1-a591-9b19484bdf63 already exists

=> it's kind of urgent!
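For reference, a minimal way to see what is still holding that disk (the PVC name below is an assumption based on the usual mongod-data-&lt;pod-name&gt; pattern; the PV name comes from the events above):

kubectl get volumeattachments | grep pvc-55608254-6b41-43e1-a591-9b19484bdf63
kubectl describe pvc mongod-data-luz-mongodb00-cluster-rs-0 -n performance-mongodb-clusters
kubectl describe pv pvc-55608254-6b41-43e1-a591-9b19484bdf63

A VolumeAttachment that still points at a node which no longer exists would match the "operation with the given Volume ID … already exists" loop in the events.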

Hi @jamoser!

Do you have the delete-psmdb-pvc finalizer enabled or not?
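For reference, this is roughly where it would sit in the CR if it were enabled (a sketch; the cluster name below is just a placeholder):

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
  finalizers:
    - delete-psmdb-pvc   # with this set, PVCs are deleted together with the cluster

If it is not listed under metadata.finalizers, the PVCs should stay when the CR is deleted.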

Hello

No, but I hope delete-psmdb-pvc is off when it is not mentioned!

The PVCs/PVs are still there …

pvc-55608254-6b41-43e1-a591-9b19484bdf63 already exists

The issue seems to be that it cannot attach the existing PVCs. And it seems to be related to GKE 1.26, because with 1.25 it worked.

Important: the cluster is running in a non-sharded setup.
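To double-check that the old PVs are really still Bound to the expected claims, something like this can be used (a sketch; the PV name is taken from the error above, and claimRef is where the binding is recorded):

kubectl get pvc -n performance-mongodb-clusters
kubectl get pv pvc-55608254-6b41-43e1-a591-9b19484bdf63 -o jsonpath='{.status.phase} {.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'

If the PV reports Bound and claimRef still points at the mongod-data-… claim, the recreated pods should pick up the same disk once it can be attached again.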

I cannot seem to reproduce the issue.

Before delete:

# k get nodes 
NAME                                       STATUS   ROLES    AGE   VERSION
gke-plavi-126-default-pool-673a53d4-8tcn   Ready    <none>   32m   v1.26.10-gke.1038000
gke-plavi-126-default-pool-673a53d4-kt38   Ready    <none>   31m   v1.26.10-gke.1038000
gke-plavi-126-default-pool-673a53d4-tvfb   Ready    <none>   32m   v1.26.10-gke.1038000
# k get pods
NAME                                               READY   STATUS    RESTARTS   AGE
my-cluster-name-rs0-0                              2/2     Running   0          110s
my-cluster-name-rs0-1                              2/2     Running   0          81s
my-cluster-name-rs0-2                              2/2     Running   0          58s
percona-server-mongodb-operator-7b46fb8f97-rt57l   1/1     Running   0          2m27s
# k get pods 
NAME                                               READY   STATUS    RESTARTS   AGE
my-cluster-name-rs0-0                              2/2     Running   0          119s
my-cluster-name-rs0-1                              2/2     Running   0          90s
my-cluster-name-rs0-2                              2/2     Running   0          67s
percona-server-mongodb-operator-7b46fb8f97-rt57l   1/1     Running   0          2m36s
# k get pvc 
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mongod-data-my-cluster-name-rs0-0   Bound    pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            standard-rwo   2m18s
mongod-data-my-cluster-name-rs0-1   Bound    pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            standard-rwo   109s
mongod-data-my-cluster-name-rs0-2   Bound    pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            standard-rwo   86s
# k get pv 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                    STORAGECLASS   REASON   AGE
pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-0   standard-rwo            2m17s
pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-2   standard-rwo            86s
pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-1   standard-rwo            108s

After delete:

# k delete -f cr.yaml
perconaservermongodb.psmdb.percona.com "my-cluster-name" deleted
# k get pods         
NAME                                               READY   STATUS    RESTARTS   AGE
percona-server-mongodb-operator-7b46fb8f97-rt57l   1/1     Running   0          4m3s
# k get pvc  
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mongod-data-my-cluster-name-rs0-0   Bound    pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            standard-rwo   3m33s
mongod-data-my-cluster-name-rs0-1   Bound    pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            standard-rwo   3m4s
mongod-data-my-cluster-name-rs0-2   Bound    pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            standard-rwo   2m41s
# k get pv                            
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                    STORAGECLASS   REASON   AGE
pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-0   standard-rwo            3m34s
pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-2   standard-rwo            2m43s
pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-1   standard-rwo            3m5s

After re-apply:

# k apply -f cr.yaml
perconaservermongodb.psmdb.percona.com/my-cluster-name created
# k get pods
NAME                                               READY   STATUS    RESTARTS   AGE
my-cluster-name-rs0-0                              2/2     Running   0          97s
my-cluster-name-rs0-1                              2/2     Running   0          76s
my-cluster-name-rs0-2                              2/2     Running   0          51s
percona-server-mongodb-operator-7b46fb8f97-rt57l   1/1     Running   0          6m17s
# k get pvc 
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mongod-data-my-cluster-name-rs0-0   Bound    pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            standard-rwo   5m43s
mongod-data-my-cluster-name-rs0-1   Bound    pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            standard-rwo   5m14s
mongod-data-my-cluster-name-rs0-2   Bound    pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            standard-rwo   4m51s
# k get pv 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                    STORAGECLASS   REASON   AGE
pvc-00321f8e-7fbe-41f8-baf8-b9aaca70948e   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-0   standard-rwo            5m41s
pvc-693fd39e-8584-4b8c-ba71-358f4c0234e1   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-2   standard-rwo            4m50s
pvc-e03b82c9-57bb-4209-82b5-f37c5885bd2f   3Gi        RWO            Delete           Bound    test/mongod-data-my-cluster-name-rs0-1   standard-rwo            5m12s

Could you maybe share your cr.yaml, without sensitive data?

Also, by the way, Operator 1.14 was not officially tested with GKE 1.26 (1.25 was the latest tested version), although the run above was done with 1.14 on 1.26.

I remember that it worked with 1.24 and 1.25. But then Google updated the nodes and we got a surprise …

Below is the cr.yaml:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  labels:
    xyz.com/module: my-mongodb
  name: my-mongodb00-cluster
  namespace: performance-mongodb-clusters
spec:
  allowUnsafeConfigurations: false
  backup:
    enabled: false
    image: percona/percona-backup-mongodb:2.0.4
  crVersion: 1.14.0
  image: percona/percona-server-mongodb:4.4.16-16
  imagePullPolicy: Always
  mongod:
    net:
      hostPort: 0
      port: 27017
    operationProfiling:
      mode: slowOp
      rateLimit: 100
      slowOpThresholdMs: 1000
    security:
      enableEncryption: true
      encryptionCipherMode: AES256-CBC
      encryptionKeySecret: my-mongodb-cluster-encryption-key
      redactClientLogData: false
    setParameter:
      ttlMonitorSleepSecs: 60
      wiredTigerConcurrentReadTransactions: 128
      wiredTigerConcurrentWriteTransactions: 128
    storage:
      engine: wiredTiger
      wiredTiger:
        collectionConfig:
          blockCompressor: snappy
        engineConfig:
          cacheSizeRatio: 0.005
          directoryForIndexes: false
          journalCompressor: snappy
        indexConfig:
          prefixCompression: true
  pause: false
  pmm:
    enabled: false
    image: percona/pmm-client:2.35.0
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    configuration: |
      systemLog:
        quiet: true
      storage:
        directoryPerDB: true
        wiredTiger:
          engineConfig:
            configString: "file_manager=(close_idle_time=300,close_scan_interval=60,close_handle_minimum=1000)"
    expose:
      enabled: true
      exposeType: NodePort
      clusterServiceDNSMode: External
    livenessProbe:
      failureThreshold: 40
      initialDelaySeconds: 1800
    name: rs
    tolerations:
    - effect: NoSchedule
      key: mongodb
      operator: Exists
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: 7000m
        memory: 16G
      requests:
        cpu: 10m
        memory: 0.5G
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 32Gi
        storageClassName: my-mongodb-standard
  secrets:
    encryptionKey: my-mongodb-cluster-encryption-key
    users: my-mongodb-cluster-secrets
  sharding:
    enabled: false
  updateStrategy: SmartUpdate

Ok … I had to upgrade to Operator 1.15 and it worked.

It then showed this error:

Multi-Attach error for volume "pvc-xxxxx" Volume is already exclusively attached to one node and can't be attached to another

I'm not sure whether this was the same error with Operator 1.14 or whether 1.15 handled it differently.
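In case it helps others who hit the Multi-Attach error: when the volume is still marked as attached to a node that has already been replaced, one way to unblock it is to remove the stale VolumeAttachment so the attach/detach controller retries (only if the old node is really gone; the attachment name below is a placeholder):

kubectl get volumeattachments -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached
kubectl delete volumeattachment csi-&lt;id-of-the-stale-attachment&gt;

Once the stale attachment is gone, the controller should re-attach the disk to the node where the pod is scheduled.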

Do you have a matrix where one can see which Operator version is supported for which GKE version?

Unfortunately, our docs only show the system requirements for the latest released version: System requirements - Percona Operator for MongoDB

But I have now created a ticket to add a matrix for this for all operators: [CLOUD-819] Create matrix of supported platforms for each operator and tag docs repo - Percona JIRA
In the ticket I added a table for the last 3 versions of the PSMDB Operator (if you need more at the moment, please comment and I will add more).
