Restore of Percona MongoDB backup is failing for me on OpenShift 4.9

Background

Yesterday morning I took a backup of each of the 5 PerconaServerMongoDB clusters I have:

percona-mongodb-infra-202208021004           percona-mongodb-infra   s3-eu-west-2   INFRA/2022-08-02T09:04:56Z   ready    6h54m       6h54m
percona-mongodb-dev2-202208021004            percona-mongodb-dev2    s3-eu-west-2   DEV2/2022-08-02T09:05:38Z    ready    6h53m       6h54m
percona-mongodb-qa-202208021004              percona-mongodb-qa      s3-eu-west-2   QA/2022-08-02T09:05:59Z      ready    6h53m       6h54m
percona-mongodb-dev-202208021008             percona-mongodb-dev     s3-eu-west-2   DEV/2022-08-02T09:08:43Z     ready    6h50m       6h50m
percona-mongodb-uat-202208021022             percona-mongodb-uat     s3-eu-west-2   UAT/2022-08-02T09:23:10Z     ready    6h34m       6h36m

I then did an OpenShift cluster upgrade (4.9.32 → 4.9.33) and all 5 replica sets went into CrashLoopBackOff with the following:

+ exec mongod --bind_ip_all --auth --dbpath=/data/db --port=27017 --replSet=rs0 --storageEngine=wiredTiger --relaxPermChecks --clusterAuthMode=x509 --slowms=100 --profile=1 --rateLimit=100 --enableEncryption --encryptionKeyFile=/etc/mongodb-encryption/encryption-key --encryptionCipherMode=AES256-CBC --wiredTigerCacheSizeGB=0.25 --wiredTigerCollectionBlockCompressor=snappy --wiredTigerJournalCompressor=snappy --wiredTigerIndexPrefixCompression=true --setParameter ttlMonitorSleepSecs=60 --setParameter wiredTigerConcurrentReadTransactions=128 --setParameter wiredTigerConcurrentWriteTransactions=128 --tlsMode preferTLS --tlsCertificateKeyFile /tmp/tls.pem --tlsAllowInvalidCertificates --tlsClusterFile /tmp/tls-internal.pem --tlsCAFile /etc/mongodb-ssl/ca.crt --tlsClusterCAFile /etc/mongodb-ssl-internal/ca.crt
{"t":{"$date":"2022-08-02T15:44:49.878+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2022-08-02T15:44:49.879+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2022-08-02T15:44:49.879+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2022-08-02T15:44:49.880+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":1,"port":27017,"dbPath":"/data/db","architecture":"64-bit","host":"percona-mongodb-dev2-rs0-0"}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"W",  "c":"CONTROL",  "id":20720,   "ctx":"initandlisten","msg":"Available memory is less than system memory","attr":{"availableMemSizeMB":476,"systemMemSizeMB":257600}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.10-11","gitVersion":"17077507fedd996277adb88715c384329950898b","openSSLVersion":"OpenSSL 1.1.1k  FIPS 25 Mar 2021","modules":[],"allocator":"tcmalloc","environment":{"distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Red Hat Enterprise Linux release 8.5 (Ootpa)","version":"Kernel 4.18.0-305.45.1.el8_4.x86_64"}}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"*","port":27017,"tls":{"CAFile":"/etc/mongodb-ssl/ca.crt","allowInvalidCertificates":true,"certificateKeyFile":"/tmp/tls.pem","clusterCAFile":"/etc/mongodb-ssl-internal/ca.crt","clusterFile":"/tmp/tls-internal.pem","mode":"preferTLS"}},"operationProfiling":{"mode":"slowOp","rateLimit":100,"slowOpThresholdMs":100},"replication":{"replSet":"rs0"},"security":{"authorization":"enabled","clusterAuthMode":"x509","enableEncryption":true,"encryptionCipherMode":"AES256-CBC","encryptionKeyFile":"/etc/mongodb-encryption/encryption-key","relaxPermChecks":true},"setParameter":{"ttlMonitorSleepSecs":"60","wiredTigerConcurrentReadTransactions":"128","wiredTigerConcurrentWriteTransactions":"128"},"storage":{"dbPath":"/data/db","engine":"wiredTiger","wiredTiger":{"collectionConfig":{"blockCompressor":"snappy"},"engineConfig":{"cacheSizeGB":0.25,...
{"t":{"$date":"2022-08-02T15:44:49.882+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2022-08-02T15:44:49.882+00:00"},"s":"I",  "c":"STORAGE",  "id":29037,   "ctx":"initandlisten","msg":"Initializing KeyDB with wiredtiger_open config: {cfg}","attr":{"cfg":"create,config_base=false,extensions=[local=(entry=percona_encryption_extension_init,early_load=true,config=(cipher=AES256-CBC,rotation=false))],encryption=(name=percona,keyid=\"\"),log=(enabled,file_max=5MB),transaction_sync=(enabled=true,method=fsync),"}}
[1659455091:766922][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
[1659455091:768969][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
[1659455091:770917][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
{"t":{"$date":"2022-08-02T15:44:51.770+00:00"},"s":"W",  "c":"STORAGE",  "id":29054,   "ctx":"initandlisten","msg":"EncryptionKeyDB: Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."}
{"t":{"$date":"2022-08-02T15:44:51.770+00:00"},"s":"F",  "c":"STORAGE",  "id":29056,   "ctx":"initandlisten","msg":"Terminating.","attr":{"Reason":"1: Operation not permitted"}}
{"t":{"$date":"2022-08-02T15:44:51.771+00:00"},"s":"F",  "c":"-",        "id":23091,   "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":29056,"file":"src/mongo/db/storage/wiredtiger/encryption_keydb.cpp","line":263}}
{"t":{"$date":"2022-08-02T15:44:51.771+00:00"},"s":"F",  "c":"-",        "id":23092,   "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}

I then tried to restore the infra MongoDB as follows:

  • pause the PerconaServerMongoDB
  • recreate the PVC (first tried just the rs0-0 PVC, then tried recreating all 3)
  • unpause (a rough sketch of the commands is below)
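
For reference, a minimal sketch of that pause/delete/unpause cycle (assuming oc patch against the psmdb resource and the operator's default mongod-data PVC naming, so names may differ in your setup):

# sketch only: pause the cluster via spec.pause, drop the PVC, then unpause
oc -n percona-mongodb patch psmdb percona-mongodb-infra --type merge -p '{"spec":{"pause":true}}'

# PVC name assumes the operator's default volumeClaimTemplate naming;
# a fresh PVC is created from the template once the cluster is unpaused
oc -n percona-mongodb delete pvc mongod-data-percona-mongodb-infra-rs0-0

oc -n percona-mongodb patch psmdb percona-mongodb-infra --type merge -p '{"spec":{"pause":false}}'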

This did not work, so I decided to completely delete the infra PerconaServerMongoDB replica set and recreate it as follows:

  1. pause the PerconaServerMongoDB
  2. delete the PerconaServerMongoDB, the associated secrets and the associated PVCs
  3. create the secret referenced in /spec/secrets/users of the PerconaServerMongoDB (a sketch of this secret is below)
  4. create the PerconaServerMongoDB and allow the pods to come up
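
For step 3, a minimal sketch of the users secret (the key names are the standard ones from the operator's documentation; the secret name and values are placeholders for whatever your cr.yaml references):

apiVersion: v1
kind: Secret
metadata:
  name: percona-mongodb-infra-users   # must match /spec/secrets/users in the cr
  namespace: percona-mongodb
type: Opaque
stringData:
  MONGODB_BACKUP_USER: backup
  MONGODB_BACKUP_PASSWORD: <redacted>
  MONGODB_CLUSTER_ADMIN_USER: clusterAdmin
  MONGODB_CLUSTER_ADMIN_PASSWORD: <redacted>
  MONGODB_CLUSTER_MONITOR_USER: clusterMonitor
  MONGODB_CLUSTER_MONITOR_PASSWORD: <redacted>
  MONGODB_USER_ADMIN_USER: userAdmin
  MONGODB_USER_ADMIN_PASSWORD: <redacted>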

The MongoDB replica set came up fine, so I proceeded to apply the following Restore:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-qa-202208030755
  namespace: percona-mongodb
spec:
  backupName: percona-mongodb-qa-202208021004
  clusterName: percona-mongodb-qa
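
Roughly how that gets applied and watched (a sketch, assuming the psmdb-restore short name registered by the operator's CRD):

oc -n percona-mongodb apply -f restore-qa-202208030755.yaml
oc -n percona-mongodb get psmdb-restore restore-qa-202208030755
oc -n percona-mongodb get pods -w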

but after some time I see the pods are unhealthy:

percona-mongodb-qa-rs0-0                                          1/2     CrashLoopBackOff   10 (99s ago)     42m
percona-mongodb-qa-rs0-1                                          1/2     CrashLoopBackOff   10 (2m43s ago)   42m
percona-mongodb-qa-rs0-2                                          1/2     CrashLoopBackOff   11 (2m51s ago)   42m

I have to share the logs as gists, as I am not allowed to upload attachments:

pod 0 - mongod - percona-mongodb-qa-rs0-0-mongod.log · GitHub
pod 0 - backupagent - percona-mongodb-qa-rs0-0-backup-agent.log · GitHub
pod 1 - mongod - percona-mongodb-qa-rs0-1-mongod.log · GitHub
pod 1 - backupagent - percona-mongodb-qa-rs0-1-backup-agent.log · GitHub
pod 2 - mongod - percona-mongodb-qa-rs0-2-mongod.log · GitHub
pod 2 - backupagent - percona-mongodb-qa-rs0-2-backup-agent.log · GitHub
operator logs - percona-server-mongodb-operator-964c95cf8-dsct9.log · GitHub

QA PerconaServerMongoDB - percona-qa.yaml · GitHub

The operator logs will be quite messy, I expect, as I have been trying to get all 5 PerconaServerMongoDB clusters recovered without much luck. I was hoping I could just apply the restore and it would work, so I am likely missing something here. I can only think the restore process depends on some state that is not provided in the resource - some secret or something? That said, I welcome more expert advice :slight_smile:


By chance we spotted a pod go out of memory, so we increased the CPU and memory resources and the restore succeeded. This is good, but I guess the question is why the restore was necessary in the first place… Is this a known issue with PerconaServerMongoDB running on OpenShift v4.9?
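
For anyone hitting the same thing, a minimal sketch of the resources bump in the PerconaServerMongoDB spec (the values here are illustrative, not what we settled on):

spec:
  replsets:
    - name: rs0
      resources:
        requests:
          cpu: "1"
          memory: 2G
        limits:
          cpu: "2"
          memory: 4G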

I wonder if we are seeing something similar to this: mongodb - MongDB failed to start on OpenShift v3.11 because of failing to read WiredTiger.wt even though the file can be read from Terminal - Stack Overflow

Is there any advice here - do I need to set a SecurityContextConstraint or some special permission for this to keep working correctly across a cluster upgrade?


I have been doing more investigation and, interestingly, a StatefulSet failing with the WiredTiger.wt issue was fixed by running a chown in each of the 3 pods:

oc debug pod/percona-mongodb-infra-rs0-2 --as-root
sh-4.4# chown -R 1000840000:1000840000 .

After the cluster upgrade the UID was 1000840000, whereas it was 1001010000 before.
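
The UID range assigned to the namespace can be checked with something like this (assuming the standard openshift.io/sa.scc.* namespace annotations):

oc get namespace percona-mongodb -o yaml | grep 'openshift.io/sa.scc'
# relevant annotations include:
#   openshift.io/sa.scc.uid-range
#   openshift.io/sa.scc.supplemental-groups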

I think this is the root cause. Please can you advise how I can avoid this happening again?
