Restore of a Percona MongoDB backup is failing for me on OpenShift 4.9

Background

Yesterday morning I took a backup of each of the five PerconaServerMongoDB clusters I have:

percona-mongodb-infra-202208021004           percona-mongodb-infra   s3-eu-west-2   INFRA/2022-08-02T09:04:56Z   ready    6h54m       6h54m
percona-mongodb-dev2-202208021004            percona-mongodb-dev2    s3-eu-west-2   DEV2/2022-08-02T09:05:38Z    ready    6h53m       6h54m
percona-mongodb-qa-202208021004              percona-mongodb-qa      s3-eu-west-2   QA/2022-08-02T09:05:59Z      ready    6h53m       6h54m
percona-mongodb-dev-202208021008             percona-mongodb-dev     s3-eu-west-2   DEV/2022-08-02T09:08:43Z     ready    6h50m       6h50m
percona-mongodb-uat-202208021022             percona-mongodb-uat     s3-eu-west-2   UAT/2022-08-02T09:23:10Z     ready    6h34m       6h36m

I then did an OpenShift cluster upgrade (4.9.32 → 4.9.33) and all five replica sets went into CrashLoopBackOff with the following:

+ exec mongod --bind_ip_all --auth --dbpath=/data/db --port=27017 --replSet=rs0 --storageEngine=wiredTiger --relaxPermChecks --clusterAuthMode=x509 --slowms=100 --profile=1 --rateLimit=100 --enableEncryption --encryptionKeyFile=/etc/mongodb-encryption/encryption-key --encryptionCipherMode=AES256-CBC --wiredTigerCacheSizeGB=0.25 --wiredTigerCollectionBlockCompressor=snappy --wiredTigerJournalCompressor=snappy --wiredTigerIndexPrefixCompression=true --setParameter ttlMonitorSleepSecs=60 --setParameter wiredTigerConcurrentReadTransactions=128 --setParameter wiredTigerConcurrentWriteTransactions=128 --tlsMode preferTLS --tlsCertificateKeyFile /tmp/tls.pem --tlsAllowInvalidCertificates --tlsClusterFile /tmp/tls-internal.pem --tlsCAFile /etc/mongodb-ssl/ca.crt --tlsClusterCAFile /etc/mongodb-ssl-internal/ca.crt
{"t":{"$date":"2022-08-02T15:44:49.878+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2022-08-02T15:44:49.879+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2022-08-02T15:44:49.879+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2022-08-02T15:44:49.880+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":1,"port":27017,"dbPath":"/data/db","architecture":"64-bit","host":"percona-mongodb-dev2-rs0-0"}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"W",  "c":"CONTROL",  "id":20720,   "ctx":"initandlisten","msg":"Available memory is less than system memory","attr":{"availableMemSizeMB":476,"systemMemSizeMB":257600}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.10-11","gitVersion":"17077507fedd996277adb88715c384329950898b","openSSLVersion":"OpenSSL 1.1.1k  FIPS 25 Mar 2021","modules":[],"allocator":"tcmalloc","environment":{"distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Red Hat Enterprise Linux release 8.5 (Ootpa)","version":"Kernel 4.18.0-305.45.1.el8_4.x86_64"}}}
{"t":{"$date":"2022-08-02T15:44:49.881+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"*","port":27017,"tls":{"CAFile":"/etc/mongodb-ssl/ca.crt","allowInvalidCertificates":true,"certificateKeyFile":"/tmp/tls.pem","clusterCAFile":"/etc/mongodb-ssl-internal/ca.crt","clusterFile":"/tmp/tls-internal.pem","mode":"preferTLS"}},"operationProfiling":{"mode":"slowOp","rateLimit":100,"slowOpThresholdMs":100},"replication":{"replSet":"rs0"},"security":{"authorization":"enabled","clusterAuthMode":"x509","enableEncryption":true,"encryptionCipherMode":"AES256-CBC","encryptionKeyFile":"/etc/mongodb-encryption/encryption-key","relaxPermChecks":true},"setParameter":{"ttlMonitorSleepSecs":"60","wiredTigerConcurrentReadTransactions":"128","wiredTigerConcurrentWriteTransactions":"128"},"storage":{"dbPath":"/data/db","engine":"wiredTiger","wiredTiger":{"collectionConfig":{"blockCompressor":"snappy"},"engineConfig":{"cacheSizeGB":0.25,...
{"t":{"$date":"2022-08-02T15:44:49.882+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2022-08-02T15:44:49.882+00:00"},"s":"I",  "c":"STORAGE",  "id":29037,   "ctx":"initandlisten","msg":"Initializing KeyDB with wiredtiger_open config: {cfg}","attr":{"cfg":"create,config_base=false,extensions=[local=(entry=percona_encryption_extension_init,early_load=true,config=(cipher=AES256-CBC,rotation=false))],encryption=(name=percona,keyid=\"\"),log=(enabled,file_max=5MB),transaction_sync=(enabled=true,method=fsync),"}}
[1659455091:766922][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
[1659455091:768969][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
[1659455091:770917][1:0x7f1b3b4e1bc0], file:WiredTiger.wt, connection: __posix_open_file, 808: /data/db/key.db/WiredTiger.wt: handle-open: open: Operation not permitted
{"t":{"$date":"2022-08-02T15:44:51.770+00:00"},"s":"W",  "c":"STORAGE",  "id":29054,   "ctx":"initandlisten","msg":"EncryptionKeyDB: Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."}
{"t":{"$date":"2022-08-02T15:44:51.770+00:00"},"s":"F",  "c":"STORAGE",  "id":29056,   "ctx":"initandlisten","msg":"Terminating.","attr":{"Reason":"1: Operation not permitted"}}
{"t":{"$date":"2022-08-02T15:44:51.771+00:00"},"s":"F",  "c":"-",        "id":23091,   "ctx":"initandlisten","msg":"Fatal assertion","attr":{"msgid":29056,"file":"src/mongo/db/storage/wiredtiger/encryption_keydb.cpp","line":263}}
{"t":{"$date":"2022-08-02T15:44:51.771+00:00"},"s":"F",  "c":"-",        "id":23092,   "ctx":"initandlisten","msg":"\n\n***aborting after fassert() failure\n\n"}

I then tried to restore the infra MongoDB as follows:

  • Pause the PerconaServerMongoDB
  • Recreate the PVC (I first tried just the rs0-0 PVC, then tried recreating all three)
  • Unpause
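For reference, the pause/unpause in the steps above is done by toggling `spec.pause` on the PerconaServerMongoDB custom resource; a minimal sketch (cluster name taken from the backup list above, the rest of the existing spec elided):

```yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: percona-mongodb-infra
  namespace: percona-mongodb
spec:
  pause: true   # set back to false to unpause
  # ... rest of the existing spec unchanged ...
```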

This did not work, so I decided to delete the infra PerconaServerMongoDB replica set entirely and recreate it as follows:

  1. Pause the PerconaServerMongoDB
  2. Delete the PerconaServerMongoDB, its associated secrets and associated PVCs
  3. Create the secret referenced in /spec/secrets/users of the PerconaServerMongoDB
  4. Create the PerconaServerMongoDB and allow the pods to come up
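For step 3, the users secret might look like the sketch below. The key names follow the standard set the Percona operator expects for its system users; the secret name and passwords here are placeholders, and the real name must match whatever /spec/secrets/users references in the CR:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: percona-mongodb-infra-users   # placeholder: must match /spec/secrets/users
  namespace: percona-mongodb
type: Opaque
stringData:
  MONGODB_BACKUP_USER: backup
  MONGODB_BACKUP_PASSWORD: changeme
  MONGODB_CLUSTER_ADMIN_USER: clusterAdmin
  MONGODB_CLUSTER_ADMIN_PASSWORD: changeme
  MONGODB_CLUSTER_MONITOR_USER: clusterMonitor
  MONGODB_CLUSTER_MONITOR_PASSWORD: changeme
  MONGODB_USER_ADMIN_USER: userAdmin
  MONGODB_USER_ADMIN_PASSWORD: changeme
```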

The MongoDB replica set came up fine, so I proceeded to apply the following Restore:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-qa-202208030755
  namespace: percona-mongodb
spec:
  backupName: percona-mongodb-qa-202208021004
  clusterName: percona-mongodb-qa

but after some time I saw the pods were unhealthy:

percona-mongodb-qa-rs0-0                                          1/2     CrashLoopBackOff   10 (99s ago)     42m
percona-mongodb-qa-rs0-1                                          1/2     CrashLoopBackOff   10 (2m43s ago)   42m
percona-mongodb-qa-rs0-2                                          1/2     CrashLoopBackOff   11 (2m51s ago)   42m

I am having to share the logs as gists, as I am not allowed to upload attachments:

pod 0 - mongod - percona-mongodb-qa-rs0-0-mongod.log · GitHub
pod 0 - backupagent - percona-mongodb-qa-rs0-0-backup-agent.log · GitHub
pod 1 - mongod - percona-mongodb-qa-rs0-1-mongod.log · GitHub
pod 1 - backupagent - percona-mongodb-qa-rs0-1-backup-agent.log · GitHub
pod 2 - mongod - percona-mongodb-qa-rs0-2-mongod.log · GitHub
pod 2 - backupagent - percona-mongodb-qa-rs0-2-backup-agent.log · GitHub
operator logs - percona-server-mongodb-operator-964c95cf8-dsct9.log · GitHub

QA PerconaServerMongoDB - percona-qa.yaml · GitHub

The operator logs will be quite messy, I expect, as I have been trying to get all five PerconaServerMongoDB clusters recovered without much luck. I was hoping I could just apply the backup and it would work, so I am likely missing something here. I can only think the restore process depends on some state that is not provided in the resource - some secret or something? That said, I welcome more expert advice :slight_smile:


By chance we spotted a pod going out of memory, so we increased the CPU and memory resources and the restore succeeded. This is good, but I guess the question is why the restore was necessary in the first place… Is this a known issue with PerconaServerMongoDB running on OpenShift v4.9?
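For anyone hitting the same out-of-memory issue during a restore: the CPU/memory bump goes on the replica set in the PerconaServerMongoDB spec. A sketch with illustrative values (what you actually need will depend on your data set):

```yaml
spec:
  replsets:
  - name: rs0
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
```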

I wonder if we are seeing something similar to this: mongodb - MongDB failed to start on OpenShift v3.11 because of failing to read WiredTiger.wt even though the file can be read from Terminal - Stack Overflow

Is there any advice? Do I need to set a SecurityContextConstraint or some special permission for things to work correctly across a cluster upgrade?


I have been doing more investigation and, interestingly, a StatefulSet failing with the WiredTiger.wt issue was fixed by running a chown in each of the three pods:

oc debug pod/percona-mongodb-infra-rs0-2 --as-root
sh-4.4# chown -R 1000840000:1000840000 .

After the cluster upgrade the UID was 1000840000, whereas it was 1001010000 before.
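Before running the chown, a quick way to confirm the mismatch is to list everything under the data path not owned by the UID the pod currently runs as. A small sketch (the function name, path and UID are illustrative; run it from the `oc debug` shell inside the pod):

```shell
#!/bin/sh
# List files whose numeric owner differs from the UID the pod runs as.
# Usage (inside the pod): list_wrong_owner /data/db 1000840000
list_wrong_owner() {
    dbpath="$1"        # data directory, e.g. /data/db
    expected_uid="$2"  # UID the pod now runs as, e.g. 1000840000
    # find -uid matches on the numeric owner; -not inverts the match
    find "$dbpath" -not -uid "$expected_uid"
}
```

If this prints anything, a `chown -R <uid>:<gid> <dbpath>` as above brings the ownership back in line.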

I think this is the root cause. Please can you advise how I can avoid this happening again?


Hey @edoatley ,

could you please share what storage you use?
Usually, issues with WiredTiger.wt appear when there are permission issues at the filesystem/storage level. I would love to try to reproduce this.


Hi @Sergey_Pronin, sorry for the delay in replying - I have been away. We are using OpenShift Data Foundation 4.9.10 for the storage. This is running on bare-metal VMs with attached 6 TB NVMe disks, which OpenShift connects to via the Local Storage Operator version 4.9.0-202208020055.

Let me know if I can provide any other information.
