Unable to create new MongoDB cluster from backup (full or PITR)

Hi there, I have Everest 0.9.0-rc1 deployed on DOKS (currently running k8s v1.28.6). I have been able to deploy fresh MongoDB (version 4.4.24-23) clusters using Everest without a problem, and each of them is configured to perform full hourly backups (to DigitalOcean Spaces, in case anyone is curious whether that works). I have seeded each database with a few GB of data, and all seems well with those clusters.

However, when testing the “Create DB from a backup” functionality, I have yet to see any success. The new database does get created and begins to populate from the selected backup. I don’t have the logs handy now, but I did see information about the collections being restored, and everything appeared to be what I expected.

However, the new cluster has remained stuck in the “restoring” state in Everest. I can’t connect to it using the credentials in the associated secret. I do see logs like this frequently (about once every 2 seconds):

{"t":{"$date":"2024-03-27T12:30:33.101+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn2901","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-256","speculative":true,"principalName":"backup","authenticationDatabase":"admin","remote":"10.244.1.21:47552","extraInfo":{}}}
{"t":{"$date":"2024-03-27T12:30:33.109+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn2900","msg":"Interrupted operation as its client disconnected","attr":{"opId":39486}}

Shortly after those sorts of logs, I also see this sort of log:

{"t":{"$date":"2024-03-27T12:30:34.765+00:00"},"s":"I",  "c":"ACCESS",   "id":20436,   "ctx":"conn22","msg":"Checking authorization failed","attr":{"error":{"code":13,"codeName":"Unauthorized","errmsg":"not authorized on admin to execute command { find: \"system.version\", filter: { _id: \"shardIdentity\" }, limit: 1, singleBatch: true, lsid: { id
: UUID(\"29f3cc09-27fe-42a2-9367-9f77ecf2596b\") }, $clusterTime: { clusterTime: Timestamp(1711542630, 1), signature: { hash: BinData(0, 325DE9C866A1A459037C47C190D60A2878879A10), keyId: 7351006663643496454 } }, $db: \"admin\", $readPreference: { mode: \"primaryPreferred\" } }"}}}

I have seen this sort of behavior when attempting to create a new MongoDB cluster from both full and PITR backups. Am I doing something wrong?
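For what it’s worth, this is roughly how I’ve been pulling the credentials out of the secret before attempting the connection. I’m going off the usual PSMDB key names here (MONGODB_USER_ADMIN_USER / MONGODB_USER_ADMIN_PASSWORD), so treat those as an assumption if Everest names them differently:

# Decode the admin credentials from the cluster's users secret (key names assumed)
kubectl get secret everest-secrets-<cluster-name> -n <namespace> -o jsonpath='{.data.MONGODB_USER_ADMIN_USER}' | base64 -d
kubectl get secret everest-secrets-<cluster-name> -n <namespace> -o jsonpath='{.data.MONGODB_USER_ADMIN_PASSWORD}' | base64 -d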

Hi @JayV, thank you for letting us know about this issue; we have started investigating it internally.
To help us better understand what went wrong, can you please share the output of the following commands?

kubectl describe psmdb/<YourDBClusterName> -n <YourNamespace>
kubectl describe psmdb-restore/<YourRestoreObjectName> -n <YourNamespace>

<YourDBClusterName> should be the name of your newly created DB cluster that got stuck, and you can find <YourRestoreObjectName> by running

kubectl get psmdb-restore -n <YourNamespace>

Hi @Diogo_Recharte, thank you for the quick response! Here is the requested output.

kubectl describe psmdb:

Name:         restored-test-mongo-1
Namespace:    cym-test
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDB
Metadata:
  Creation Timestamp:  2024-03-27T11:39:41Z
  Finalizers:
    delete-psmdb-pods-in-order
    delete-psmdb-pvc
  Generation:  1
  Owner References:
    API Version:           everest.percona.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  DatabaseCluster
    Name:                  restored-test-mongo-1
    UID:                   5b24c0a4-0c5b-4182-bcc9-8d7d1c9e1e5d
  Resource Version:        146903313
  UID:                     a04a8a30-f380-4935-a4ad-de0037a95dc9
Spec:
  Allow Unsafe Configurations:  true
  Backup:
    Enabled:  true
    Image:    percona/percona-backup-mongodb:2.3.0
    Pitr:
    Resources:
      Limits:
        Cpu:     300m
        Memory:  1G
  Cr Version:    1.15.0
  Image:         percona/percona-server-mongodb:4.4.24-23
  Multi Cluster:
    Enabled:  false
  Pmm:
    Enabled:  true
    Image:    percona/pmm-client:2
    Resources:
    Server Host:  monitoring-service.pmm.svc
  Replsets:
    Affinity:
      Anti Affinity Topology Key:  none
    Arbiter:
      Enabled:  false
      Resources:
      Size:         0
    Configuration:  
      operationProfiling:
        mode: slowOp

    Expose:
      Enabled:      false
      Expose Type:  ClusterIP
    Name:           rs0
    Nonvoting:
      Enabled:  false
      Resources:
      Size:  0
    Pod Disruption Budget:
      Max Unavailable:  1
    Resources:
      Limits:
        Cpu:     1
        Memory:  2G
    Size:        1
    Volume Spec:
      Persistent Volume Claim:
        Resources:
          Requests:
            Storage:         25G
        Storage Class Name:  do-block-storage-xfs
  Secrets:
    Encryption Key:  restored-test-mongo-1-mongodb-encryption-key
    Users:           everest-secrets-restored-test-mongo-1
  Sharding:
    Balancer:
    Enabled:        false
  Update Strategy:  SmartUpdate
  Upgrade Options:
    Apply:     disabled
    Schedule:  0 4 * * *
Status:
  Conditions:
    Last Transition Time:  2024-03-27T11:39:41Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2024-03-27T11:40:02Z
    Message:               rs0: ready
    Reason:                RSReady
    Status:                True
    Type:                  ready
    Last Transition Time:  2024-03-27T11:40:02Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2024-03-27T11:40:20Z
    Status:                True
    Type:                  ready
    Last Transition Time:  2024-03-27T11:41:10Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2024-03-27T11:41:10Z
    Status:                True
    Type:                  ready
    Last Transition Time:  2024-03-27T12:13:23Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2024-03-27T12:13:33Z
    Message:               rs0: ready
    Reason:                RSReady
    Status:                True
    Type:                  ready
    Last Transition Time:  2024-03-27T12:13:34Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2024-03-27T12:13:38Z
    Status:                True
    Type:                  ready
  Host:                    restored-test-mongo-1-rs0.cym-test.svc.cluster.local
  Mongo Image:             percona/percona-server-mongodb:4.4.24-23
  Mongo Version:           4.4.24-23
  Observed Generation:     1
  Ready:                   1
  Replsets:
    rs0:
      Initialized:  true
      Ready:        1
      Size:         1
      Status:       ready
  Size:             1
  State:            ready
Events:             <none>

kubectl describe psmdb-restore:

Name:         restored-test-mongo-1
Namespace:    cym-test
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2024-03-27T11:40:21Z
  Generation:          1
  Owner References:
    API Version:           everest.percona.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  DatabaseClusterRestore
    Name:                  restored-test-mongo-1
    UID:                   7f4836c5-f4dd-4f34-844c-7e348683418a
  Resource Version:        146872796
  UID:                     7fbc5e7a-85ad-4028-b37a-23fdd6d8dc74
Spec:
  Backup Source:
    Destination:  s3://cymrix-mongo-backup/test-mongo-1/7e958cd9-e593-43aa-b64e-016901090eaf/2024-03-27T07:00:21Z
    s3:
      Bucket:              cymrix-mongo-backup
      Credentials Secret:  cymrix-mongo-backup
      Endpoint URL:        https://nyc3.digitaloceanspaces.com
      Prefix:              test-mongo-1/7e958cd9-e593-43aa-b64e-016901090eaf
      Region:              nyc3
      Server Side Encryption:
  Cluster Name:  restored-test-mongo-1
Status:
  Pbm Name:  2024-03-27T11:40:39.931795054Z
  State:     running
Events:      <none>

Perhaps this is related. I just recently attempted to restore a backup without creating a new MongoDB cluster. This restore most definitely worked: shortly before the scheduled hourly backup, I added some nonsense data to the database, and after initiating the “Restore from a backup” functionality from the Everest UI, the nonsense data were removed (as expected). However, that restore process is still showing as active as well.

Name:         restore-yjx
Namespace:    cym-test
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2024-03-27T13:03:56Z
  Generation:          1
  Owner References:
    API Version:           everest.percona.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  DatabaseClusterRestore
    Name:                  restore-yjx
    UID:                   5445b571-6d06-4daf-b48c-b32d8263b56d
  Resource Version:        146951060
  UID:                     2f51674e-d5c6-4ee5-9709-07cd9915f74e
Spec:
  Backup Name:   cron-test-mongo-1-20240327120000-x4j9x
  Cluster Name:  test-mongo-1
Status:
  Pbm Name:  2024-03-27T13:04:28.714968651Z
  State:     running
Events:      <none>
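In the meantime, I’ve just been keeping an eye on the restore object’s state with a standard kubectl watch, nothing fancy:

kubectl get psmdb-restore -n cym-test -w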

I connected directly to the MongoDB cluster within one of its mongod containers to verify that the nonsense data were removed. I was able to disconnect and reconnect to that cluster without issue both before and after the restore process “succeeded.”
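Concretely, “connected directly” here means exec’ing into the mongod container and using the bundled legacy mongo shell (PSMDB 4.4 still ships it), along these lines (the user and password come from the cluster’s secret, as above):

# Open a shell inside the primary's mongod container and authenticate as the admin user
kubectl exec -it pods/test-mongo-1-rs0-0 -c mongod -n cym-test -- mongo -u <admin-user> -p '<admin-password>' --authenticationDatabase admin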

Thanks for the added detail; it’d be great to also get a glimpse at the primary node’s logs.
Can you show me the logs of the pod with the -0 suffix?

kubectl logs pods/<YourDBClusterName>-rs0-0 -n <YourNamespace>

I know you already shared some logs above, but I’m interested specifically in the ones from the backup-agent container, so maybe the following command will be better:

kubectl logs pods/<YourDBClusterName>-rs0-0 -c backup-agent -n <YourNamespace>
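Also, if you manage to catch a restore while it’s still in flight, streaming the logs with the standard -f flag should capture it from the start:

kubectl logs -f pods/<YourDBClusterName>-rs0-0 -c backup-agent -n <YourNamespace>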

Sorry for the delay; I got caught up in some meetings, hah. For the original case:

# kubectl logs -n cym-test pods/restored-test-mongo-1-rs0-0 -c backup-agent
2024/03/27 12:13:19 [entrypoint] starting `pbm-agent`
2024-03-27T12:13:20.000+0000 I starting PITR routine
2024-03-27T12:13:20.000+0000 I pbm-agent:
Version:   2.3.0
Platform:  linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2024-03-27T12:13:20.000+0000 I node: rs0/restored-test-mongo-1-rs0-0.restored-test-mongo-1-rs0.cym-test.svc.cluster.local:27017
2024-03-27T12:13:20.000+0000 I listening for the commands

I’m having a hard time understanding how you got any data restored at all because those logs don’t show any restoration process going on :confused:

Haha, I had the same thought! I will kick off another restore-to-new-cluster process shortly and try to capture the logs from the start. I have a couple more meetings to get through, so it will be a bit before I can circle back to this.

Thanks Jay, I’d greatly appreciate it if you could capture some more logs with this new test you’ll perform.
In the previous message I asked you for the PBM (backup-agent) container logs of the rs0 pod, but we should also take a look at the other rs pods as well.
Another important log for us to look at is the percona-server-mongodb-operator pod log.
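Assuming the operator is running under its default deployment name (that naming is an assumption on my part, so adjust it for your installation), something like this should fetch it:

kubectl logs deployment/percona-server-mongodb-operator -n <YourNamespace>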

I apologize for the delay; the meetings dominated my day yesterday. I removed the previous attempts to restore a backup into a new cluster, and I am starting over now. Here are the settings I’ve used:

I don’t know if it changes restore behavior, but I’m trying to create a single-node cluster (which is what I did yesterday too).

Oh, and as luck would have it, the new database was created successfully, and no longer indicates that it is “restoring” in the Everest UI. Yay!

I’m going to run a few more tests to see if I can get things back into the same state as they were yesterday. Do you suppose there’s any chance my wimpy k8s cluster sitting around 75% memory usage might have had an impact on the database creation/restoration process?

It is possible, but I can’t be certain without logs. If you happen to come across this issue again, please let us know and share the aforementioned logs; it will help us understand what went wrong.

We have tested this feature extensively on EKS and GKE, but never on DOKS. We will test it there as well to see if we can replicate this behavior.
Since we are in a Beta stage, we are focused on increasing the robustness of Percona Everest, so we really appreciate this kind of report. Thank you.


Thanks Diogo! I do believe we will be transitioning to Everest for our project, even though it’s currently classified as beta quality. So far things have been very smooth in my testing, and I’m excited to see how things improve in the future.

For what it’s worth, since my last post I’ve also been able to do a PITR to a new, similarly specced single-node MongoDB cluster without a problem.

I did notice that the in-place restore I attempted to perform yesterday is now in an error state. The associated MongoDB cluster no longer shows the “Restoring” status in the UI.

Name:         restore-yjx
Namespace:    cym-test
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2024-03-27T13:03:56Z
  Generation:          1
  Owner References:
    API Version:           everest.percona.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  DatabaseClusterRestore
    Name:                  restore-yjx
    UID:                   5445b571-6d06-4daf-b48c-b32d8263b56d
  Resource Version:        147069962
  UID:                     2f51674e-d5c6-4ee5-9709-07cd9915f74e
Spec:
  Backup Name:   cron-test-mongo-1-20240327120000-x4j9x
  Cluster Name:  test-mongo-1
Status:
  Error:     check for concurrent jobs: getting PBM object: create PBM connection to test-mongo-1-rs0-0.test-mongo-1-rs0.cym-test.svc.cluster.local:27017,test-mongo-1-rs0-1.test-mongo-1-rs0.cym-test.svc.cluster.local:27017,test-mongo-1-rs0-2.test-mongo-1-rs0.cym-test.svc.cluster.local:27017: create mongo connection: create mongo client: failed to find CERTIFICATE
  Pbm Name:  2024-03-27T15:10:45.290663089Z
  State:     error
Events:      <none>
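Given that “failed to find CERTIFICATE” message, I also peeked at the TLS secrets for the source cluster. I believe the operator normally generates <cluster>-ssl and <cluster>-ssl-internal secrets, but treat that naming as my assumption:

# List any TLS-related secrets for the source cluster (secret naming assumed)
kubectl get secrets -n cym-test | grep ssl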

I kicked off this restore shortly after creating this thread, so maybe the certificate error here provides some insight into the issues I initially had? Here are some MongoDB operator logs mentioning the restore-yjx restore:

(Too large to inline, and my account is too new to upload a file.)

While I’m at it, I pulled back MongoDB operator logs mentioning the restored-test-mongo-1 job as well. As a heads-up, I attempted multiple restorations (new cluster from PITR as well as new cluster from full backup) that used this same name, so there will likely be logs for distinct jobs in this one:

Hopefully it helps.