Slow startup of mongod

During startup, mongod always seems to go through some recovery (?), even if the cluster has been gracefully shut down with pause=true.

At some point during startup …


{"t":{"$date":"2021-07-15T11:07:03.058+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347223:58084][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 15 through 18"}}
{"t":{"$date":"2021-07-15T11:07:03.058+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347223:58568][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 16 through 18"}}
{"t":{"$date":"2021-07-15T11:07:03.060+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347223:60113][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 17 through 18"}}
{"t":{"$date":"2021-07-15T11:07:03.181+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347223:181324][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 18 through 18"}}
{"t":{"$date":"2021-07-15T11:07:09.899+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347229:899390][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 15/256 to 18/256"}}
{"t":{"$date":"2021-07-15T11:07:09.901+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347229:901474][1:0x7fa3f8703040], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 15 through 18"}}
{"t":{"$date":"2021-07-15T11:07:10.095+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:95964][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 16 through 18"}}
{"t":{"$date":"2021-07-15T11:07:10.098+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:97969][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 17 through 18"}}
{"t":{"$date":"2021-07-15T11:07:10.183+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:183817][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 18 through 18"}}
{"t":{"$date":"2021-07-15T11:07:10.253+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:253942][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1626342751, 1)"}}
{"t":{"$date":"2021-07-15T11:07:10.254+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:254031][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1626342751, 1)"}}

MongoDB seems to load >>all<< the data into memory. Is this the Memory Engine loading all the data? And how can one prevent this?

{"t":{"$date":"2021-07-15T11:20:53.485+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626348053:485577][1:0x7fa3f8703040], file:index-91936-5786503028257476284.wt, WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] Checkpoint has been running for 92 seconds and wrote: 5000 pages (2 MB)"}}
{"t":{"$date":"2021-07-15T11:20:57.476+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626348057:476567][1:0x7fa3f8703040], file:index-83900-5786503028257476284.wt, WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] Checkpoint has been running for 96 seconds and wrote: 10000 pages (4 MB)"}}

Thanks
John

Hi John.

The ‘log’ being recovered is the WiredTiger transaction log a.k.a. the journal files.

These are always supposed to be recovered; it’s the normal way that updates since the last checkpoint are saved to disk so they can be restored after a restart.

So that’s normal, and the amount of time being taken is not abnormally slow, though I’d agree it is a bit slow. (I see a second or three for each log file in the server log sample above.) Each log file is <= 100MB, and you’d hope each one is done in less than a second, though no doubt there can be edge cases.
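
If you want a feel for how much is being replayed, a rough sketch (assuming the default dbPath of /data/db inside the container; adjust the path for your deployment) is to look at how many log files are in the journal directory and how big they are:

ls -lh /data/db/journal/
# WiredTigerLog.* files, typically pre-allocated up to ~100MB each;
# "Recovering log 15 through 18" at startup means four of these were replayed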

Given the servers being used and the total number of WT log files being replayed after restart, was it abnormally long, do you think?

Akira

Hi Akira

Thanks for the reply, but I think there is a misunderstanding …

{"t":{"$date":"2021-07-15T11:07:10.254+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626347230:254031][1:0x7fa3f8703040], file:sizeStorer.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1626342751, 1)"}}

Here there is a time difference of 13 minutes, and looking at the internals it looks like all the data is loaded into memory during this period.

{"t":{"$date":"2021-07-15T11:20:53.485+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1626348053:485577][1:0x7fa3f8703040], file:index-91936-5786503028257476284.wt, WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] Checkpoint has been running for 92 seconds and wrote: 5000 pages (2 MB)"}}

We have reduced the WT cache to a couple of hundred MB and the Percona Memory Engine cache to zero. But after startup it looks like the memory used is about the same as the size of the data on disk.

So either the WT cache or Percona Memory Engine cache settings are not respected, or there is something else going on (and I don’t think this is the file cache either).
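
For reference, the cache limit mongod is actually applying can be read from a mongo shell on the member (a sketch; stat names as they appear in db.serverStatus()):

// maximum WT cache size mongod is applying, in bytes
db.serverStatus().wiredTiger.cache["maximum bytes configured"]
// how much of it is currently used, for comparison
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]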

=> this behaviour means that if you store, say, 1 TB of data then you also need 1 TB of RAM, which I think is not a good idea.

Regards
John

apiVersion: psmdb.percona.com/v1-7-0
kind: PerconaServerMongoDB
metadata:
  labels:
    klara.ch/module: my-mongodb
  name: my-mongodb-cluster
  namespace: performance
spec:
  allowUnsafeConfigurations: false
  crVersion: 1.7.0
  image: percona/percona-server-mongodb:4.4.3-5
  imagePullPolicy: Always
  mongod:
    net:
      hostPort: 0
      port: 27017
    operationProfiling:
      mode: slowOp
      rateLimit: 100
      slowOpThresholdMs: 100
    security:
      enableEncryption: true
      encryptionCipherMode: AES256-CBC
      encryptionKeySecret: my-mongodb-cluster-mongodb-encryption-key
      redactClientLogData: false
    setParameter:
      ttlMonitorSleepSecs: 60
      wiredTigerConcurrentReadTransactions: 128
      wiredTigerConcurrentWriteTransactions: 128
    storage:
      engine: wiredTiger
      # inMemory:
      #   engineConfig:
      #      inMemorySizeRatio: 0.01
      wiredTiger:
        collectionConfig:
          blockCompressor: snappy
        engineConfig:
          cacheSizeRatio: 0.05
          directoryForIndexes: false
          journalCompressor: snappy
        indexConfig:
          prefixCompression: true
  pause: false
  pmm:
    enabled: false
    image: percona/pmm-client:2.12.0
    serverHost: monitoring-service
  replsets:
  - affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    livenessProbe:
      initialDelaySeconds: 1200
      failureThreshold: 10
    arbiter:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      enabled: false
      size: 1
    expose:
      enabled: false
      exposeType: LoadBalancer
    name: rs0
    podDisruptionBudget:
      maxUnavailable: 1
    resources:
      limits:
        cpu: "7"
        memory: 9Gi
      requests:
        cpu: 100m
        memory: 9Gi
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 100Gi
  secrets:
    users: my-mongodb-cluster-secrets
  sharding:
    configsvrReplSet:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "4"
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 1Gi
      sidecars:
      - args:
        - -c
        - while true; do echo echo $(date -u) 'test' >> /dev/null; sleep 5;done
        command:
        - /bin/sh
        image: busybox
        name: rs-sidecar-1
      size: 3
      volumeSpec:
        persistentVolumeClaim:
          resources:
            requests:
              storage: 100Gi
    enabled: true
    mongos:
      affinity:
        antiAffinityTopologyKey: kubernetes.io/hostname
      expose:
        exposeType: ClusterIP
      podDisruptionBudget:
        maxUnavailable: 1
      resources:
        limits:
          cpu: "4"
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 2Gi
      size: 3
  updateStrategy: SmartUpdate

Hi John.

Yes, there was a misunderstanding. Because the log in the first comment was separated into two sections by a paragraph break, I thought they were not from the same restart.

The line with timestamp “2021-07-15T11:20:53.485+00:00” is another message bubbling up from the WT storage engine, but it’s a different event from the ones where the recovery of each WT transaction log file was being logged. It is a checkpoint, and one that is taking a very long time.

If the shutdown happened most of a minute after the last checkpoint and there were a lot of writes in that time, restoring them will take a long time, but in theory you would estimate it to be < (1 min * a slow-down factor caused by bringing data from disk back into the WT cache).

The restore process goes from the WT log → the local.oplog.rs collection, then in the next step from the oplog to the other databases. Those writes will need the affected pages/blocks to be brought into the cache.
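
If you want to see what writes were sitting in the oplog around the restart, you can look at its newest and oldest entries from a mongo shell (just a sketch):

// newest entry in the oplog, i.e. the most recent replicated write
db.getSiblingDB("local").oplog.rs.find().sort({$natural: -1}).limit(1)
// oldest entry still retained
db.getSiblingDB("local").oplog.rs.find().sort({$natural: 1}).limit(1)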

We have reduced the WT cache and Percona Memory Engine cache to a couple of hundred MBs resp. to zero

Oh, I see. wiredTiger.engineConfig.cacheSizeRatio = 0.05 instead of the default 0.5. Ouch ouch ouch. The recovery process needs to bring Btree pages from disk into the cache to apply the transaction log (a.k.a. journal) updates to the documents. If the cache size isn’t ample enough to do that, the cache eviction process will kick in to keep the data held within the cache size limit. This could be very slow … and it seems you’ve experienced that.
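
For example, relative to the cr.yaml you posted, going back towards the default would only change this fragment (a sketch; 0.05 of the 9Gi pod limit leaves only a few hundred MB of WT cache, if the usual (RAM - 1GB) x ratio sizing applies):

  mongod:
    storage:
      wiredTiger:
        engineConfig:
          cacheSizeRatio: 0.5   # default ratio; was 0.05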

Hi Akira

We are currently “testing” on an offline installation, so there is zero traffic, no writes, nothing whatsoever.

The setup is like this:

  • have the system idle for a couple of hours
  • kubectl apply -f cr.yaml with pause=true
  • the cluster is down
  • kubectl apply -f cr.yaml with pause=false
  • then the cluster starts
my-mongodb-cluster-cfg-0                          38m          204Mi           
my-mongodb-cluster-cfg-1                          80m          217Mi           
my-mongodb-cluster-cfg-2                          53m          200Mi           
my-mongodb-cluster-mongos-74ff77cfc-6sjlb         75m          47Mi            
my-mongodb-cluster-mongos-74ff77cfc-8h8qx         70m          47Mi            
my-mongodb-cluster-mongos-74ff77cfc-wcpvt         81m          47Mi            
my-mongodb-cluster-rs0-0                          174m         xxxMi          

We see that rs0-0 starts and very quickly reaches the point described in my comment above. Your statement should not apply, because no writes had been done since the previous startup.

Then for approx. 15 min there is no additional log output, and the memory usage goes from approx.

my-mongodb-cluster-rs0-0                          174m         300Mi

to approx

my-mongodb-cluster-rs0-0                          174m         8000Mi

  1. As I understand it, WT opens each collection and index file at startup; here we have about 130’000 of them. Could this be the reason for the long delay? Is there a way for WT to check them only when they are needed (by a query), kind of like “lazy loading”, and does the operator support this? (A way to confirm that file count is sketched after this list.)

  2. Further, what I don’t understand is what happens with these files / file handles? Looking at the memory consumption, it seems they are kept? Is there a way to avoid that? This seems not to be WT related.

  3. As I understand it, the file buffer (or whatever is used to cache the files/file handles) should respect the limits.memory of the pod. But it looks like it does not? So is the whole operator setup appropriate for Kubernetes? (BTW, the same happened to us with Postgres + a Linux base image … do not go for cgroups.)
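
To item 1, this is roughly how the file count can be confirmed inside the pod (assuming the container is named mongod and the data is under the default /data/db):

kubectl exec my-mongodb-cluster-rs0-0 -c mongod -- \
  sh -c 'find /data/db -name "*.wt" | wc -l'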

Regards
John

Hi John.

This has become a ticket with too many topics.

So there is zero traffic, no writes and whatsoever.

WT transaction log files are only filled with markers for checkpoints (a trivial size, maybe a couple of KB I think) and then with writes. So there were some sort of writes being recovered when the “Recovering log X through Y” log lines appeared after the restart. The WT log files are at journal/WiredTigerLog.xxxxxxxxxx, where “xxx…” is the zero-padded log number, e.g. “Recovering log 18 through 18” would be journal/WiredTigerLog.0000000018. As mentioned I expect them to be close to 100MB in size, so three such numbers being recovered implies > 100MB, four implies > 200MB, five > 300MB, etc.

If you can get the files (not sure how to do that inside the K8s operator, I don’t use it myself), you can take a look at what is in them using “wt printlog”. The “wt” tool is not simple to use, but it is the only way to confirm for yourself what’s there. I did some tutorials on how to build and use it recently: https://www.percona.com/blog/author/akira-kurogane/
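
For orientation, on a copy of the dbPath the invocation looks something like this (a sketch only; the log config has to match how mongod ran, e.g. journalCompressor: snappy from your cr.yaml, and with enableEncryption: true you would additionally need the matching encryption extension and key, which I’m not covering here):

wt -h /path/to/copy/of/dbPath \
   -C "log=(compressor=snappy,path=journal)" \
   printlog > printlog.txt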

  1. As I understand it, WT opens each collection and index file at startup; here we have about 130’000 of them. Could this be the reason for the long delay? Is there a way for WT to check them only when they are needed (by a query), kind of like “lazy loading”, and does the operator support this?

I don’t know exactly when it will access each file, though I think it doesn’t wait very long; say at the first checkpoint, if not earlier. When there are no user record accesses it will only read the metadata at the start of each file, and I guess maybe the root page too. So only several 4 KB blocks per file.
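
One thing you could watch during that quiet ~15 minutes is how many files (data handles) WT has actually opened so far, e.g. from a mongo shell (a sketch; stat names as they appear in db.serverStatus()):

// number of table files WT currently holds open
db.serverStatus().wiredTiger.connection["files currently open"]
// data handles (roughly one per open table/index) currently active
db.serverStatus().wiredTiger["data-handle"]["connection data handles currently active"]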

100k+ is a lot. WiredTiger doesn’t perform well at this; they attempted to address it with a ‘million collections’ storage engine change in v3.6. Most of the code went in and stayed in, just a few percent short of being accessible to users, until v4.4; now it’s removed again in v5.0.

Re 3: maybe there is a file-handle limitation, I don’t know.

But there’s too much going on here to identify what is what. If you want to test what WiredTiger does, and where those writes you didn’t believe should be there are coming from, please do it on a more-or-less normal, ample-resource server first, then try the K8s situation after. I don’t have advice for the K8s side.

Hi Akira

Thanks a lot for your support. We had to dig into the distribution until we realised that this is not a Percona issue. We have the same issue with the MongoDB Community Kubernetes Operator as well, so the issue has been moved there.

Thanks & Regards
John

@Akira_Kurogane https://jira.mongodb.org/browse/SERVER-43664

This solves the issue.
