Decrease collection size, fastest way?

Hi there,

We have Percona Server for MongoDB v3.6.17-4.0 running on Debian. One of our backends has been adding too much data: the collection is now about 100 GB and growing roughly 5 GB every 2-3 weeks. We realized we have a bunch of documents we could delete; of the roughly 85 million documents, perhaps a quarter can be gone for good.

I am doing a few steps, sketched in the shell snippet below:

  1. Add a compound index to use for the deleteMany.
  2. Run the first deleteMany, which uses the new compound index on 2 fields.
  3. Run another deleteMany, which uses an existing index.
  4. Compact the collection.
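
For reference, a rough mongo-shell sketch of those four steps; the collection name, field names, and filter values are placeholders, not the real schema:

    // Placeholder names: "events" collection, "status" / "createdAt" / "customerId" fields
    // 1. Compound index covering the first delete filter
    db.events.createIndex({ status: 1, createdAt: 1 })
    // 2. First pass, served by the new compound index
    db.events.deleteMany({ status: "archived", createdAt: { $lt: ISODate("2019-01-01") } })
    // 3. Second pass, served by an index that already exists
    db.events.deleteMany({ customerId: { $in: [101, 102, 103] } })
    // 4. Reclaim disk space; on 3.6 compact blocks other operations on the database,
    //    so run it in a maintenance window
    db.runCommand({ compact: "events" })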

It takes a long time, around 20 hours on our ESXi servers, and I was wondering how I can improve it. We have added more RAM to the VM (64 GB), and I am assigning 48 GB to cacheSizeGB. We use the WiredTiger engine, with the journal set to false and checkpointSizeMB set to 500.
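
If it helps, you can confirm from the shell that the cache setting actually took effect; the statistic names below come from serverStatus() on WiredTiger and are reported in bytes:

    // Quick check that the 48 GB cacheSizeGB setting is live
    var cache = db.serverStatus().wiredTiger.cache;
    print("configured: " + (cache["maximum bytes configured"] / 1024 / 1024 / 1024).toFixed(1) + " GB");
    print("in use:     " + (cache["bytes currently in the cache"] / 1024 / 1024 / 1024).toFixed(1) + " GB");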

The original database runs on a replica set, but I have set up this node as a standalone; I figured I might get better results on a single node for this particular task.

I was wondering what else I can do. Add more RAM?

The collection I am referring to already has many indexes, around 45-50, so another test I am considering is the following (a rough sketch of saving and rebuilding the index definitions comes after the list):

  1. Drop all the indexes.
  2. Add the 2 or 3 indexes I need for this specific task.
  3. Run the first deleteMany, which uses the compound index on 2 fields.
  4. Run another deleteMany, which uses an existing index.
  5. Compact the collection.
  6. Re-add all the other indexes we have and our backends utilize.
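
A rough shell sketch of that plan, again with a placeholder "events" collection; note that each index's options (unique, sparse, TTL, partial filter, etc.) have to be carried over when rebuilding:

    // Save the current index definitions (everything except the mandatory _id index)
    var specs = db.events.getIndexes().filter(function (ix) { return ix.name !== "_id_"; });
    db.events.dropIndexes();                            // drops all indexes except _id
    db.events.createIndex({ status: 1, createdAt: 1 }); // index needed by the first deleteMany
    // ... run the two deleteMany passes and the compact here ...
    // Rebuild the original indexes, copying each spec's options
    specs.forEach(function (ix) {
        var opts = {};
        Object.keys(ix).forEach(function (k) {
            if (k !== "key" && k !== "v" && k !== "ns") opts[k] = ix[k];
        });
        db.events.createIndex(ix.key, opts);
    });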

Help me, Obi Wan Kenobi, you’re my only hope.


Hi.

The only low-cost data deletion is when an entire collection is dropped in one go.

Deletion of individual documents is approximately as much work as an update. E.g., deleting 1 million documents will take about as much time and consume about as many hardware resources as updating 1 million documents.

For each deletion:

  • The document’s position in the collection data is looked up
  • The entire document is read
  • For every index on that collection: take the index’s field values from the document and remove the index entry for that doc (see the note just after this list)
  • Release the document’s data address+span and add it to a freelist.
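
Given the 45-50 indexes mentioned earlier in the thread, you can put a number on that per-index cost yourself (placeholder collection name again):

    // With ~45-50 indexes, every single delete also performs ~45-50 index-entry removals
    db.events.getIndexes().length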

The batch-write method (deleteMany) does reduce the number of network requests being made, but to my understanding it doesn’t batch the work above in any way.

Adding an index to find the documents to delete improves the efficiency of the first step, but only the first step. There is still the full document read and the other index-entry deletions to be performed, plus the random-access write to mark the document’s space as free.
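
If you want to confirm that lookup step is at least cheap, an explain of the same predicate (placeholder names again) shows whether the winning plan is an IXSCAN rather than a COLLSCAN:

    // Check the plan the delete filter would use
    var plan = db.events.find({ status: "archived", createdAt: { $lt: ISODate("2019-01-01") } })
                        .explain().queryPlanner.winningPlan;
    printjson(plan);   // expect an IXSCAN stage on the compound index, not COLLSCAN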

Having everything in memory will make it fast, at least until the flushing to disk during checkpoints saturates that capacity.
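
One rough way to watch that pressure while the deletes run, using statistic names from serverStatus() (exact names may vary slightly by version):

    // Dirty data waiting to be checkpointed, versus the configured cache size
    var ss = db.serverStatus();
    var c  = ss.wiredTiger.cache;
    print("dirty bytes:   " + c["tracked dirty bytes in the cache"]);
    print("cache size:    " + c["maximum bytes configured"]);
    print("checkpointing: " + ss.wiredTiger.transaction["transaction checkpoint currently running"]);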

In case you’re wondering: using TTL indexes won’t make this more efficient. The deletes will be issued automatically, but it is still the same work.
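
For context, a TTL index is just an ordinary single-field index with an expiry option, and the background TTL monitor then issues the same kind of deletes (placeholder field name):

    // Documents whose "createdAt" is older than 30 days are removed automatically,
    // but each removal still does the full per-document work described above
    db.events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 })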
