Sharding, "millions" of databases

Hello

It seems sharding is focused mainly on sharded collections. What if I want the shards to host “millions” of small databases? IMHO this should also work, but I can’t find any documentation about this topic (with the Percona distribution).

Thanks & Regards
John

Hi John.

MongoDB uses a file for each collection and each index, so hundreds of thousands of those per shard become a bottleneck at checkpoint time, when the file handles are iterated as a matter of normal processing. It definitely hurts performance.

A long time ago the WiredTiger team worked on a ‘million collections’ project where they added a ‘grouped collection’ API in the KVRecordStore class, etc. Many collections would be put into one file by adding a per-collection prefix to every key. This was added around v3.4 but never fully surfaced in the code, and just the other day I saw it has been removed from the v5.0 code.

So there’s been no good way, and apparently won’t ever be a good way, to have hundreds of thousands of collections and indexes per shard.

But you can have lots of shards so there aren’t too many collections per shard. That works.

By default no collection is sharded. When you create a new database namespace (e.g. tenantABC, tenantDEF, etc.), implicitly as you create its first collection, it will have its primary shard chosen in a round-robin fashion. All the collections of that database namespace will be created on that shard only, and won’t be split across other shards unless you explicitly request that they become sharded collections.
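To make that concrete, here is a minimal mongosh sketch run against a mongos (the “tenantABC” database and the “settings” and “profiles” collection names are made-up examples):

use tenantABC
db.settings.insertOne({ createdAt: new Date() })   // first write implicitly creates the database; a primary shard is picked round-robin
db.profiles.insertOne({ name: "example" })         // further collections in this db land on that same primary shard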

If you want to set explicitly which shard is the primary shard for a db namespace, I suggest running “movePrimary” as soon as you have created the first collection for it.
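For example, assuming a database named tenantABC and a target shard named shard0005 (both placeholder names), you can run it through a mongos in mongosh:

// Pin the primary shard for the tenantABC database to shard0005 (example names).
db.adminCommand({ movePrimary: "tenantABC", to: "shard0005" })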


Hi Akira

Maybe there is a misunderstanding. First of all, we use GKE. The idea is that we have, let’s say, 100 shards, each with a 100GB disk, and 1M databases (each with a handful of smallish collections) distributed over these 100 shards.

My concern is that each collection of these databases would be sharded. I just didn’t see a switch anywhere to avoid/prevent that.

The idea is that each “tenant” would have its own database, and since these databases are distributed over the nodes/disks (each with its own replica set), each shard would only have to serve its own tenants - I hope my conclusion is correct! - thereby reducing data-access friction between the tenants.

By contrast, if all the tenants were in the same collection sharded over many nodes/disks, I would think this would be quite a performance killer.

Thanks & Regards
John

By contrast, if all the tenants were in the same collection sharded over many nodes/disks, I would think this would be quite a performance killer.

No, not so - better hardware economy can be achieved this way for a given SLA on average op latency. But that’s off-topic for this question, and moot given the way WiredTiger evolved anyhow.

My concern is that each collection of these databases would be sharded. I just didn’t see a switch anywhere to avoid/prevent that.

Ah, no need to worry. There is a switch to turn it on, per database and then also for each collection after that. If the switch isn’t on, then all collections stay on whatever shard is the primary shard for a given db namespace.

The ‘switch’ to use sharding is the shardCollection command, and that can only be run after enableSharding for the same db has been run first. These update the config db’s “databases” and “collections” collections to indicate that sharding is enabled and which collections have a shard key set.
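A minimal mongosh sketch of those two steps, with example names (tenantABC, its events collection, and the userId shard key are all assumptions):

sh.enableSharding("tenantABC")                          // marks the db as sharding-enabled in the config db
sh.shardCollection("tenantABC.events", { userId: 1 })   // records a shard key for this one collection only

Any collection in tenantABC that you don’t run shardCollection on stays unsharded on the database’s primary shard.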

Before those commands are used there is still a config.databases entry for the db, but the only thing the mongos nodes will see in it is “partitioned”: false, and “primary”: <some_shard_id>.
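You can see that for yourself from a mongos; for instance, with a placeholder database name and approximate output (the exact fields vary by version):

use config
db.databases.findOne({ _id: "tenantABC" })
// e.g. { "_id" : "tenantABC", "primary" : "shard0002", "partitioned" : false }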