3-node sharded MongoDB cluster on a budget

Hello,

I would like to have your opinion on the feasibility of running a 3-node, sharded MongoDB cluster on Kubernetes.
The use case is to able run a MongoDB database which has larger storage requirements than a single server (that I have) can meet.
This is for a hobby project so the budget is limited. Thus the 3 node constraint.
I was thinking of the following setup:

  • 3 replicasets, each composed of 2 mongod and 1 arbiter
  • sharding activated, with 3 config server pods
  • 1 mongos on each node

In practice, a deployment would look like something as (Primary / Secondary):

  • node A would run: rs0-0 (P), rs1-arbiter-0, rs2-1 (S), cfg-0, mongos
  • node B would run: rs0-1, rs1-0 (P), rs2-1 (S), r2-arbiter-0, cfg-1, mongos
  • node C would run: rs0-arbiter-0, rs1-1 (S), rs2-0 (P), cfg-2, mongos

Each node would thus host:

  • a primary of a first replicaset;
  • a secondary of a second replicaset;
  • and the arbiter of a third replicaset.
    Let’s say that each node has 1TB disk capacity and documents are perfectly distributed across the shards.
    That would mean that each shard could be as big as 0.5 TB, and the total available space for the MongoDB cluster would be 0.5 x 3 = 1.5 TB.
    With the added benefit of the redundancy, where one of the 3 node may go down without losing all the data.
    Am I correct?

I’ve already experimented with the operator and I’ve managed to get something running (with “allowUnsafeConfigurations=true”).
Although I haven’t succeeded in making Kubernetes schedule the pods exactly as I wanted.
With podAntiAffinity rules, I could ensure that no node host more than 1 pod of each replicaset.
But I couldn’t figure out how to enforce a “perfect” spread, i.e. excluding the possibly of having rs0-X, rs1-X and rs2-X running on a single node.

What do think of this architecture? Will it work? Is it sensible?
Thanks!

Damiano

1 Like

Hello @Damiano_Albani ,

sorry for not coming back to you earlier. Here are my thoughts:

  1. Using one arbiter and two nodes is unsafe. You cannot get write guarantee since MongoDB v 4.0 with such a setup. We treat 4 nodes + 1 arbiter as safe setup. That is why you need to set unsafe flag to true.

  2. Getting such a spread is kinda tricky.
    2.1 there is still a chance that your pods will be reshuffled in a case of node failure. (I hope you will set preferredDuringSchedulingIgnoredDuringExecution flag).
    2.2 To achieve this distribution you will need to manually set affinity and labels on the nodes and Pods. This is tricky and not recommended. In Operators we set affinity rules per deployment or statefulSet and this provides safe distribution.

1 Like

Hi @spronin, thanks for your feedback.

  1. My original idea to use 2 nodes and 1 arbiter came from https://docs.mongodb.com/manual/core/replica-set-architecture-three-members/ and https://docs.mongodb.com/manual/core/replica-set-arbiter/.
    Although this setup is by no means production grade, I’m curious what is the basis of your statement that “you cannot get write guarantee since MongoDB v 4.0 with such a setup”.
  2. Such a spread is indeed tricky, and in the end I did what you suggested, i.e. added a label on the nodes.
    In case someone on the interweb shall ever find it interesting, I paste here a snippet of my Helmfile based setup:
releases:
  - name: mongodb-sharded
    chart: percona-1/psmdb-db
    version: 1.10.0
    values:
      - allowUnsafeConfigurations: true
        replsets:
          {{ range $_, $rs := (list "rs0" "rs1" "rs2") }}
          - name: {{ $rs }}
            size: 2
            arbiter:
              enabled: true
              size: 1
          {{ end }}
        sharding:
          enabled: true
          mongos:
            size: 3
    jsonPatches:
      - target:
          group: psmdb.percona.com
          version: v1-10-0
          kind: PerconaServerMongoDB
          name: mongodb-sharded-psmdb-db
        patch:
          {{ range $i, $rs := (list "rs0" "rs1" "rs2") }}
          - op: add
            path: /spec/replsets/{{ $i }}/nodeSelector
            value:
              mongodb.project.xyz/{{ $rs }}: mongod
          - op: add
            path: /spec/replsets/{{ $i }}/arbiter/nodeSelector
            value:
              mongodb.project.xyz/{{ $rs }}: arbiter

Where I set such labels on 3 Kubernetes nodes:

  node-a:
    mongodb.project.xyz/rs0: mongod
    mongodb.project.xyz/rs1: mongod
    mongodb.project.xyz/rs2: arbiter
  node-b:
    mongodb.project.xyz/rs0: mongod
    mongodb.project.xyz/rs1: arbiter
    mongodb.project.xyz/rs2: mongod
  node-b:
    mongodb.project.xyz/rs0: arbiter
    mongodb.project.xyz/rs1: mongod
    mongodb.project.xyz/rs2: mongod
1 Like

Hello @Damiano_Albani ,

  1. With 2 nodes and 1 arbiter you run 2 data nodes only. In the case Primary fails and your Secondary becomes new Primary, then while your old PRIMARY is doing an initial or resync, if anything happens during that time to your new PRIMARY, you would be hosed because you would not have a complete full data bearing node available to become the new PRIMARY if necessary. It is quite a race condition-case and a rare one, but can happen.

  2. Might work! But don’t forget to set anti-affinity to preferredDuringSchedulingIgnoredDuringExecution. Otherwise you may end up with degraded cluster till node-X gets back.

1 Like