PSMDB pod placement while doing scheduled node maintenance

We are running MongoDB clusters with local persistent volumes using PSMDB. Because of these local volumes, DBAs often view specific pods as being tied to specific physical nodes. We can manage this with the standard Kubernetes StatefulSet affinity, nodeSelector, and anti-affinity rules. However, problems arise during maintenance tasks such as moving or removing specific members.

Scenario 1 (moving a single member):

  • We have a 3-member replicaset running on Node-A (pod-0), Node-B (pod-1), and Node-C (pod-2).
  • We need to decommission Node-A.
  • We want to move pod-0 to a new Node-D, triggering an initial sync on the new node, without downtime and without changing the placement of pod-1 and pod-2.

Scenario 2 (node decommissioning):

  • We have a 5-member replica set spread across 5 nodes.
  • We need to remove specific nodes from the cluster (the node holding pod-1 and the node holding pod-3).
  • Since StatefulSets scale down strictly by ordinal index (removing highest numbers first), we cannot simply “scale down” to remove these specific intermediate nodes.

Since the operator uses standard Kubernetes StatefulSets, the affinity rules in the CR apply to all replicas of a StatefulSet uniformly. Is there a recommended workflow to handle these situations?

Currently, our intended solution for the first scenario is to cordon the target node (Node-A), then delete the specific pod (pod-0) and its local PVC. Is this the safest approach we could take, or does the operator expose a mechanism to pin specific members to specific nodes?
For scenario 2 we don’t see a way to do it without unnecessarily shuffling pods and running full replication syncs multiple times.

I’d like to add some technical context to this discussion.

Currently, the Percona Operator groups members by role: all standard members are put into a single StatefulSet, non-voting members into another, and arbiters into a third.

While Kubernetes provides affinity and anti-affinity rules at the StatefulSet level, our specific requirement is to pin a specific member (pod) to a specific Kubernetes node. This would be straightforward if each member were its own StatefulSet, but the current architecture groups them together, which blocks this level of granularity.
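
A hedged sketch of what this looks like in practice (the cluster name "my-cluster" and replset name "rs0" are assumptions, and the field names follow a typical PSMDB cr.yaml): affinity lives under spec.replsets[] and the operator applies it to every pod of that replica set's StatefulSet.

```sh
# Assumed names: cluster "my-cluster", replica set "rs0".
# In the CR, affinity is declared once per replica set, roughly:
#
#   spec:
#     replsets:
#       - name: rs0
#         size: 3
#         affinity:
#           antiAffinityTopologyKey: "kubernetes.io/hostname"
#
# There is no per-pod placement field; you can confirm that the whole replset shares
# one affinity block by inspecting the CR:
kubectl get psmdb my-cluster -o jsonpath='{.spec.replsets[0].affinity}'
```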

Is there a supported way or workaround to achieve node-specific member placement and independent member lifecycle management?

Thanks,

Meftun.

Hi folks,

Sorry, I couldn’t find time to take a look at this yet. I’ll allocate some time this week, most likely on Friday.

The first scenario is simple:

After adding Node-D, you can cordon the node you want to decommission and delete the pod you want to move to the new node. If you are using podAntiAffinity to ensure two replset pods never run on the same worker node, the deleted pod will be scheduled onto the new node. Of course, this assumes you have an equal number of worker nodes and replset pods; if you have more worker nodes than replset pods, there is no guarantee that the pod will land on the new node.
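
A minimal sketch of that workflow with kubectl (pod, PVC, and node names here are assumptions; the PVC name follows the usual <volumeClaimTemplate>-<pod> StatefulSet convention):

```sh
# Assumed names: node "node-a" is being decommissioned, the member to move is
# "my-cluster-rs0-0", and its data PVC is "mongod-data-my-cluster-rs0-0".
kubectl cordon node-a                                          # keep new pods off the old node
kubectl delete pvc mongod-data-my-cluster-rs0-0 --wait=false   # stays Terminating until the pod is gone
kubectl delete pod my-cluster-rs0-0                            # StatefulSet controller recreates the pod
# With podAntiAffinity on kubernetes.io/hostname and only Node-D free, the new pod-0
# lands on Node-D, gets a fresh local PV there, and runs an initial sync.
kubectl get pod my-cluster-rs0-0 -o wide                       # verify the placement
# Depending on your local-volume provisioner, you may also need to clean up the
# released PV left behind on node-a.
```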

The second scenario is not possible with the current architecture:

The StatefulSet controller enforces a strict ordering for scale up and scale down; this is one of the core features of StatefulSets. Permanently removing intermediate members is not possible, and making it possible within a single StatefulSet would break a lot of guarantees we depend on.
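
To make that concrete (the cluster name and CR field path are assumptions based on a typical cr.yaml): shrinking the replset size only ever removes the highest ordinal, so pod-1 and pod-3 cannot be targeted this way.

```sh
# Assumed cluster name "my-cluster"; with spec.replsets[0].size = 5 the pods are
# my-cluster-rs0-0 ... my-cluster-rs0-4.
kubectl patch psmdb my-cluster --type json \
  -p '[{"op": "replace", "path": "/spec/replsets/0/size", "value": 4}]'
# The StatefulSet controller always removes the highest ordinal (my-cluster-rs0-4);
# there is no way to remove pod-1 or pod-3 specifically by scaling.
```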

Scheduling a specific pod to a specific Kubernetes node

I spent quite some time trying to find a workaround for this with a single StatefulSet. It’s not straightforward at all.

After my investigation I see two options:

  1. Implementing a custom scheduler that you can configure with which pod should go to which node.
  2. Implementing a kube-scheduler plugin that filters nodes for each pod according to a configuration (a rough sketch of the kind of mapping both options would consume follows below).
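
Just to illustrate the idea (this is not a PoC, and every name below is made up): both options boil down to the scheduler consuming a simple pod-to-node mapping, for example from a ConfigMap.

```sh
# Purely illustrative: a hypothetical mapping a custom scheduler or scheduler plugin
# could read. Nothing in the operator consumes this today.
kubectl create configmap rs0-placement \
  --from-literal=my-cluster-rs0-0=node-d \
  --from-literal=my-cluster-rs0-1=node-b \
  --from-literal=my-cluster-rs0-2=node-c
```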

I didn’t want to spend time creating a PoC for these options without discussing it with you. Is either of these options something you’d consider? I can create a PoC to showcase how this can be done, but creating something that can confidently run on a production cluster will require non-trivial effort, which needs to be planned on our roadmap, and it’s not solely my call.

Of course, there’s a third option that you already mentioned: running each replset instance as a separate StatefulSet. We’re doing this in our PostgreSQL operator and I agree, it gives a lot of flexibility. However, doing this in the PSMDB operator would require A LOT of engineering effort.

cc: @Slava_Sarzhan @radoslaw.szulgo

Thanks for answering.
Maybe it would be better to run each member (replica set instance) as a separate StatefulSet in the future.