Description:
We operate MongoDB clusters whose collections reach several hundred GB. Our main challenge was the time it took to add replicas: the initial sync is lengthy, which significantly limited our ability to scale during production incidents.
Solution Implemented: I developed a Custom Resource Definition (CRD) and Kubernetes operator that optimizes the replica addition process by:
- Automatically identifying the primary pod/replica within a given MongoDB cluster
- Creating a Persistent Volume Claim (PVC) snapshot from the primary replica’s data
- Using the snapshot to provision new PVCs with pre-populated data for additional replicas
- Eliminating the need for full initial sync by leveraging existing data
This solution has dramatically reduced our turnaround time (TaT) for production scaling operations, both during incidents and for routine capacity expansion.
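The steps listed above can be sketched roughly as follows. This is only an illustration, not the operator's actual code: it picks the primary from a `replSetGetStatus`-style document and builds the standard Kubernetes `VolumeSnapshot` and `PersistentVolumeClaim` manifests. The helper names, the `mongod-data` claim prefix, the snapshot class and storage class names, and the example hostnames are all assumptions made for the example.

```python
import json


def find_primary(rs_status):
    """Return the member name (host:port) whose stateStr is PRIMARY, or None."""
    for member in rs_status.get("members", []):
        if member.get("stateStr") == "PRIMARY":
            return member["name"]
    return None


def pod_name_from_member(member_name):
    """Assumes the member name is '<pod>.<service>...:<port>' (hypothetical layout)."""
    host = member_name.split(":")[0]
    return host.split(".")[0]


def snapshot_manifest(name, namespace, source_pvc, snapshot_class):
    """VolumeSnapshot (snapshot.storage.k8s.io/v1) taken from an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": source_pvc},
        },
    }


def pvc_from_snapshot(name, namespace, snapshot_name, storage_class, size):
    """PVC pre-populated from a VolumeSnapshot via spec.dataSource."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteOnce"],
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical replSetGetStatus output, trimmed to the fields used here.
    status = {
        "members": [
            {"name": "db-rs0-0.db-rs0.mongo.svc.cluster.local:27017", "stateStr": "SECONDARY"},
            {"name": "db-rs0-1.db-rs0.mongo.svc.cluster.local:27017", "stateStr": "PRIMARY"},
        ]
    }
    primary_pod = pod_name_from_member(find_primary(status))
    # Assumed PVC naming: '<claim template>-<pod name>'.
    source_pvc = "mongod-data-" + primary_pod
    snap = snapshot_manifest("db-rs0-seed", "mongo", source_pvc, "csi-snapclass")
    pvc = pvc_from_snapshot("mongod-data-db-rs0-3", "mongo", "db-rs0-seed", "standard", "500Gi")
    print(json.dumps([snap, pvc], indent=2))
```

The printed manifests could be applied with `kubectl apply -f`, provided the cluster's CSI driver supports volume snapshots. A replica started from such a volume still has to catch up from the oplog for writes made after the snapshot was taken.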
Steps to Reproduce:
- Deploy a MongoDB cluster with a large dataset (collections of 100+ GB)
- Attempt to add a new replica using standard MongoDB replica set scaling
- Monitor initial sync duration and resource consumption
- Compare with snapshot-based replica provisioning using the custom operator
Version:
Percona operator, latest version
Logs:
NA
Expected Result:
Faster scaling of new replicas
Actual Result:
Reduced TaT by roughly 80%
Additional Information:
I would like to know whether I can submit a pull request for the operator and its CRDs. The operator is designed to be standalone and to work on top of a Percona operator deployment; it is not an extension of the existing Percona operator code.
If someone can explain how to submit a PR for a new operator to the percona/everest GitHub repository, I would appreciate it.