I am planning an 8TB migration from MongoDB Atlas (3-node replica set) to a 3-node Percona Server for MongoDB replica set using PCSM v0.7.0. Looking for guidance from anyone who has used PCSM at multi-TB scale, or from the Percona team.
Environment:
- Source: MongoDB Atlas Replica Set, ~8TB
- Target: 3-node PSMDB Replica Set (same major version)
- PCSM host: Dedicated EC2 node, same VPC as target
- Oplog churn: ~6 GB/hr average
- Oplog size: Increasing from 325 GB to 650 GB (projecting a ~96-hour replication window from the 48-hour window observed at 325 GB)
- Cannot set minimum oplog retention on Atlas (requires disk autoscaling, disabled by org policy)
- Post-migration: Plan to shard the PSMDB cluster
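For context, the 96-hour figure above is a straight extrapolation from my own monitoring numbers, nothing more sophisticated than this:

```python
# Rough oplog replication-window estimate (inputs are from my own monitoring).
observed_window_hr = 48    # window observed at the current oplog size
observed_oplog_gb = 325    # current Atlas oplog size
target_oplog_gb = 650      # planned oplog size

# Implied average churn rate, and the projected window at the larger size.
churn_gb_per_hr = observed_oplog_gb / observed_window_hr   # ~6.8 GB/hr
projected_window_hr = target_oplog_gb / churn_gb_per_hr    # ~96 hours

print(f"churn ~{churn_gb_per_hr:.1f} GB/hr, "
      f"projected window ~{projected_window_hr:.0f} h")
```

Note this assumes churn stays flat; a write spike during the clone would shrink the window proportionally.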
Key concerns:
- Scale confidence — Has PCSM been used for multi-TB migrations? The largest tests I found in the codebase are ~2GB collections. I’d like to know if there are known scale limits or internal benchmarks.
- Oplog window risk — Without a minimum retention guarantee on Atlas, the 96-hour window is only an estimate. Does PCSM monitor or warn when the source oplog window is shrinking during the clone? If ChangeStreamHistoryLost occurs, is the only recovery path `pcsm reset` and a full restart?
- Retry limits — I noticed DefaultMaxRetries=3 with exponential backoff totalling ~35 seconds. For a multi-day migration, transient network issues or Atlas maintenance could exceed this. Can these limits be made configurable?
- No mid-collection resume — If a large collection (hundreds of GB) fails partway through clone, it restarts from the beginning. Is segment-level checkpointing on the roadmap?
- Operation timeout — The default 5-minute timeout (PCSM_MONGODB_CLI_OPERATION_TIMEOUT) applies to all operations including index creation on large collections. Is it safe to set this to 60m?
- Undocumented parameters — I found these env vars in the codebase that aren’t on the Percona ClusterSync for MongoDB startup configuration page. Which of them are stable for production use?
- PCSM_REPL_NUM_WORKERS
- PCSM_REPL_CHANGE_STREAM_BATCH_SIZE
- PCSM_REPL_EVENT_QUEUE_SIZE
- PCSM_REPL_WORKER_QUEUE_SIZE
- PCSM_REPL_BULK_OPS_SIZE
- PCSM_CLONE_SEGMENT_SIZE
- PCSM_CLONE_READ_BATCH_SIZE
- PCSM_DEV_TARGET_CLIENT_COMPRESSORS
- PCSM_MONGODB_OPERATION_TIMEOUT
- Connection pool and compression — PCSM strips maxPoolSize and related options from connection strings. With high parallelism, could this become a bottleneck? Also, source-side compression isn’t available (only target-side, via PCSM_DEV_TARGET_CLIENT_COMPRESSORS) — any plans to add it?
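To make the retry-limit concern concrete: my reading of the code is 3 retries with a doubling backoff starting around 5 seconds (base delay and multiplier are my interpretation, not documented), which adds up to only ~35 seconds of tolerance:

```python
# Total outage tolerance under the default retry policy.
# Base delay and doubling multiplier are assumptions from reading the
# code, not documented values.
base_delay_s = 5
max_retries = 3
delays = [base_delay_s * 2**i for i in range(max_retries)]  # [5, 10, 20]
total_s = sum(delays)                                       # 35 seconds
print(f"retry delays: {delays}, total ~{total_s}s")
```

An Atlas rolling maintenance step or primary election can easily take minutes, well beyond that budget.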
Proposed implementation plan:
- Increase the Atlas oplog to 650 GB and wait 4-5 days for the larger retention window to build up
- Run a full PCSM test against a staging cluster first (though at ~4 GB it is not a good indicator of performance at this scale)
- Execute production migration during lowest write activity with active oplog monitoring
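For the “active oplog monitoring” step, I’m planning something along these lines — a sketch that derives the window from the oldest and newest oplog entries, the same way `db.getReplicationInfo()` does (the pymongo connection string is a placeholder):

```python
from datetime import datetime, timezone

def oplog_window_hours(first_ts: datetime, last_ts: datetime) -> float:
    """Replication window = span between oldest and newest oplog entries."""
    return (last_ts - first_ts).total_seconds() / 3600.0

# In practice I'd poll the source periodically with pymongo and alert if
# the window drops below a safety threshold (connection string is a
# placeholder):
#
#   from pymongo import MongoClient
#   oplog = MongoClient("mongodb+srv://...").local["oplog.rs"]
#   first = oplog.find_one(sort=[("ts", 1)])["ts"].as_datetime()
#   last = oplog.find_one(sort=[("ts", -1)])["ts"].as_datetime()
#   print(f"window: {oplog_window_hours(first, last):.1f} h")

# Example with synthetic timestamps:
first = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 3, 0, 0, tzinfo=timezone.utc)
print(oplog_window_hours(first, last))  # 48.0
```

If PCSM already exposes this (or warns on a shrinking window) I’d happily drop the external poller.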
Any recommended PCSM configuration values for this scale would be greatly appreciated.
Thanks in advance.