Setup: PCSM 0.8.1, replicating from Atlas (MongoDB 8.0.23) to Percona Server for MongoDB 8.0.21. The initial sync is cloning ~5.7 TB with --clone-segment-size=1GiB and --mongodb-operation-timeout=1h.
At ~50% cloned, Atlas performed a rolling restart that elected a new primary, and the clone stopped immediately: clonedSizeBytes froze and reads, inserts, CPU, and network all dropped to 0. pcsm status nevertheless still shows state: running and Initial Sync: Cloning Data. It never recovered, even more than 1h later (past the operation timeout), and the logs show no error and no retry.
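To make the "running but frozen" condition concrete, here is a minimal sketch of how such a stall could be flagged from periodic status samples. It assumes the samples have already been collected as (timestamp, state, clonedSizeBytes) tuples; how they are obtained (for example by polling pcsm status) is deliberately left out, since I am not certain of the exact output format. The names find_stall and max_idle_seconds are hypothetical and not part of PCSM.

```python
from typing import Iterable, Optional, Tuple

# One status sample: (unix_seconds, state, cloned_size_bytes).
# The state and clonedSizeBytes values are the ones reported in the post;
# how they are collected is an assumption left to the caller.
Sample = Tuple[float, str, int]


def find_stall(samples: Iterable[Sample], max_idle_seconds: float) -> Optional[float]:
    """Return the timestamp of the last byte progress if the clone stayed in
    state 'running' without progress for at least max_idle_seconds.
    Return None if no such stall is found."""
    last_bytes: Optional[int] = None
    last_change: Optional[float] = None
    for ts, state, cloned in samples:
        if state != "running":
            # Only a clone that claims to be running can be "silently" stalled.
            last_bytes, last_change = None, None
            continue
        if last_bytes is None or cloned != last_bytes:
            # First running sample, or progress was made: restart the idle clock.
            last_bytes, last_change = cloned, ts
        elif ts - last_change >= max_idle_seconds:
            return last_change
    return None


if __name__ == "__main__":
    # Hypothetical samples: progress stops after t=60 while state stays "running".
    history = [
        (0.0, "running", 100),
        (60.0, "running", 200),
        (120.0, "running", 200),
        (3800.0, "running", 200),
    ]
    print(find_stall(history, max_idle_seconds=3600))
```

With max_idle_seconds set to the operation timeout (3600 for 1h), a check like this would have flagged the state described above, where the status stays at running with no byte progress well past the timeout.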
Recovery attempts all failed:
- pcsm pause → cannot pause: Change Replication is not running
- pcsm resume / resume --from-failure → cannot resume: not paused or not resuming from failure
- pod restart → crash loop: FTL recover PCSM: cannot resume: replication is not started or not resuming from failure
So an interrupted initial sync seems to be unrecoverable. The only thing that works is a reset followed by a full re-clone, even though a checkpoints document exists on the target.
Questions:
- Should the initial-sync clone survive a source primary election/stepdown? Why didn’t it recover or even error?
- Any way to resume an interrupted clone from its checkpoint instead of re-cloning from scratch?
- Recommended approach for large clusters where source elections during a multi-day clone are unavoidable?
Thank you!