Initial-sync clone hangs forever after source primary election

PCSM 0.8.1, Atlas (8.0.23) → Percona Server 8.0.21. Cloning ~5.7 TB with --clone-segment-size=1GiB and --mongodb-operation-timeout=1h.

At ~50% cloned, Atlas did a rolling restart that elected a new primary. The clone stopped immediately: clonedSizeBytes froze and reads, inserts, CPU, and network all dropped to 0, but pcsm status still showed state: running, Initial Sync: Cloning Data. It never recovered, even more than 1h later (past the operation timeout), and the logs show no error and no retry.

Recovery attempts all failed:

  • pcsm pause → cannot pause: Change Replication is not running
  • pcsm resume / resume --from-failure → cannot resume: not paused or not resuming from failure
  • pod restart → crash loop: FTL recover PCSM: cannot resume: replication is not started or not resuming from failure

So an interrupted initial sync seems unrecoverable. Only reset + full re-clone works, despite a checkpoints doc existing on the target.

Questions:

  1. Should the initial-sync clone survive a source primary election/stepdown? Why didn’t it recover or even error?
  2. Any way to resume an interrupted clone from its checkpoint instead of re-cloning from scratch?
  3. Recommended approach for large clusters where source elections during a multi-day clone are unavoidable?

Thank you!

Hello @Yeruchom,

Thanks for reaching out to us.

First, the clone phase (copying existing data) is not recoverable or resumable; only the replication phase (watching and applying change stream events) is. So if something goes wrong during the clone, the sync has to be restarted.

In general, PCSM has a mechanism to retry DB operations when a transient error occurs. These are the error codes we retry:
```
11602: {}, // InterruptedDueToReplStateChange
91: {}, // ShutdownInProgress
189: {}, // PrimarySteppedDown
10107: {}, // NotWritablePrimary
13435: {}, // NotPrimaryNoSecondaryOk
```

If a retry happens, you should see a log line like this: “Retryable error (attempt %d): error, retrying in 15s”.

To answer the particular issue you encountered, though, we need the PCSM logs to see what actually happened: which error occurred, whether we attempted a retry, and on which operation it failed.