Initial-sync clone hangs forever after source primary election

Yeruchom · June 7, 2026, 12:53pm

PCSM 0.8.1, Atlas (8.0.23) → Percona Server 8.0.21. Cloning ~5.7 TB with --clone-segment-size=1GiB and --mongodb-operation-timeout=1h.

At ~50% cloned, Atlas did a rolling restart that elected a new primary. The clone stopped immediately: clonedSizeBytes froze and reads, inserts, CPU, and network all dropped to 0, but pcsm status still shows state: running, Initial Sync: Cloning Data. It never recovered, even more than 1h later (past the operation timeout), and the logs show no error and no retry.

Recovery attempts all failed:

pcsm pause → cannot pause: Change Replication is not running
pcsm resume / resume --from-failure → cannot resume: not paused or not resuming from failure
pod restart → crash loop: FTL recover PCSM: cannot resume: replication is not started or not resuming from failure

So an interrupted initial sync seems unrecoverable. Only reset + full re-clone works, despite a checkpoints doc existing on the target.

Questions:

Should the initial-sync clone survive a source primary election/stepdown? Why didn’t it recover or even error?
Any way to resume an interrupted clone from its checkpoint instead of re-cloning from scratch?
Recommended approach for large clusters where source elections during a multi-day clone are unavoidable?

Thank you!

Inel_Pandzic · June 8, 2026, 11:10am

Hello @Yeruchom ,

Thanks for reaching out to us.

First, the clone phase (copying existing data) is not recoverable or resumable; only the replication phase (watching and applying change stream events) is. So if something goes wrong during the clone, the sync has to be restarted.

Generally, PCSM has a mechanism to retry DB operations when a transient error occurs. These are the errors we retry:
```
11602: {}, // InterruptedDueToReplStateChange
91: {}, // ShutdownInProgress
189: {}, // PrimarySteppedDown
10107: {}, // NotWritablePrimary
13435: {}, // NotPrimaryNoSecondaryOk
```

If we retry, you should see a log line like this: “Retryable error (attempt %d): error, retrying in 15s”.
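To illustrate the general idea, here is a minimal sketch of retrying on that set of error codes. This is not PCSM's actual code: `codedError`, `isTransient`, and `withRetry` are hypothetical names made up for the example, and the attempt limit and delay are placeholder values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// codedError is a hypothetical stand-in for a server error that carries
// a numeric code; a real driver error type would be used in practice.
type codedError struct {
	Code int
	Name string
}

func (e *codedError) Error() string {
	return fmt.Sprintf("%s (code %d)", e.Name, e.Code)
}

// transientCodes mirrors the list of retryable error codes quoted above.
var transientCodes = map[int]struct{}{
	11602: {}, // InterruptedDueToReplStateChange
	91:    {}, // ShutdownInProgress
	189:   {}, // PrimarySteppedDown
	10107: {}, // NotWritablePrimary
	13435: {}, // NotPrimaryNoSecondaryOk
}

// isTransient reports whether err wraps a codedError with a retryable code.
func isTransient(err error) bool {
	var ce *codedError
	if !errors.As(err, &ce) {
		return false
	}
	_, ok := transientCodes[ce.Code]
	return ok
}

// withRetry calls op until it succeeds, returns a non-transient error,
// or has been attempted maxAttempts times (an assumed limit).
func withRetry(maxAttempts int, delay time.Duration, op func() error) error {
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil || !isTransient(err) || attempt >= maxAttempts {
			return err
		}
		fmt.Printf("Retryable error (attempt %d): %v, retrying in %s\n", attempt, err, delay)
		time.Sleep(delay)
	}
}

func main() {
	calls := 0
	// Simulate an operation that fails twice with PrimarySteppedDown.
	err := withRetry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &codedError{Code: 189, Name: "PrimarySteppedDown"}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Note that a loop like this only helps when the operation actually returns one of these errors. A call that blocks without ever returning an error would never reach the retry path, which is one reason the logs matter here.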

But to answer the particular issue you encountered, we need some PCSM logs to see what is actually going on: which error occurred, whether we attempted a retry, and which operation it failed on.
