Frequent NoSuchTransaction Errors When Running mongosync with the Percona Kubernetes Operator for MongoDB

Description:

I’m using mongosync to replicate data from one MongoDB cluster to another, both managed by the Percona Kubernetes Operator. I’m seeing a flood of NoSuchTransaction errors on the destination side, and it looks like mongosync is repeatedly aborting or expiring transactions before it can commit each batch of CRUD events. I’m syncing from a single replica set to a sharded cluster made up of multiple replica sets.

Steps to Reproduce:

  • Deploy Source Cluster
  • Deploy Destination Cluster
  • Run mongosync (example invocation sketched below)
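
For reference, the mongosync invocation is roughly the following (hostnames are placeholders, not my real values; the flags and the start call are how I understand them from the mongosync docs):

mongosync \
  --cluster0 "mongodb://<source-rs0-host>:27017/?replicaSet=rs0" \
  --cluster1 "mongodb://<destination-mongos-endpoint>:27017" \
  --logPath /var/log/mongosync

# once mongosync is listening on its default port 27182, start the migration
curl -X POST http://localhost:27182/api/v1/start \
  --data '{"source": "cluster0", "destination": "cluster1"}'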

Version:

MongoDB 8

Logs:

{"time":"2025-05-13T07:45:15.302926Z","level":"debug","serverID":"7c04cb2b","mongosyncID":"coordinator","crudBatchID":"42c2b834-62ab-42e9-a986-47410fe0b245","sessionID":{"id": {"$uuid": "5dd1147f-1409-43de-8fb8-edca4f675c54"}},"componentNames":["Change Event Application","Change Event Application","CRUD Processors","CRUD Processors","Change Event Applier 45 (CRUD)"],"errorFromPreviousTransactionFunctionCall":{"msErrorLabels":["serverError"],"clientType":"destination","database":"metaverse","collection":"item_scheduler","collectionUUID":"134ed987-f29e-4dcf-bf6a-4833cdb67917","failedCommand":"RunCommand","failedRunCommand":"[{update item_scheduler} {updates [[{q {\"_id\": {\"$oid\":\"66cdf56abe81e218a7bc1931\"}}} {u [[{$_internalApplyOplogUpdate [{oplogUpdate {\"$v\": {\"$numberInt\":\"2\"},\"diff\": {\"u\": {\"fetch_sales_at\": {\"$date\":{\"$numberLong\":\"1747208712681\"}}}}}}]}]]} {multi false} {upsert false}]]} {bypassDocumentValidation true} {bypassEmptyTsReplacement true}]","message":"Change Event Applier 45 (CRUD) failed to apply update event (cluster time: &{T:1747122315 I:12}, namespace: { db: metaverse, coll: item_scheduler, sourceUUID: <nil>, destUUID: <nil> }, collUUID: 84d025f9-ec57-4094-b894-ecd4171fa7e0): failed to update document when querying on the _id field: failed to update document on destination: failed to execute a command on the MongoDB server: (NoSuchTransaction) cannot continue txnId 1991 for session 5dd1147f-1409-43de-8fb8-edca4f675c54 - O0CMtIVItQN4IsEOsJdrPL8s7jv5xwh5a/A5Qfvs2A8= -  -  with txnRetryCounter 0"},"timesCalled":2,"transactionIdentifier":"ApplyEventsBatch","message":"Calling transaction callback."}

Expected Result:

All change events should be applied cleanly in batches, with no transaction aborts. mongosync should steadily advance its progress without retrying or losing events.

Actual Result:

Each CRUD batch begins a multi-document transaction on the destination, but the server frequently responds with:

(NoSuchTransaction) cannot continue txnId <…> for session <…> – with txnRetryCounter 0

Additional Information:

I came across this article - Don't Use Load Balancer In front of Mongos | Finisky Garden

In short:

  • The Percona Operator may expose multiple mongos instances behind a single endpoint; if mongosync’s transaction commands hit different mongos routers (because there is no session pinning), will the transactions be aborted? My current mongos block is below, and a sticky-session variant is sketched after it.
  mongos:
    size: 3
    expose:
      enabled: true
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        service.beta.kubernetes.io/aws-load-balancer-ip-address-type: ipv4
        service.beta.kubernetes.io/aws-load-balancer-name: acme-prod-psmdb-default-sharded
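
If I keep the single LoadBalancer, one thing I’ve been considering is enabling source-IP stickiness on the NLB target group via an AWS Load Balancer Controller annotation, roughly like this (this assumes the AWS Load Balancer Controller is what provisions the NLB, and I’m not sure NLB stickiness alone is enough, since kube-proxy can still spread connections across the mongos pods behind the Service):

  mongos:
    size: 3
    expose:
      enabled: true
      type: LoadBalancer
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
        service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        # sticky sessions on the NLB target group (annotation name from the controller docs)
        service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: stickiness.enabled=true,stickiness.type=source_ip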

Any guidance on how to configure the Percona Operator, mongosync settings, or MongoDB parameters to eliminate these NoSuchTransaction failures would be greatly appreciated!

This documentation (Sharded Cluster Components - Database Manual v8.0 - MongoDB Docs) suggests to me that, if mongos is exposed in this manner (Exposing the cluster - Percona Operator for MongoDB), then client affinity / sessionAffinity / sticky sessions ought to be enabled.

However, if I look at the Service that is created, I see:

kubectl describe service psmdb-default-sharded-mongos -n mongodb

Session Affinity:         None

ChatGPT’s take on this:

Yes. Whenever you front your mongos tier with a single LoadBalancer (or any TCP-proxy) you must enable client-affinity (aka “sticky sessions”) so that every connection from the same application process (and in particular every cursor getMore) winds up on the same mongos. Otherwise you’ll see errors like CursorNotFound or unexpected routing behavior.


The problem I’m having and CursorNotFound seem related. I noticed there is documentation where you describe this problem - Exposing the cluster - Percona Operator for MongoDB.

I’m at a loss here as to what I should be doing:

  1. Enabling session affinity at the Service layer (there is no parameter exposed for this in the Helm chart); a kubectl patch sketch follows this list
  2. Enabling sticky sessions at the NLB layer
  3. Using service-per-pod
  4. Using target type instance vs. ip
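
For option 1, the only approach I can see is patching the Service directly, since the Helm chart doesn’t expose a sessionAffinity parameter; the operator may well reconcile this change away again, so I treat it as an experiment rather than a fix:

kubectl -n mongodb patch service psmdb-default-sharded-mongos \
  --type merge \
  -p '{"spec":{"sessionAffinity":"ClientIP","sessionAffinityConfig":{"clientIP":{"timeoutSeconds":10800}}}}'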

I believe the problem has been resolved by using service-per-pod exposure (Exposing the cluster - Percona Operator for MongoDB) instead of the single LoadBalancer entry point described on the same page.
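
For anyone hitting the same thing, the change on my side was roughly the following in the mongos expose section (I’m paraphrasing from the Exposing the cluster page, so the exact field names may differ between operator versions; please correct me if I have them wrong):

  mongos:
    size: 3
    expose:
      enabled: true
      # one Service per mongos pod instead of a single shared LoadBalancer
      servicePerPod: true
      type: ClusterIP

mongosync’s destination connection string then lists all of the per-pod mongos hostnames, and as far as I understand the driver can pin each transaction to a single mongos, which the shared load balancer endpoint was preventing.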

Given the problems that the single entry point causes, I almost feel like service-per-pod should be the default recommended way.