We have a fairly minimal setup with 3-cfg, 3-mongos, 3-mongod [all for rs0]; likely in future to grow. At our initial size though we are continually running into issues with CursorNotFound when scanning reasonably small collections (i.e. the timeouts and MB limits are not coming into play).
It appears our main issue is likely connection affinity, namely a given service uses the single ClusterIP for the mongos - but this is transparently round-robin’d between the mongos instances, and if that occurs mid-scan it results in CursorNotFound.
I’m wondering what others have done to resolve this?
Potential solutions and my thoughts thus-far:
1. Reduce mongos to 1, which makes affinity irrelevant. This is workable today, but I worry that it just pushes the problem to later, when we need to scale up.
2. Enable sessionAffinity: ClientIP for the mongos service. We actually tried this; it appears to reduce the incidence of the issue but does not completely eliminate it. This is because our kube-proxy is configured in iptables mode, so there is an affinity timeout of 3h (10800 sec), after which it'll switch as before.
3. Adjust kube-proxy to something with source-hash [e.g. ipvs-sh]; this would work, but I see it as an extreme option as it affects all services, not solely mongos. Because of this I prefer not to do it.
   Alternative: Move to a more advanced network fabric that permits this routing configuration per-service. This is also an extreme option in my mind.
4. Have a service per mongos (or semi-equivalently make it a StatefulSet, not a Deployment); this would have the effect of moving mongos instance selection into our client, which will not switch mid-operation, and thus eliminate the problem.
My preference to solve this is #4 and this feels like the ‘correct’ solution generally - but it requires modification of the operator (or us moving away from it).
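For anyone wanting to try the sessionAffinity option, it comes down to two fields on the mongos Service. Below is a minimal sketch, written as a Python dict that can be serialized to JSON and passed to `kubectl patch`. The function name is made up for illustration, and the operator may well revert manual changes to a Service it manages. Kubernetes defaults `timeoutSeconds` to 10800 and caps it at 86400, so raising it only lengthens the window; it does not remove the switch.

```python
import json

# Kubernetes upper bound for sessionAffinityConfig.clientIP.timeoutSeconds (1 day).
MAX_AFFINITY_TIMEOUT = 86400


def session_affinity_patch(timeout_seconds: int = 10800) -> dict:
    """Build a Service patch that enables ClientIP session affinity.

    timeout_seconds is how long kube-proxy keeps a client IP pinned to
    the same backend pod; 10800 (3h) is the Kubernetes default.
    """
    if not 0 < timeout_seconds <= MAX_AFFINITY_TIMEOUT:
        raise ValueError(
            f"timeout_seconds must be in 1..{MAX_AFFINITY_TIMEOUT}, got {timeout_seconds}"
        )
    return {
        "spec": {
            "sessionAffinity": "ClientIP",
            "sessionAffinityConfig": {
                "clientIP": {"timeoutSeconds": timeout_seconds}
            },
        }
    }


if __name__ == "__main__":
    # Emit the patch as JSON so it can be passed to `kubectl patch ... -p`.
    print(json.dumps(session_affinity_patch(MAX_AFFINITY_TIMEOUT)))
```

Saved as, say, `patch.py`, the output could be applied with `kubectl patch service <mongos-service> -p "$(python patch.py)"`, where `<mongos-service>` is whatever your mongos Service is called.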
Wondering how other people have dealt with this?
Hello @Nick_Cooper ,
thank you for submitting this.
It seems to be a similar issue to the one described here: [K8SPSMDB-347] support session affinity for mongos service - Percona JIRA
But we concluded that it is an extremely rare case, as the connection stays within a single TCP session and should not be jumping between the nodes.
Are there any frequent mongos restarts in your cluster?
Do you have a good way to reproduce this issue?
I do agree it is similar to that issue (likely the same); however, I do not think the conclusion that traffic stays within a single TCP session is accurate. If kube-proxy is running in iptables mode, the packet filter rule that provides affinity is redirected about every 3hrs by default, i.e. the connection can freely jump between nodes if the client node is sufficiently long-lived.
We do not have frequent mongos restarts [in fact they have been running approximately a month]; however, we notice an increased likelihood of this error when our client-server is itself also long-lived. We do not have reliable reproduction steps beyond having continuous traffic from a long-lived client. We have also validated that pinning a connection to a single mongos does not exhibit the issue.
Our current work-around we’re looking to do is have a k8-aware client side library to expose all the mongos pods to the pymongo client (instead of the sole synthetic service address). This could be more neatly achieved if they had stable names/individual services which is likely our next step.
@Sergey_Pronin just wondering if you had any further thoughts on this?
If we had an experimental option to create N services we could test it in our cluster to validate if it solves the issue?
Hello @Nick_Cooper ,
sorry, dropped the ball here. Let me discuss it with our MongoDB team internally.
We decided to fork the operator and move to statefulsets. So will reply here with our findings if this fixes the problem (it is a rare error so will take some time to validate)
As a heads up, since we migrated to our forked operator - we have seen zero incidents of this.
Our change was to make the mongos a statefulset [not deployment], and then in pymongo give it a direct reference to each member. As noted above, I believe this is an issue for any long-lived client that may see unexpected switching after the 3hr mark on k8.
Hello @Nick_Cooper .
Thank you for sharing. This means that you need a service per mongos pod, right?
Actually no, having just the single service (as with the current operator) is fine; since statefulset pods then gain stable names, we have:
MONGO_HOST = [
(Our mongo instance is named ‘mongo’)
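Since the list above was cut off, here is a minimal sketch of what such a host list could look like; it is not the original code. It assumes the usual StatefulSet DNS pattern `<pod>.<governing-service>.<namespace>.svc.cluster.local`, which relies on the StatefulSet having a governing headless service. The StatefulSet name `mongo-mongos`, the service name `mongo-mongos`, the namespace `default`, and port 27017 are all hypothetical, as are the two helper functions.

```python
def mongos_hosts(statefulset: str, service: str, namespace: str,
                 replicas: int, port: int = 27017) -> list[str]:
    """Return one stable host:port entry per StatefulSet pod (ordinals 0..replicas-1)."""
    return [
        f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local:{port}"
        for ordinal in range(replicas)
    ]


def mongo_uri(hosts: list[str]) -> str:
    """Join the hosts into a single multi-host MongoDB connection string."""
    return "mongodb://" + ",".join(hosts)


# Hypothetical names; substitute the ones from your own cluster.
MONGO_HOST = mongos_hosts("mongo-mongos", "mongo-mongos", "default", replicas=3)

if __name__ == "__main__":
    for host in MONGO_HOST:
        print(host)
    print(mongo_uri(MONGO_HOST))
```

Either form can be handed to pymongo, e.g. `pymongo.MongoClient(MONGO_HOST)` or `pymongo.MongoClient(mongo_uri(MONGO_HOST))`. The client then chooses the mongos itself instead of kube-proxy choosing for it, which is what keeps a scan on the mongos that opened its cursor.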
Just figured I'd check in again and mention that we have not seen this error at all since Sept 9th [when we forked the operator and had the client be aware of all three].
I'm now quite confident that the transparent proxy mode of K8 is at fault here, and that all clients should use stable names to all mongos instances.
Hello @Nick_Cooper - thank you for sharing.
What do you think about using Service SessionAffinity flag for this?
@Nick_Cooper @Sergey_Pronin Please advise how I can set sessionAffinity: ClientIP. I do not see it in the chart values.yaml. I am having the same CursorNotFound issue.
Hello @sohahm ,
since version 1.12.0 of the Operator we run mongos as a statefulset and allow users to expose them through a service per pod. In that case your database client will take care of Cursor tracking.
See this option for more details: Custom Resource options - Percona Operator for MongoDB
And this JIRA ticket is the one delivering it: [K8SPSMDB-599] Multi-thread transaction failure when using the default mongos ClusterIP service - Percona JIRA
Please let me know if you still have questions.