We have a fairly minimal setup with 3-cfg, 3-mongos, 3-mongod [all for rs0], likely to grow in future. Even at our initial size, though, we continually run into CursorNotFound errors when scanning reasonably small collections (i.e. the cursor timeouts and MB limits are not coming into play).
It appears our main issue is likely connection affinity: a given client uses the single ClusterIP for the mongos Service, but this is transparently round-robin’d between the mongos instances, and if a switch occurs mid-scan it results in CursorNotFound.
I’m wondering what others have done to resolve this?
Potential solutions and my thoughts thus far:
1. Reduce mongos to 1 - makes affinity irrelevant. This is workable today, but I worry that scaling up later just pushes the problem down the road.
2. Enable sessionAffinity: ClientIP on the mongos Service. We actually tried this; it does appear to reduce the incidence of the issue but does not completely eliminate it. This is because our kube-proxy is configured in iptables mode, so the affinity has a timeout of 3h (10800 sec), after which it will switch as before (see the Service sketch at the end of this post).
3. Adjust kube-proxy to something with source-hashing [e.g. ipvs-sh]. This would work, but I see it as an extreme option as it affects all services, not solely mongos, so I prefer not to do it. Alternative: move to a more advanced network fabric that permits this routing configuration per-service - also an extreme option in my mind.
4. Have a service per mongos (or, semi-equivalently, make it a StatefulSet rather than a Deployment). This would move mongos instance selection into our client, which will not switch mid-operation, and thus eliminate the problem.
My preference is #4, and it feels like the ‘correct’ solution generally - but it requires modifying the operator (or us moving away from it).
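For reference, option 2 on the Service side looks roughly like this (the name and selector are placeholders for whatever the operator generates; 10800s is the Kubernetes default affinity timeout):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-cluster-name-mongos            # placeholder for the operator-generated mongos Service
spec:
  selector:
    app.kubernetes.io/component: mongos   # placeholder label
  ports:
    - port: 27017
      targetPort: 27017
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800               # default (3h); affinity can still be lost after this, so it only shrinks the window
```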
But we have concluded that it is an extremely rare case, as the connection stays within a single TCP session and should not jump between the nodes.
Are there any frequent mongos restarts in your cluster?
Do you have a good way to reproduce this issue?
I do agree it is similar to that issue (likely the same), however I do not think the conclusion that it stays within a single TCP session is accurate. If kube-proxy is running in iptables mode, the packet-filter rules that provide affinity expire roughly every 3hrs by default - i.e. the connection can freely jump between nodes if the client is sufficiently long-lived.
We do not have frequent mongos restarts [in fact they have been running for approximately a month], however we notice an increased likelihood of this error when our client service is itself also long-lived. We do not have reliable reproduction steps beyond having continuous traffic from a long-lived client. We have also validated that pinning a connection to a single mongos does not exhibit the issue.
The work-around we’re currently looking at is a k8s-aware client-side library that exposes all the mongos pods to the pymongo client (instead of the single synthetic Service address). This could be achieved more neatly if they had stable names/individual services, which is likely our next step.
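To illustrate the "individual services" idea: assuming the mongos pods get stable names (e.g. from a StatefulSet), one hand-written Service per pod could look roughly like this (names and labels are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongos-0                          # one Service per mongos pod
spec:
  selector:
    # StatefulSet pods automatically carry this label with their own pod name
    statefulset.kubernetes.io/pod-name: my-cluster-name-mongos-0
  ports:
    - port: 27017
      targetPort: 27017
```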
We decided to fork the operator and move to StatefulSets, so we will reply here with our findings if this fixes the problem (it is a rare error, so it will take some time to validate).
As a heads up: since we migrated to our forked operator we have seen zero incidents of this.
Our change was to make the mongos a StatefulSet [not a Deployment] and then, in pymongo, give it a direct reference to each member. As noted above, I believe this is an issue for any long-lived client, which may see unexpected switching after the 3hr mark on k8s.
Just figured I’d check in again and mention that we have not seen this error at all since Sept 9th [when we forked the operator and made the client aware of all three mongos].
I’m now quite confident that the transparent proxy mode of k8s is at fault here, and that all clients should use stable names to all mongos instances.
@Nick_Cooper @Sergey_Pronin Please advise how I can set sessionAffinity: ClientIP. I do not see it in the chart values.yaml. I am having the same CursorNotFound issue.
Since version 1.12.0 of the Operator we run mongos as a StatefulSet and allow users to expose them through a service per pod. In that case your database client will take care of cursor tracking.
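Roughly, the cr.yaml excerpt looks like this (please verify the exact field path against the documentation for your Operator version):

```yaml
sharding:
  mongos:
    expose:
      servicePerPod: true   # one Service (and a stable DNS name) per mongos pod
```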
I installed with default settings (i.e. servicePerPod set to false) and used the documented connection string mongodb://userAdmin:userAdminPassword@my-cluster-name-mongos.<namespace name>.svc.cluster.local/admin?ssl=false
I thought the cluster was healthy, but eventually ran into the CursorNotFound error in a client application.
After updating sharding.mongos.expose.servicePerPod: true, I then had to update my connection string to mongodb://userAdmin:userAdminPassword@mongodb-mongos-0.<namespace>.svc.cluster.local:27017,mongodb-mongos-1.<namespace>.svc.cluster.local:27017/admin?ssl=false
Would be nice if the docs warned me about the error!
Hello, I am currently facing the same issue. I have two pods for mongos, and our system uses Lambda, which is in the same VPC as the Mongo cluster but cannot access it directly. Therefore, I created a service (NodePort) for mongos. I also use CloudMap + External DNS to create DNS records for the Node IPs.
When I use sharding.mongos.expose.servicePerPod: true, there are two services with the same annotations, which causes an error for External DNS when two services expose the same URL. Is there a way I can customize the annotations for each service?
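To make the collision concrete, both generated Services end up with the same annotations, so External DNS tries to own the same record twice - roughly like this (the hostname annotation value here is just an example of what we set):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongodb-mongos-0
  annotations:
    external-dns.alpha.kubernetes.io/hostname: mongos.example.internal
---
apiVersion: v1
kind: Service
metadata:
  name: mongodb-mongos-1
  annotations:
    external-dns.alpha.kubernetes.io/hostname: mongos.example.internal   # same hostname on both -> External DNS conflict
```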
@Sergey_Pronin Yes, I think we should have a separate domain name per Mongos Pod/Service.
I am currently getting CursorNotFound errors, so I think using a service per pod will be the solution for us. But applying the above solution is causing problems with our External DNS and Cloud Map: multiple services using the same domain cause problems for External DNS when they try to override each other.