Affinity and Mongos and CursorNotFound - oh my!

We have a fairly minimal setup with 3 cfg, 3 mongos, and 3 mongod [all for rs0], likely to grow in future. At our current size, though, we are continually running into CursorNotFound errors when scanning reasonably small collections (i.e. the timeouts and MB limits are not coming into play).

It appears our main issue is likely connection affinity: a given service uses the single ClusterIP for the mongos, but this is transparently round-robin’d between the mongos instances, and if a switch happens mid-scan the new mongos does not know the cursor and the client gets CursorNotFound.

I’m wondering what others have done to resolve this?

Potential solutions and my thoughts thus-far:

  1. Reduce Mongos to 1 - makes affinity irrelevant. This is workable today, but I worry that scaling up later just pushes the problem down the road.

  2. Enable sessionAffinity: ClientIP on the mongos service (a sketch of this appears after this list). We actually tried this; it does appear to reduce the incidence of the issue but does not completely eliminate it, because our kube-proxy is configured in iptables mode and the affinity has a default timeout of 3h (10800 sec), after which it can switch backends as before.

  3. Adjust kube-proxy to something with source-hash [e.g. ipvs-sh]; this would work but I see it as an extreme option as it affects all services, not solely mongos. Because of this I prefer not to do it.
    Alternative: Move to a more advanced network fabric that permits this routing configuration per-service. This is also an extreme option in my mind.

  4. Have a service per mongos (or, semi-equivalently, make it a StatefulSet instead of a Deployment); this moves mongos instance selection into our client, which will not switch mid-operation and thus eliminates the problem.
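For reference, option #2 amounts to something like the following Service spec (a minimal sketch - the name and selector are hypothetical placeholders; the relevant parts are sessionAffinity and the 10800 s default timeout):

    # Sketch of option #2 - ClientIP affinity on the mongos Service.
    # The name and selector are hypothetical placeholders; raising
    # timeoutSeconds only delays the switch rather than preventing it.
    apiVersion: v1
    kind: Service
    metadata:
      name: mongo-mongos
    spec:
      selector:
        app.kubernetes.io/component: mongos
      ports:
        - port: 27017
          targetPort: 27017
      sessionAffinity: ClientIP
      sessionAffinityConfig:
        clientIP:
          timeoutSeconds: 10800   # the 3h default mentioned above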

My preference to solve this is #4 and this feels like the ‘correct’ solution generally - but it requires modification of the operator (or us moving away from it).

Wondering how other people have dealt with this?

Hello @Nick_Cooper ,

thank you for submitting this.
It seems to be a similar issue to the one described here: [K8SPSMDB-347] support session affinity for mongos service - Percona JIRA

But we have concluded that it is an extremely rare case, as the connection stays within a single TCP session and should not be jumping between the nodes.

Are there any frequent mongos restarts in your cluster?
Do you have a good way to reproduce this issue?

Hello,

I do agree it is similar to that issue (likely the same), however I do not think the conclusion about staying within a single TCP session is accurate. If kube-proxy is running in iptables mode, the packet-filter rule that provides the affinity expires roughly every 3 hrs by default - i.e. traffic can freely jump between mongos nodes if the client is sufficiently long-lived.

We do not have frequent mongos restarts [in fact they have been running for approximately a month], however we notice an increased likelihood of this error when our client server is itself also long-lived. We do not have reliable reproduction steps beyond having continuous traffic from a long-lived client. We have also validated that pinning a connection to a single mongos does not exhibit the issue.

The work-around we are currently looking at is a k8s-aware client-side library that exposes all the mongos pods to the pymongo client (instead of the single synthetic service address) - a rough sketch is below. This could be achieved more neatly if they had stable names/individual services, which is likely our next step.
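Roughly what we have in mind (a sketch only - the namespace, label selector, and port are placeholders for our setup, not the operator's actual labels):

    # Resolve the mongos pod IPs directly and hand them to pymongo,
    # bypassing the round-robined ClusterIP.
    from kubernetes import client, config
    from pymongo import MongoClient

    def mongos_seed_list(namespace="mongo", selector="app.kubernetes.io/component=mongos"):
        config.load_incluster_config()  # we run inside the cluster
        pods = client.CoreV1Api().list_namespaced_pod(namespace, label_selector=selector)
        return [f"{p.status.pod_ip}:27017" for p in pods.items if p.status.pod_ip]

    # pymongo keeps a cursor on the mongos that opened it, so a mid-scan
    # switch cannot happen once the client talks to the pods directly.
    mongo = MongoClient(mongos_seed_list())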

@Sergey_Pronin just wondering if you had any further thoughts on this?

If there were an experimental option to create N services, we could test it in our cluster to validate whether it solves the issue.

Hello @Nick_Cooper ,

sorry, dropped the ball here. Let me discuss it with our MongoDB team internally.

Thank you!

We decided to fork the operator and move to StatefulSets, so I will reply here with our findings on whether this fixes the problem (it is a rare error, so it will take some time to validate).

As a heads up, since we migrated to our forked operator - we have seen zero incidents of this.

Our change was to make the mongos a StatefulSet [not a Deployment], and then in pymongo give it a direct reference to each member. As noted above, I believe this is an issue for any long-lived client, which may see unexpected switching after the 3 hr mark on k8s.

Hello @Nick_Cooper .

Thank you for sharing. This means that you need a service per mongos pod, right?

Actually no, having just the single service (as with the current operator) is fine; since the StatefulSet pods then have stable names, we have:

    MONGO_HOST = [
        "mongo-mongos-0.mongo-mongos",
        "mongo-mongos-1.mongo-mongos",
        "mongo-mongos-2.mongo-mongos",
    ]

(Our mongo instance is named ‘mongo’)
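For completeness, this is roughly how we hand that list to pymongo (auth/TLS options elided):

    from pymongo import MongoClient

    # MONGO_HOST is the list of stable per-pod DNS names shown above;
    # pymongo treats it as a seed list of mongos and keeps each cursor
    # on the mongos that opened it.
    client = MongoClient(MONGO_HOST, port=27017)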

Hello,

Just figured I'd check in again and mention that we have not seen this error at all since Sept 9th [when we forked the operator and had the client be aware of all three mongos].

I'm now quite confident that the transparent proxying of the Kubernetes service is at fault here, and that all clients should use stable names to reach all mongos instances.

Thanks,

Nick.

Hello @Nick_Cooper - thank you for sharing.

What do you think about using the Service sessionAffinity flag for this?

@Nick_Cooper @Sergey_Pronin Please advise how I can set sessionAffinity: ClientIP. I do not see it in the chart values.yaml. I am having the same issue of CursorNotFound.

Hello @sohahm ,

since version 1.12.0 of the Operator we run mongos as a StatefulSet and allow users to expose them through a service per pod. In that case your database client will take care of cursor tracking.

See this option for more details: Custom Resource options - Percona Operator for MongoDB

And this JIRA ticket is the one delivering it: [K8SPSMDB-599] Multi-thread transaction failure when using the default mongos ClusterIP service - Percona JIRA
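In the cr.yaml this boils down to, roughly (abridged; only the key relevant here is shown):

    # Abridged cr.yaml fragment - expose each mongos through its own Service.
    sharding:
      mongos:
        expose:
          servicePerPod: true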

Please let me know if you still have questions.

Hey thanks for this addition!

What do you think about updating the Exposing the Cluster docs?

I installed with default settings (i.e. servicePerPod set to false) and used the documented connection string mongodb://userAdmin:userAdminPassword@my-cluster-name-mongos.<namespace name>.svc.cluster.local/admin?ssl=false

I thought the cluster was healthy, but eventually ran into the CursorNotFound error in a client application.

After updating sharding.mongos.expose.servicePerPod: true, I then had to update my connection string to mongodb://userAdmin:userAdminPassword@mongodb-mongos-0.<namespace>.svc.cluster.local:27017,mongodb-mongos-1.<namespace>.svc.cluster.local:27017/admin?ssl=false

Would be nice if the docs warned me about the error!

@Domenic_Bove - yes, I think it makes sense. Thank you for the suggestion.

We will document it here: [K8SPSMDB-1026] Document Service per Pod for mongos - Percona JIRA

Hello, I am currently facing the same issue. I have two pods for mongos, and our system uses Lambda, which is in the same VPC as the Mongo cluster but cannot access it directly. Therefore, I created a service (NodePort) for mongos. I also use CloudMap + External DNS to create DNS records for the Node IPs.

When I use sharding.mongos.expose.servicePerPod: true, there are two services with the same annotations, which causes an error in External DNS because the two services expose the same URL. Is there a way I can customize the annotations for each service?

@Thang_Mai how would you customize it? Do you want to have a separate domain name per Mongos Pod/Service?

I’m curious how does it work for other products that have multiple Service resources.

Also, is it imperative for you to use ServicePerPod?

@Sergey_Pronin Yes, I think we should have a separate domain name per Mongos Pod/Service.

I am currently getting CursorNotFound errors, so I think using a service per pod will be the solution for us. But applying the above solution causes problems with our External DNS and Cloud Map: multiple services using the same domain conflict, as External DNS tries to have them override each other.
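To illustrate what I mean, a distinct external-dns hostname per mongos Service would avoid the conflict - something like the following (hypothetical sketch: the names and domain are made up, and I do not see a way to set the annotation per service today):

    # Hypothetical - one Service per mongos pod, each with its own
    # external-dns hostname (spec omitted).
    apiVersion: v1
    kind: Service
    metadata:
      name: mongo-mongos-0
      annotations:
        external-dns.alpha.kubernetes.io/hostname: mongos-0.db.example.internal
    # the second Service would carry mongos-1.db.example.internal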