Affinity and Mongos and CursorNotFound - oh my!

We have a fairly minimal setup with 3 config servers, 3 mongos, and 3 mongod [all for rs0], likely to grow in future. Even at this initial size, though, we are continually running into CursorNotFound errors when scanning reasonably small collections (i.e. the timeouts and MB limits are not coming into play).

It appears our main issue is connection affinity: a given service connects through the single ClusterIP in front of the mongos instances, but kube-proxy transparently round-robins traffic between those instances, and if a switch occurs mid-scan the next getMore lands on a mongos that never created the cursor, resulting in CursorNotFound.
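
A rough illustration of why this bites (pymongo, with a placeholder Service address and made-up database/collection names):

```python
from pymongo import MongoClient
from pymongo.errors import CursorNotFound

# Placeholder address: the single synthetic Service in front of the mongos pods.
client = MongoClient("mongodb://mongos.mongodb.svc.cluster.local:27017")

try:
    # Each batch after the first is fetched with a separate getMore round-trip, so a
    # single logical scan spans many requests; per the affinity issue described above,
    # if any of them reaches a mongos other than the one that created the cursor,
    # the driver raises CursorNotFound.
    for doc in client.mydb.mycoll.find(batch_size=100):
        pass  # process doc
except CursorNotFound:
    # Raised when a getMore reaches a mongos that doesn't know the cursor id.
    raise
```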

I’m wondering what others have done to resolve this?

Potential solutions and my thoughts thus far:

  1. Reduce mongos to 1, which makes affinity irrelevant. This is workable today, but I worry that scaling up later just pushes the problem down the road.

  2. Enable sessionAffinity: ClientIP on the mongos Service (roughly the change sketched after this list). We actually tried this; it does appear to reduce the incidence of the issue but does not completely eliminate it, because our kube-proxy is configured in iptables mode and the affinity has a timeout of 3 h (10800 sec), after which it will switch as before.

  3. Switch kube-proxy to a scheduler with source hashing [e.g. IPVS with the sh scheduler]; this would work, but I see it as an extreme option as it affects all services, not solely mongos. Because of this I prefer not to do it.
    Alternative: move to a more advanced network fabric that permits this routing configuration per service. This is also an extreme option in my mind.

  4. Have a Service per mongos (or, semi-equivalently, make it a StatefulSet instead of a Deployment); this moves mongos instance selection into our client, which will not switch instances mid-operation and thus eliminates the problem.
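
For concreteness, option 2 amounts to something like the following. This is only a sketch using the Kubernetes Python client; the Service name and namespace are placeholders for whatever the operator creates, and the operator may well reconcile a manual change like this away:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()

# Keep a given client IP on the same mongos backend. In iptables mode the affinity
# still expires after timeoutSeconds (10800 s by default), which is why this reduces
# but does not eliminate the mid-scan switches.
core.patch_namespaced_service(
    name="my-cluster-mongos",   # placeholder Service name
    namespace="mongodb",        # placeholder namespace
    body={
        "spec": {
            "sessionAffinity": "ClientIP",
            "sessionAffinityConfig": {"clientIP": {"timeoutSeconds": 10800}},
        }
    },
)
```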

My preference is #4, and it feels like the 'correct' solution generally, but it requires modifying the operator (or us moving away from it).

Wondering how other people have dealt with this?

Hello @Nick_Cooper ,

thank you for submitting this.
It seems to be a similar issue to the one described here: https://jira.percona.com/browse/K8SPSMDB-347

However, we concluded that it is an extremely rare case, as the connection stays within a single TCP session and should not be jumping between the nodes.

Are there frequent mongos restarts in your cluster?
Do you have a good way to reproduce this issue?

Hello,

I do agree it is similar to that issue (likely the same one), however I do not think the conclusion about staying within a single TCP session is accurate. If kube-proxy is running in iptables mode, the affinity mapping it maintains expires roughly every 3 hrs by default, after which traffic can be redirected to a different backend; i.e. the connection can freely jump between nodes if the client is sufficiently long-lived.

We do not have frequent mongos restarts [in fact they have been running for approximately a month]. However, we notice an increased likelihood of this error when our client application is itself also long-lived; we do not have reliable reproduction steps beyond running continuous traffic from a long-lived client. We have also validated that pinning a connection to a single mongos does not exhibit the issue.

The work-around we are currently looking at is a Kubernetes-aware client-side library that exposes all of the mongos pods to the pymongo client (instead of the single synthetic Service address); a sketch of the idea is below. This could be achieved more neatly if the pods had stable names/individual Services, which is likely our next step.
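
A minimal sketch of that idea, assuming the pymongo and kubernetes Python packages (the namespace and label selector below are guesses at what the operator applies, not verified values):

```python
from kubernetes import client, config
from pymongo import MongoClient

config.load_incluster_config()  # our clients run inside the cluster
core = client.CoreV1Api()

# Discover the mongos pods directly instead of going through the ClusterIP Service.
pods = core.list_namespaced_pod(
    namespace="mongodb",                                   # placeholder namespace
    label_selector="app.kubernetes.io/component=mongos",   # placeholder label selector
)
seeds = [f"{p.status.pod_ip}:27017" for p in pods.items if p.status.pod_ip]

# Given a seed list of mongos addresses, the driver picks a mongos itself and sends
# each getMore back to the mongos that opened the cursor, so kube-proxy never gets
# a chance to reroute a scan mid-way.
mongo = MongoClient(seeds)
```

The obvious wart is that pod IPs change when pods are rescheduled, so the seed list needs periodic refreshing; stable per-pod names would remove that moving part.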

@spronin just wondering if you had any further thoughts on this?

If there were an experimental option to create N services (one per mongos), we could test it in our cluster and validate whether it solves the issue.

Hello @Nick_Cooper ,

sorry, I dropped the ball here. Let me discuss it with our MongoDB team internally.

Thank you!

We decided to fork the operator and move mongos to StatefulSets, so we will reply here with our findings on whether this fixes the problem (it is a rare error, so it will take some time to validate).
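
For anyone following along, the client side we are aiming for would look roughly like this; the hostnames are what we expect a StatefulSet behind a headless Service to produce in our fork, not output of the stock operator:

```python
from pymongo import MongoClient

# Hypothetical stable per-pod DNS names from a mongos StatefulSet + headless Service.
seeds = [
    "my-cluster-mongos-0.my-cluster-mongos.mongodb.svc.cluster.local:27017",
    "my-cluster-mongos-1.my-cluster-mongos.mongodb.svc.cluster.local:27017",
    "my-cluster-mongos-2.my-cluster-mongos.mongodb.svc.cluster.local:27017",
]

# With individual addresses the driver owns mongos selection and keeps each cursor on
# the mongos that created it, so the ClusterIP round-robin is out of the picture.
client = MongoClient(seeds)
```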