MongoDB v3.6 replica set: slow node selection on connection?

Hello there,
Last Saturday we were doing some deployment and maintenance work on our backends, including a rolling restart, one node at a time, of our Percona MongoDB cluster. After the reboots, while doing some QA, we noticed slowness when fetching data from one of our backends. We have some applications running on Node.js and others on PHP 7.x; both connect to our Mongo replica set with the same connection string.

We are running Percona Server for MongoDB v3.6.11-3.1 on Debian 8.9 “jessie” VMs. The hosts are Dell PowerEdge R730xd servers running VMware 6.x or later.
We noticed something odd: calling the same API with the same arguments from PHP 7.x, fetching data from the replica set usually took 1-2 seconds, but sometimes took 10-15 seconds. We called that API many times and tried changing the connection string, among other things. Our website was under maintenance at the time, so we weren’t getting any live traffic.

We have a 5-node replica set. Mongo4 is on a different VLAN from Mongo1-3 but in the same data center, and Mongo5 is in another data center. I stopped the Mongo services on Mongo4 and Mongo5 to rule out possible latency. That left 3 servers running the same operating system and the same version of Percona Mongo v3.6. Even with only those 3 servers, we still saw the odd behavior: fetches usually took 1-2 seconds, but sometimes more than 10 seconds.

Here is our connection String:


After some brainstorming, we added “connectTimeoutMS=900”, and it seems to have solved our problem: the API calls now take 1-2 seconds or less. The final connection string:


Our conclusion was that the MongoDB driver on the client side reaches the replica set and somehow has issues or times out when reaching a secondary, then switches to the primary, where the query gets executed.
I ran rs.status() a few times and saw that the nodes were in sync. How can I tell whether there is a problem when connecting? Could a node show as synced while something is still wrong with reading from that secondary?
I would appreciate any advice or comments on this matter.
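
One rough way to check for a connection-level problem, independent of what rs.status() reports, is to time a plain TCP connect to each member several times and look for attempts that fail or come close to the timeout. The sketch below is only an illustration using the Python standard library: the function names `time_connect` and `probe` are made up for this example, the default port 27017 is an assumption, and the 0.9 second timeout simply mirrors the connectTimeoutMS=900 value above. It measures only the TCP handshake, not MongoDB authentication or query time.

```python
import socket
import time
from typing import Dict, List, Optional, Tuple


def time_connect(host: str, port: int = 27017, timeout_s: float = 0.9) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return None


def probe(
    members: List[Tuple[str, int]], attempts: int = 5, timeout_s: float = 0.9
) -> Dict[Tuple[str, int], List[Optional[float]]]:
    """Map each (host, port) to its connect times; None marks a failed attempt."""
    return {
        (host, port): [time_connect(host, port, timeout_s) for _ in range(attempts)]
        for host, port in members
    }
```

For example, `probe([("mongo1.example.internal", 27017), ("mongo2.example.internal", 27017)])` (hypothetical hostnames) returns one list of timings per member. A member whose list contains None entries or values near the timeout would be the first one to look at.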

Hi J.
I don’t know what caused the high latency, but I can comment on the theory that connectTimeoutMS was triggering a new read attempt. The driver wouldn’t have switched to reading from the primary, because “readPreference=secondary” doesn’t do that (“readPreference=secondaryPreferred” does). If you have two or more secondaries, as is typical, it’s possible that the read succeeded on a second try against the other secondary node.
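
To make the difference between the two read preference modes concrete, here is a small toy model in Python. It is not driver code: the `Member` type and the `eligible` function are invented for this illustration, the choice of mongo1 as primary is an assumption, and real drivers also consider latency windows, tags, and staleness. It only shows which members each mode is allowed to pick.

```python
from typing import List, NamedTuple


class Member(NamedTuple):
    name: str
    state: str        # "PRIMARY" or "SECONDARY"
    reachable: bool   # False models a member the client cannot connect to


def eligible(members: List[Member], mode: str) -> List[str]:
    """Return the names of members a read may be routed to under `mode`."""
    up = [m for m in members if m.reachable]
    primaries = [m.name for m in up if m.state == "PRIMARY"]
    secondaries = [m.name for m in up if m.state == "SECONDARY"]
    if mode == "primary":
        return primaries
    if mode == "secondary":
        # Never falls back to the primary, even when no secondary is usable.
        return secondaries
    if mode == "secondaryPreferred":
        # Falls back to the primary only when no secondary is usable.
        return secondaries or primaries
    raise ValueError(f"mode not covered by this sketch: {mode}")


# Hypothetical 3-node set where one secondary cannot be reached.
rs = [
    Member("mongo1", "PRIMARY", True),
    Member("mongo2", "SECONDARY", False),
    Member("mongo3", "SECONDARY", True),
]
one_secondary_down = eligible(rs, "secondary")

# Same set with both secondaries unreachable.
rs_no_sec = [m._replace(reachable=False) if m.state == "SECONDARY" else m for m in rs]
strict = eligible(rs_no_sec, "secondary")
preferred = eligible(rs_no_sec, "secondaryPreferred")
```

In this model, a read with “secondary” and one bad secondary can still land on the other secondary, which matches the retry explanation above. With both secondaries unusable, “secondary” has nowhere to go, while “secondaryPreferred” falls back to the primary.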