Hi,
Setup
PSMDB version: 4.4.9-10;
OS: CentOS 7.9;
MongoDB configuration: 3-node Replica Set: Primary-Secondary-Secondary, one of the Secondaries is hidden with priority 0. No Arbiter;
Each node is configured in a VM.
Issue
I faced an issue where an application that accesses a database in a MongoDB cluster with the configuration above, had the following error:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1683718817, 1) } with id: 7195156194690859009.
This application is using the MongoDB C# driver v2.15.1 since the beginning, and does simple queries to fetch data. It is configured with readPreference=secondaryPreferred
and the connection string points to the replica set.
There is a patching process which runs every week on the VMs of the MongoDB cluster that updates packages (excluding MongoDB-related) in every VM, making sure that it is a Secondary node (if not, failover) and restarting the VM after the updates. During a run of this process, there were two relevant restarts (excluding the mention of hidden node restart here since the application doesn’t reed from there). The error mentioned above happened during each restart (where the node was Secondary and the application was reading from there). They happened for around 10-13 seconds and triggered 50k errors each, on the application side, where the application was not able to read.
After investigating on the MongoDB side, I didn’t find any evidence of specific errors in the log. Also, I was able to check that the application was able to read from the Primary while the Secondary was unavailable, as it is supposed when using readPreference=secondaryPreferred
. Then, when the Secondary node was back, the application pointed to the Secondary again.
Additional details about the first restart, as an example:
2023-05-10T11:39:15
- shutdown (with restart) on the Secondary started;
2023-05-10T11:39:16
- Primary node started to have connections;
2023-05-10T11:39:37
- last connection opened on the Primary node at (this application usually opens several connections per second, but it can also have gaps of a few seconds (5-10);
2023-05-10T11:40:15
- Secondary node was available for connections;
2023-05-10T11:40:17
- timestamp of the error on the driver - 1683718817;
2023-05-10T11:40:25
- Secondary node passed from STARTUP to STARTUP2;
2023-05-10T11:40:25
- Secondary node passed from STARTUP2 to RECOVERING;
2023-05-10T11:40:25
- Secondary node passed from RECOVERING to SECONDARY;
2023-05-10T11:40:25
- Secondary node started to have connections from the application again;
2023-05-10T11:40:25
- Errors started on the driver;
2023-05-10T11:40:38
- Errors stopped on the driver.
Other details
The patching process occurs very often (every week) so the restarts are frequent. This issue happened 4 times since the service is using the database on the cluster, for the last 8 months (since September 2022).
Occurrences:
#1 - November 2022 - 2 restarts - errors only in 1 of the restarts;
#2 - March 2023 - 2 restarts - errors only in 1 of the restarts;
#3 - March 2023 - 2 restarts - errors in the 2 restarts;
#4 - May 2023 - 2 restarts - errors in the 2 restarts (occurrence explained in this post).
There is only 1 database in this cluster. There are 2 other clusters, in different regions, with the same MongoDB configuration, and very similar workload and dataset, that never had any error. Also, I have several more MongoDB clusters configured the same way, for years, that never presented any issue like this one.
After some investigation online I found little information regarding this error but apparently, it has been resolved in previous versions of MongoDB. There was also an issue related to a GO driver.
Since I didn’t find any specific error on the MongoDB side, I will upgrade the C# driver to a more recent stable version 2.19.0 or 2.19.1 as a first step.
I found a great explanation about HMAC and $clusterTime
that may be helpful in understanding the problem Error while converting a mongoDB Cluster into a Replica Set - Stack Overflow.
Is anyone aware of why this may be happening? Is there any known issue about this? If it is possible to be a bug on MongoDB, should I open a ticket with this information for further investigation?
Kind regards,
João Soares