Cache Reader No keys found for HMAC that is valid for time

Hi,

Setup
PSMDB version: 4.4.9-10;
OS: CentOS 7.9;
MongoDB configuration: 3-node Replica Set: Primary-Secondary-Secondary, one of the Secondaries is hidden with priority 0. No Arbiter;
Each node is configured in a VM.

Issue
I faced an issue where an application that accesses a database in a MongoDB cluster with the configuration above had the following error:
MongoDB.Driver.MongoCommandException: Command aggregate failed: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1683718817, 1) } with id: 7195156194690859009.
This application has been using the MongoDB C# driver v2.15.1 since the beginning and only runs simple queries to fetch data. It is configured with readPreference=secondaryPreferred, and the connection string points to the replica set.
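
For illustration, the connection string has roughly this shape (the hostnames and replica set name below are placeholders, not the real ones):

mongodb://node-1:27017,node-2:27017,node-3:27017/?replicaSet=rs0&readPreference=secondaryPreferred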

There is a patching process which runs every week on the VMs of the MongoDB cluster. It updates packages (excluding MongoDB-related ones) on every VM, makes sure the node is a Secondary (failing over if it is not), and restarts the VM after the updates. During a run of this process there were two relevant restarts (I am excluding the hidden node's restart here since the application doesn't read from it). The error mentioned above happened during each restart (where the node was a Secondary and the application was reading from it). Each episode lasted around 10-13 seconds and triggered about 50k errors on the application side, during which the application was not able to read.
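
The "make sure the node is a Secondary, fail over if not" step amounts to something like this (a rough mongo shell sketch, not the actual patching script):

// Check whether this member is currently the primary; if so, step it down so
// an election promotes another member before the VM is restarted.
const hello = db.isMaster();   // isMaster is still the documented shell helper on 4.4
if (hello.ismaster) {
    rs.stepDown(120);          // this node stays ineligible for election for 120 seconds
}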

After investigating on the MongoDB side, I didn't find any evidence of specific errors in the logs. I was also able to confirm that the application could read from the Primary while the Secondary was unavailable, as is expected when using readPreference=secondaryPreferred. Then, when the Secondary node was back, the application pointed to the Secondary again.

Additional details about the first restart, as an example:
2023-05-10T11:39:15 - shutdown (with restart) on the Secondary started;
2023-05-10T11:39:16 - Primary node started to have connections;
2023-05-10T11:39:37 - last connection opened on the Primary node (this application usually opens several connections per second, but it can also have gaps of a few seconds (5-10));
2023-05-10T11:40:15 - Secondary node was available for connections;
2023-05-10T11:40:17 - timestamp of the error on the driver - 1683718817;
2023-05-10T11:40:25 - Secondary node passed from STARTUP to STARTUP2;
2023-05-10T11:40:25 - Secondary node passed from STARTUP2 to RECOVERING;
2023-05-10T11:40:25 - Secondary node passed from RECOVERING to SECONDARY;
2023-05-10T11:40:25 - Secondary node started to have connections from the application again;
2023-05-10T11:40:25 - Errors started on the driver;
2023-05-10T11:40:38 - Errors stopped on the driver.

Other details
The patching process occurs very often (every week), so the restarts are frequent. This issue has happened 4 times in the 8 months since the service started using the database on this cluster (September 2022).
Occurrences:
#1 - November 2022 - 2 restarts - errors in only 1 of the restarts;
#2 - March 2023 - 2 restarts - errors in only 1 of the restarts;
#3 - March 2023 - 2 restarts - errors in both restarts;
#4 - May 2023 - 2 restarts - errors in both restarts (the occurrence described in this post).
There is only 1 database in this cluster. There are 2 other clusters, in different regions, with the same MongoDB configuration and a very similar workload and dataset, that have never had this error. I also have several more MongoDB clusters configured the same way, running for years, that have never presented an issue like this one.

After some investigation online I found little information regarding this error, but apparently it was resolved in earlier versions of MongoDB. There was also a related issue in the Go driver.
Since I didn't find any specific error on the MongoDB side, as a first step I will upgrade the C# driver to a more recent stable version (2.19.0 or 2.19.1).
I also found a great explanation of HMAC and $clusterTime that may be helpful in understanding the problem: "Error while converting a mongoDB Cluster into a Replica Set" on Stack Overflow.

Is anyone aware of why this may be happening? Is there any known issue about this? If it is possible to be a bug on MongoDB, should I open a ticket with this information for further investigation?

Kind regards,
João Soares

Hi @joaorpsoares,

Looking at the PSMDB version, I would recommend upgrading to the latest minor patch of the 4.4 series. The version you are currently running has some serious bugs which are fixed in the newer patches. Kindly refer to the release notes for details.

As for the issue above, it appears to be a bug which is fixed in a newer patch of 4.4. Kindly refer to the MongoDB Forum thread for reference.

Regards,
Parag

Hi @Parag_Bhayani,

Thank you for your answer.
I have seen those before, but the issue described in SERVER-40535 was apparently solved in versions earlier than the one I'm using. Nevertheless, I understand that the advice in the MongoDB forum is to upgrade to the latest version.
I was just trying to understand the issue better before going down that path.

I will evaluate it and get back to you. If you have any other suggestions, feel free to reach out.

Kind regards,
João Soares

Hello,

We are running into the same issue with

percona/percona-server-mongodb:6.0.4-3

Can anyone give me advice on what to do here? Thanks and kind regards!

Hi @Parag_Bhayani,

Since my last message, I upgraded the MongoDB C# driver to v2.19.2 and tried to upgrade PSMDB to the latest minor version of 4.4, as you recommended (4.4.22-21).
When I was upgrading the MongoDB replica set, one node at a time, I ran into an issue. I followed the procedure recommended by Percona to upgrade the replica set, starting with one secondary node. After upgrading the first node, it started to lag behind the primary and never caught up again. In the MongoDB logs, I got the following messages in a loop:

{"t":{"$date":"2023-08-04T09:14:04.369+00:00"},"s":"I",  "c":"NETWORK",  "id":20124,   "ctx":"ReplCoordExtern-0","msg":"DBClientConnection failed to send message","attr":{"connString":"mongodb-rset-node-2:27017","error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:04.369+00:00"},"s":"I",  "c":"REPL",     "id":21275,   "ctx":"ReplCoordExtern-0","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1691140394,"i":78}},"t":28},"attemptsRemaining":1,"error":"HostUnreachable: ###"}}
{"t":{"$date":"2023-08-04T09:14:04.369+00:00"},"s":"I",  "c":"REPL",     "id":23437,   "ctx":"ReplCoordExtern-0","msg":"OplogFetcher reconnecting due to error","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"network error while attempting to run command 'isMaster' on host 'mongodb-rset-node-2:27017' "}}}
{"t":{"$date":"2023-08-04T09:14:04.370+00:00"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"ReplCoordExtern-0","msg":"Trying to reconnect","attr":{"connString":"mongodb-rset-node-2:27017 failed"}}
{"t":{"$date":"2023-08-04T09:14:09.386+00:00"},"s":"I",  "c":"NETWORK",  "id":20124,   "ctx":"ReplCoordExtern-0","msg":"DBClientConnection failed to send message","attr":{"connString":"mongodb-rset-node-2:27017","error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:09.386+00:00"},"s":"I",  "c":"NETWORK",  "id":20121,   "ctx":"ReplCoordExtern-0","msg":"Reconnect attempt failed","attr":{"connString":"mongodb-rset-node-2:27017 failed","error":""}}
{"t":{"$date":"2023-08-04T09:14:09.386+00:00"},"s":"I",  "c":"REPL",     "id":21274,   "ctx":"ReplCoordExtern-0","msg":"Error returned from oplog query (no more query restarts left)","attr":{"error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:09.386+00:00"},"s":"W",  "c":"REPL",     "id":21122,   "ctx":"BackgroundSync","msg":"Fetcher stopped querying remote oplog with error","attr":{"error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:09.387+00:00"},"s":"I",  "c":"REPL",     "id":21080,   "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"mongodb-rset-node-2:27017"}}
{"t":{"$date":"2023-08-04T09:14:09.387+00:00"},"s":"I",  "c":"REPL",     "id":21799,   "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"mongodb-rset-node-2:27017"}}
{"t":{"$date":"2023-08-04T09:14:09.388+00:00"},"s":"I",  "c":"REPL",     "id":21087,   "ctx":"BackgroundSync","msg":"Chose same sync source candidate as last time. Sleeping to avoid immediately choosing a new sync source for the same reason as last time","attr":{"syncSource":"mongodb-rset-node-2:27017","sleepDurationMillis":1000}}
{"t":{"$date":"2023-08-04T09:14:15.405+00:00"},"s":"I",  "c":"NETWORK",  "id":20124,   "ctx":"ReplCoordExtern-0","msg":"DBClientConnection failed to send message","attr":{"connString":"mongodb-rset-node-2:27017","error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:15.405+00:00"},"s":"I",  "c":"REPL",     "id":21275,   "ctx":"ReplCoordExtern-0","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1691140394,"i":78}},"t":28},"attemptsRemaining":1,"error":"HostUnreachable: ###"}}
{"t":{"$date":"2023-08-04T09:14:15.405+00:00"},"s":"I",  "c":"REPL",     "id":23437,   "ctx":"ReplCoordExtern-0","msg":"OplogFetcher reconnecting due to error","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"network error while attempting to run command 'isMaster' on host 'mongodb-rset-node-2:27017' "}}}
{"t":{"$date":"2023-08-04T09:14:15.406+00:00"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"ReplCoordExtern-0","msg":"Trying to reconnect","attr":{"connString":"mongodb-rset-node-2:27017 failed"}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"I",  "c":"NETWORK",  "id":20124,   "ctx":"ReplCoordExtern-0","msg":"DBClientConnection failed to send message","attr":{"connString":"mongodb-rset-node-2:27017","error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"I",  "c":"NETWORK",  "id":20121,   "ctx":"ReplCoordExtern-0","msg":"Reconnect attempt failed","attr":{"connString":"mongodb-rset-node-2:27017 failed","error":""}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"I",  "c":"REPL",     "id":21274,   "ctx":"ReplCoordExtern-0","msg":"Error returned from oplog query (no more query restarts left)","attr":{"error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"W",  "c":"REPL",     "id":21122,   "ctx":"BackgroundSync","msg":"Fetcher stopped querying remote oplog with error","attr":{"error":"SocketException: ###"}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"I",  "c":"REPL",     "id":21080,   "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"mongodb-rset-node-2:27017"}}
{"t":{"$date":"2023-08-04T09:14:20.423+00:00"},"s":"I",  "c":"REPL",     "id":21799,   "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"mongodb-rset-node-2:27017"}}
{"t":{"$date":"2023-08-04T09:14:20.424+00:00"},"s":"I",  "c":"REPL",     "id":21087,   "ctx":"BackgroundSync","msg":"Chose same sync source candidate as last time. Sleeping to avoid immediately choosing a new sync source for the same reason as last time","attr":{"syncSource":"mongodb-rset-node-2:27017","sleepDurationMillis":1000}}

Lag on the upgraded node:

rsSET:PRIMARY> rs.printSecondaryReplicationInfo()
source: mongodb-rset-node-1:27017
	syncedTo: Fri Aug 04 2023 09:15:22 GMT+0000 (UTC)
	0 secs (0 hrs) behind the primary
source: mongodb-rset-node-3:27017
	syncedTo: Fri Aug 04 2023 09:13:14 GMT+0000 (UTC)
	128 secs (0.04 hrs) behind the primary

So I still have the HMAC error mentioned in this post, and I cannot upgrade successfully because the OplogFetcher is not able to fetch oplogs from the Primary after the upgrade. Something seems odd about this. I already recreated the VMs, OS disk, data disks and NICs, and the behaviour is always the same in the situations described. There may be an issue with PSMDB. Are you able to help? Should I open a ticket on Percona's Jira for further investigation?

Note: I also tried to upgrade to 4.4.21-20 and got the same behaviour. Also, after downgrading the upgraded node to the version I was running before (4.4.9-10), the OplogFetcher was able to fetch the oplog again.
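
For reference, a quick way to watch which sync source each member is using while a node lags behind is something like this (a sketch based on the standard replica set status fields):

// Run in the mongo shell on any member of the replica set.
rs.status().members.forEach(function (m) {
    print(m.name, m.stateStr, "syncing from:", m.syncSourceHost);
});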

Kind regards,
João Soares

In my case it was a configuration problem in my cr.yaml:

replsets:
  - name: rs0
    configuration: |
      net:
        tls:
          mode: requireTLS
      security:
        clusterAuthMode: keyfile

I had added the last two lines (the security section) as a workaround when running without sharding, because my certificates did not yet have the fields needed.

I removed those lines and deployed a new instance, which fixed the problem.

Hi,

Thank you for sharing your resolution.
For this specific case, it didn’t help. :confused:

@Parag_Bhayani any news on this?

Kind regards,
João Soares

Hello @joaorpsoares

There are two timestamps in this error message:

MongoDB.Driver.MongoCommandException: Command aggregate failed: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1683718817, 1) } with id: 7195156194690859009

The first timestamp is epoch 1683718817: May 10, 2023 11:40:17 AM (UTC).
The second timestamp is encoded in the id 7195156194690859009, which is hex 0x63DA547000000001. The upper 32 bits of this value are 0x63DA5470 = epoch 1675252848 = February 1, 2023 12:00:48 PM (UTC).

The first timestamp describes when the key should be valid, and the second describes when the key was created. MongoDB internally generates these keys so that it always has keys covering the next 3 months. But in this case the difference is slightly more than 3 months, so the key lookup fails.
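
A minimal sketch of that decoding (plain JavaScript, runnable in mongosh), using the exact values from the error message:

// The first value is the requested time; the upper 32 bits of the key id
// encode the key's creation time.
const tsSecs = 1683718817;                              // { ts: Timestamp(1683718817, 1) }
const keyId = BigInt("7195156194690859009");            // id from the error message
const keyCreatedSecs = Number(keyId >> 32n);            // upper 32 bits -> 1675252848
print(new Date(tsSecs * 1000).toISOString());           // 2023-05-10T11:40:17.000Z
print(new Date(keyCreatedSecs * 1000).toISOString());   // 2023-02-01T12:00:48.000Z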

If my understanding is correct, you only observe these error messages after a restart of the VM, and only during a short period of 10-13 seconds. That probably means that:

  • the VM starts with an incorrect system time in the past;
  • mongod starts with that past system time and generates HMAC keys for it;
  • the system time inside the VM then jumps to the current value (in your example the jump is more than 3 months);
  • all keys in the cache become outdated until the cache is refreshed and new keys are created (the default refresh time is 30 seconds);
  • once the key cache is refreshed, all reads are OK again.

So if this theory about the system time changing inside the VM is correct, then to solve the issue you can do something like forcing a time synchronization in the VM before starting the mongod binaries.
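
To check whether this is what happened, one option (a sketch; admin.system.keys is the internal collection where MongoDB stores these signing keys) is to list the keys and their validity windows and compare them with the host's current time:

// Run in mongosh against the primary; the key material itself is projected out.
const adminDb = db.getSiblingDB("admin");
adminDb.system.keys.find({}, { key: 0 }).forEach(k => printjson(k));
printjson(adminDb.runCommand({ hostInfo: 1 }).system.currentTime);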