MongoDB fails to start due to WiredTiger error

I have a MongoDB deployment with a three-member replica set. Suddenly, one of the members crashed for unknown reasons while the others kept working fine.

After the crash, the affected member was not able to start; it logged a WiredTiger error as the reason:

{"t":{"$date":"2024-03-02T21:50:34.955+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1709416234,"ts_usec":955926,"thread":"1:0x7fbb49e8fc40","session_dhandle_name":"file:index-41--2757559245873504977.wt","session_name":"WT_SESSION.open_cursor","category":"WT_VERB_EXTENSION","category_id":14,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"libcrypto: error:06065064:digital envelope routines:EVP_DecryptFinal_ex:bad decrypt:crypto/evp/evp_enc.c:643:\n"}}}
{"t":{"$date":"2024-03-02T21:50:34.956+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1709416234,"ts_usec":956000,"thread":"1:0x7fbb49e8fc40","session_dhandle_name":"file:index-41--2757559245873504977.wt","session_name":"WT_SESSION.open_cursor","category":"WT_VERB_EXTENSION","category_id":14,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"setting return code to WT_PANIC"}}}
{"t":{"$date":"2024-03-02T21:50:34.956+00:00"},"s":"E",  "c":"WT",       "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error message","attr":{"error":-31804,"message":{"ts_sec":1709416234,"ts_usec":956018,"thread":"1:0x7fbb49e8fc40","session_dhandle_name":"file:index-41--2757559245873504977.wt","session_name":"WT_SESSION.open_cursor","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_btree_tree_open:639:unable to read root page from file:index-41--2757559245873504977.wt","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804}}}

As you can see, it reports `unable to read root page from file:index-41--2757559245873504977.wt` as the error message. I thought the data had become corrupted, so here is what I did: I created a completely new file share, attached it to the member, and let replication fill it with data. Once that finished, the same error appeared again (I no longer have the original error, so the log above comes from the new data).
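For context, the resync I attempted looked roughly like this; the paths and service name are placeholders, not my actual setup:

```shell
# Illustrative sketch of the resync (paths/service names are placeholders).

# 1. Stop the broken member.
sudo systemctl stop mongod

# 2. Point it at an empty data directory (here: the freshly mounted share),
#    keeping the old directory aside instead of deleting it.
sudo mv /var/lib/mongodb /var/lib/mongodb.broken
sudo mkdir -p /var/lib/mongodb
sudo chown mongodb:mongodb /var/lib/mongodb

# 3. Restart; with an empty dbPath the member performs an initial sync
#    from the healthy replica set members.
sudo systemctl start mongod

# 4. Watch the sync progress from another member.
mongosh --eval 'rs.status().members.forEach(m => print(m.name, m.stateStr))'
```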

What are my options for fixing this problem? It seems the data is somehow corrupted and I cannot replicate it successfully. I have tried deleting that index file, but of course that didn't work. Should I run a repair? What is your opinion?

Hey Gvidas,
If I understand correctly, you still have two nodes working without any issues.
Here is what you can try:

  • Take a snapshot from one of the working nodes and restore it somewhere, then:
    • restart it to see if anything goes wrong;
    • re-index the affected collection;
    • run a repair.
  • Check the dmesg output on the dead node for possible hardware issues.

Re-indexing should work in your case, but since you mentioned the error messages might differ, it may not be enough.
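The steps above can be sketched roughly as follows. Database, collection, and path names are placeholders; also note that `reIndex()` is deprecated on recent MongoDB versions, so check the docs for your version first:

```shell
# Illustrative sketch of the suggested checks (names and paths are placeholders).

# Re-index the affected collection on the restored snapshot.
mongosh mydb --eval 'db.mycoll.reIndex()'

# Run a standalone repair. Only do this with mongod stopped,
# and ideally against a copy of the data files, not the originals.
mongod --dbpath /data/db --repair

# Scan the kernel log on the dead node for disk/hardware errors.
dmesg -T | grep -iE 'i/o error|hardware|ata|nvme'
```

Running the repair on a copy matters because `--repair` discards data it cannot salvage; if it makes things worse, you still have the snapshot to fall back on.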