Sudden `mongos` crash with no recovery

Description:

On Saturday, 03/01, at around 15:53, the mongos pods of our deployment crashed on our Kubernetes cluster.

Percona operator: 1.9.1
Kubernetes version: 1.30.3
image: percona/percona-server-mongodb:6.0.9-7

The mongos pods became unavailable and stopped responding, so none of the applications communicating with them worked. The person on hand restarted the pods and the issue ceased. However, the outage was noticed too late, and since the database is deployed in a production environment, we need to investigate to make sure it does not happen again.

Since this happened in production and we don’t know why, we haven’t been able to reproduce this issue.

These pods generate a tremendous amount of logs so I’ll just paste the most important ones:

Different user name was supplied to saslSupportedMechs
This message showed up sporadically on the cfg pods before the outage, but became the dominant error during it.

SSL peer certificate validation failed reason: certificate has expired
These messages only appeared during the outage. Our applications authenticate with SCRAM, not certificates.

The internal certificates are and were valid:

prod-01-ca-cert
    Validity
        Not Before: Aug  3 14:51:20 2024 GMT
        Not After : Aug  3 14:51:20 2025 GMT

prod-01-ssl
    Validity
        Not Before: Jan 30 14:51:21 2025 GMT
        Not After : Apr 30 14:51:21 2025 GMT

prod-01-ssl-internal
    Validity
        Not Before: Jan 30 14:51:21 2025 GMT
        Not After : Apr 30 14:51:21 2025 GMT
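
The validity windows listed above can be checked against a given instant with a short script. This is only a minimal sketch: the outage timestamp is an assumption (it reads 03/01 as 1 March 2025 and treats 15:53 as UTC, neither of which is stated in the post), and the names `CERTS`, `parse_openssl_time`, `valid_at` and `ASSUMED_OUTAGE` are made up for illustration. It only tests the dates pasted here, not whichever certificate a running process actually had loaded at the time.

```python
from datetime import datetime, timezone

# Validity windows copied from the certificate dumps in the post (all GMT).
CERTS = {
    "prod-01-ca-cert": ("Aug  3 14:51:20 2024 GMT", "Aug  3 14:51:20 2025 GMT"),
    "prod-01-ssl": ("Jan 30 14:51:21 2025 GMT", "Apr 30 14:51:21 2025 GMT"),
    "prod-01-ssl-internal": ("Jan 30 14:51:21 2025 GMT", "Apr 30 14:51:21 2025 GMT"),
}


def parse_openssl_time(text):
    """Parse an openssl-style 'Aug  3 14:51:20 2024 GMT' timestamp as UTC."""
    collapsed = " ".join(text.split())  # normalise the padded day field
    parsed = datetime.strptime(collapsed, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def valid_at(instant):
    """Return {certificate name: True if instant lies inside its window}."""
    return {
        name: parse_openssl_time(not_before) <= instant <= parse_openssl_time(not_after)
        for name, (not_before, not_after) in CERTS.items()
    }


# ASSUMPTION: year and time zone are guesses, adjust to the real outage time.
ASSUMED_OUTAGE = datetime(2025, 3, 1, 15, 53, tzinfo=timezone.utc)

if __name__ == "__main__":
    for name, ok in valid_at(ASSUMED_OUTAGE).items():
        print(f"{name}: {'valid' if ok else 'NOT valid'} at {ASSUMED_OUTAGE.isoformat()}")
```

Passing a different `datetime` to `valid_at` makes it easy to test other candidate times, for example the same wall-clock time in a local time zone.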

These logs were found in the mongos pod after it became unavailable:

Dropping all pooled connections AuthenticationFailed: No verified subject name available from client
Error while attempting to write this node's uptime to config.mongos No verified subject name available from client
Failed to refresh readConcern/writeConcern defaults from config server No verified subject name available from client

After discovering the outage, we also saw this in the shard pods:
Error running periodic reload of shard registry No verified subject name available from client
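
Because the pods produce so many logs, one way to see when each of the messages quoted in this post started and which one dominated is to count them per minute. The sketch below assumes the logs are in MongoDB's structured JSON format, one object per line with the timestamp under `t.$date`, and it matches the quoted messages as plain substrings of the raw line rather than relying on a specific field. `PATTERNS` and `tally_per_minute` are illustrative names, not part of any tool.

```python
import json
from collections import Counter

# Message fragments quoted in this post; matched as substrings of each raw line.
PATTERNS = [
    "Different user name was supplied to saslSupportedMechs",
    "SSL peer certificate validation failed",
    "No verified subject name available from client",
]


def tally_per_minute(lines):
    """Count pattern hits per (minute, pattern) from JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
            # ASSUMPTION: timestamp is an ISO-8601 string under t.$date;
            # the first 16 characters are 'YYYY-MM-DDTHH:MM'.
            minute = entry["t"]["$date"][:16]
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that are not structured log entries
        for pattern in PATTERNS:
            if pattern in line:
                counts[(minute, pattern)] += 1
    return counts
```

The function accepts any iterable of lines, so it can be fed an open log file or `sys.stdin` when the output of `kubectl logs` for a pod is piped into a script; sorting the resulting keys gives a rough timeline of the errors.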

The error has not happened since, and I did not find anything that would lead me to a conclusion. If you need any more information, I’ll be more than happy to oblige.

Hi @taglas_tamas, thanks for sharing your experiences. I’m sorry you and your team had some outages at the production instance. I think you haven’t asked any question in your post - what’s your expectation from the community on this?

Thank you for your response,

My primary goal is to ask whether anyone has dealt with a similar issue, and whether there is something I can do either to prevent this from happening again or to gain more insight into what the root cause could be.

Of course, there is always the option to just consider this a fluke, and hope that the newest version has dealt with this, but if there is something misconfigured that can cause an outage like we experienced, I have to consider that a priority and fix it posthaste.