Description:
On Saturday 03/01, at around 15:53, the mongos pods of our deployment crashed on our Kubernetes cluster.
Percona operator: 1.9.1
Kubernetes version: 1.30.3
image: percona/percona-server-mongodb:6.0.9-7
The mongos pods became unavailable and stopped responding, so none of the applications communicating with them worked. The person on hand restarted the pods and the issue ceased. However, the outage was noticed too late, and since the database is deployed in a production environment, we need to investigate the issue to make sure it does not happen again.
Since this happened in production and we don’t know why, we haven’t been able to reproduce this issue.
These pods generate a tremendous amount of logs so I’ll just paste the most important ones:
Different user name was supplied to saslSupportedMechs
This message showed up sporadically on the cfg pods before the outage, but became the dominant error during the outage.
SSL peer certificate validation failed reason: certificate has expired
These logs only appeared when the outage happened. Our applications authenticate with SCRAM, not certificates.
The internal certificates are and were valid:
prod-01-ca-cert
    Validity
        Not Before: Aug 3 14:51:20 2024 GMT
        Not After : Aug 3 14:51:20 2025 GMT
prod-01-ssl
    Validity
        Not Before: Jan 30 14:51:21 2025 GMT
        Not After : Apr 30 14:51:21 2025 GMT
prod-01-ssl-internal
    Validity
        Not Before: Jan 30 14:51:21 2025 GMT
        Not After : Apr 30 14:51:21 2025 GMT
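To make the "are and were valid" claim easy to re-check, here is a minimal Python sketch that compares the validity windows listed above against the outage time. The year (2025) and the timezone (UTC) of the outage are assumptions on my side, since the timestamp above gives neither; the names `CERTS`, `OUTAGE`, `valid_at` and `check` are only illustrative.

```python
from datetime import datetime, timezone

# Format of the "Not Before" / "Not After" lines as pasted above.
FMT = "%b %d %H:%M:%S %Y GMT"

def parse(ts):
    """Parse a validity timestamp such as 'Aug 3 14:51:20 2024 GMT' as UTC."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)

# Validity windows copied from the certificate listing above.
CERTS = {
    "prod-01-ca-cert": ("Aug 3 14:51:20 2024 GMT", "Aug 3 14:51:20 2025 GMT"),
    "prod-01-ssl": ("Jan 30 14:51:21 2025 GMT", "Apr 30 14:51:21 2025 GMT"),
    "prod-01-ssl-internal": ("Jan 30 14:51:21 2025 GMT", "Apr 30 14:51:21 2025 GMT"),
}

# ASSUMPTION: the outage was on 2025-03-01 at 15:53 UTC. Adjust the year
# and timezone if the real values differ.
OUTAGE = datetime(2025, 3, 1, 15, 53, tzinfo=timezone.utc)

def valid_at(moment, not_before, not_after):
    """True if `moment` lies inside the certificate's validity window."""
    return parse(not_before) <= moment <= parse(not_after)

def check(moment):
    """Map each certificate name to whether it was valid at `moment`."""
    return {name: valid_at(moment, *window) for name, window in CERTS.items()}

for name, ok in check(OUTAGE).items():
    print(name, "valid" if ok else "NOT valid")
```

This only checks the dates quoted above. It says nothing about which certificate a running process actually had loaded at the time.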
These logs were found in the mongos pod after it became unavailable:
Dropping all pooled connections AuthenticationFailed: No verified subject name available from client
Error while attempting to write this node's uptime to config.mongos No verified subject name available from client
Failed to refresh readConcern/writeConcern defaults from config server No verified subject name available from client
After the outage was discovered, the following could also be seen in the shard pods:
Error running periodic reload of shard registry No verified subject name available from client
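Since the log volume is large, a per-minute tally of the messages quoted above can show when each one started and which one dominated. The following is a rough Python sketch, assuming the pods write MongoDB's structured JSON log format (one JSON object per line, with the timestamp under `t.$date`). The labels in `PATTERNS` and the function name `tally_by_minute` are hypothetical, picked just for this example.

```python
import json
from collections import Counter

# Substrings taken from the messages quoted in this post; the labels on
# the left are made up for readability.
PATTERNS = {
    "sasl_user_mismatch": "Different user name was supplied to saslSupportedMechs",
    "cert_validation_failed": "SSL peer certificate validation failed",
    "no_subject_name": "No verified subject name available from client",
}

def tally_by_minute(lines):
    """Count matching log lines per (minute, label).

    Lines that are not JSON objects with a t.$date string are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
            # e.g. "2025-03-01T14:53:01.123+00:00" -> "2025-03-01T14:53"
            minute = entry["t"]["$date"][:16]
        except (ValueError, KeyError, TypeError):
            continue
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[(minute, label)] += 1
    return counts
```

It can be fed an open file of saved pod logs, for example `tally_by_minute(open("mongos.log"))` where `mongos.log` is a placeholder file name, and the result can be sorted by key to read it as a timeline.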
The error has not happened since, and I did not find anything that would lead me to a conclusion. If you need any more information, I’ll be more than happy to provide it.