PostgreSQL fails to trust Kubernetes API certificate

When I deploy Postgres on a kops cluster, calls to the Kubernetes API fail because the client does not trust the Kubernetes API certificate. How can I make Percona Postgres trust my Kubernetes API certificate?

Here is the error I am getting:

2022-07-06 20:31:11,485 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:897)'),)': /api/v1/namespaces/pg3/pods?labelSelector=vendor%3Dcrunchydata%2Ccrunchy-pgha-scope%3Dpostgres
2022-07-06 20:31:11,487 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:897)'),)': /api/v1/namespaces/pg3/configmaps?labelSelector=vendor%3Dcrunchydata%2Ccrunchy-pgha-scope%3Dpostgres
2022-07-06 20:31:11,494 ERROR: Request to server https://100.64.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='100.64.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/pg3/pods?labelSelector=vendor%3Dcrunchydata%2Ccrunchy-pgha-scope%3Dpostgres (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:897)'),))",)
2022-07-06 20:31:11,495 ERROR: Request to server https://100.64.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='100.64.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/pg3/configmaps?labelSelector=vendor%3Dcrunchydata%2Ccrunchy-pgha-scope%3Dpostgres (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:897)'),))",)
2022-07-06 20:31:11,804 ERROR: get_cluster
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 704, in _load_cluster
self._wait_caches(stop_time)
File "/usr/lib/python3.6/site-packages/patroni/dcs/kubernetes.py", line 696, in _wait_caches
raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2022-07-06 20:31:11,804 WARNING: Can not get cluster from dcs
2022-07-06 20:31:12,495 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2022-07-06 20:31:12,496 ERROR: ObjectCache.run TypeError("unsupported operand type(s) for -: 'NoneType' and 'float'",)
2022-07-06 20:31:12,502 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:897)'),)': /api/v1/namespaces/pg3/pods?labelSelector=vendor%3Dcrunchydata%2Ccrunchy-pgha-scope%3Dpostgres

I believe the issue is that the following certificate authority is not trusted. Why is Percona not trusting the Kubernetes certificate authority? That seems like a basic requirement to me. The following certificate should be trusted:

/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

As a basic example, if I call the Kubernetes API with curl, it fails with a certificate error. This is expected, because pods do not trust the Kubernetes certificate authority by default.

curl https://kubernetes.default
curl: (60) SSL certificate problem: self signed certificate in certificate chain
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

In order to call Kubernetes APIs, I need to explicitly trust the Kubernetes certificate authority. With curl I can do this as in the following example:

curl https://kubernetes.default --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
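Python-based clients inside the pod face the same choice. As a minimal illustrative sketch (not Patroni's actual code), this is roughly how a Python client would opt in to trusting that CA file:

```python
import ssl

# Standard in-cluster service-account CA path, the same file curl used above.
CA_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

def make_verified_context(ca_file=None):
    # create_default_context enables certificate and hostname verification.
    # Passing cafile makes the context trust exactly that CA bundle,
    # mirroring `curl --cacert`; with None it falls back to system CAs.
    return ssl.create_default_context(cafile=ca_file)

# Inside a pod you would call make_verified_context(CA_FILE) and pass the
# context to the HTTPS client; without it, verification fails as shown above.
```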

The fact that Percona's calls to the Kubernetes API do not trust the /var/run/secrets/kubernetes.io/serviceaccount/ca.crt certificate is causing big problems for certain cluster setups, specifically in my case kops clusters.

What steps need to be taken to ensure Percona's API calls to Kubernetes trust this certificate authority?


I cannot copy the certificate into place because readOnlyRootFilesystem is enabled. Can someone on the Percona team please address this issue? This is a critical blocker for several people.


Hi @Clay_Risser,

I spent some time trying to understand the problem you described, but I still have a hard time reproducing it. Usually, tools that work with the Kubernetes API server add this service account CA certificate to their trust pools. For example:

client-go:

func InClusterConfig() (*Config, error) {
	const (
		tokenFile  = "/var/run/secrets/kubernetes.io/serviceaccount/token"
		rootCAFile = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
	)
	host, port := os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT")
	if len(host) == 0 || len(port) == 0 {
		return nil, ErrNotInCluster
	}

	token, err := ioutil.ReadFile(tokenFile)
	if err != nil {
		return nil, err
	}

	tlsClientConfig := TLSClientConfig{}

	if _, err := certutil.NewPool(rootCAFile); err != nil {
		klog.Errorf("Expected to load root CA config from %s, but got err: %v", rootCAFile, err)
	} else {
		tlsClientConfig.CAFile = rootCAFile
	}

	return &Config{
		// TODO: switch to using cluster DNS.
		Host:            "https://" + net.JoinHostPort(host, port),
		TLSClientConfig: tlsClientConfig,
		BearerToken:     string(token),
		BearerTokenFile: tokenFile,
	}, nil
}

patroni:

    def load_incluster_config(self, ca_certs=SERVICE_CERT_FILENAME,
                              token_refresh_interval=datetime.timedelta(minutes=1)):
        if SERVICE_HOST_ENV_NAME not in os.environ or SERVICE_PORT_ENV_NAME not in os.environ:
            raise self.ConfigException('Service host/port is not set.')
        if not os.environ[SERVICE_HOST_ENV_NAME] or not os.environ[SERVICE_PORT_ENV_NAME]:
            raise self.ConfigException('Service host/port is set but empty.')

        if not os.path.isfile(ca_certs):
            raise self.ConfigException('Service certificate file does not exists.')
        with open(ca_certs) as f:
            if not f.read():
                raise self.ConfigException('Cert file exists but empty.')
        self.pool_config['ca_certs'] = ca_certs
        self._token_refresh_interval = token_refresh_interval
        token = self._read_token_file()
        self._make_headers(token=token)
        self._server = uri('https', (os.environ[SERVICE_HOST_ENV_NAME], os.environ[SERVICE_PORT_ENV_NAME]))

It feels like something might be wrong in your cluster. Could you share more details about your setup? Why do you think clusters installed by kops are prone to such errors?


Hmm, I’m not sure. I’ll try and dig into this more. If you have any suggestions, please let me know. I believe my kops cluster is set up correctly, because the following command runs with no errors:

curl https://kubernetes.default --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

@Ege_Gunes I just noticed you were looking at the wrong files. Make sure you’re looking at Patroni 2.1.1, because that is the Patroni version the Percona Postgres Operator 1.2.0 is using.


Here is exactly where it is failing.


Seems like the following issue is closely related.


The following issue is also related.


One more related issue.


@Ege_Gunes I figured out that the reason it’s not working is that Python does not trust intermediate certificates, and the CA that kops generates is an intermediate certificate.

Anyone have any idea how to prevent kops from creating intermediate certificates?
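One way to see whether a pod is affected is to check how many PEM certificates the service-account ca.crt actually contains. A hypothetical diagnostic sketch (the path is the standard in-cluster location; this is not part of Patroni or the operator):

```python
# Hypothetical diagnostic: count PEM certificate blocks in a CA bundle.
# If ca.crt contains a single certificate and that certificate is an
# intermediate CA (as with kops here), clients that cannot treat an
# intermediate as a trust anchor will fail to verify the API server.
def count_pem_certs(pem_text: str) -> int:
    return pem_text.count("-----BEGIN CERTIFICATE-----")

# Usage inside a pod:
# with open("/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") as f:
#     print(count_pem_certs(f.read()))
```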


@Ege_Gunes I’ve just installed the Percona PG Operator 2.2.0 and then created a cluster with the pg-db Helm chart. The instance fails to start because it fails to verify the Kubernetes API certificate.
My Kubernetes cluster uses an intermediate CA, so I think the problem is the same one seen in this thread.
I’m wondering: if I can provide the full CA chain, will it work? I haven’t seen an option in the CRD to mount a volume into the PG pods. Can you suggest how to solve this issue?
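For what it’s worth, if a custom bundle can be mounted somewhere, building a full-chain bundle is just concatenating the PEM files. A sketch with hypothetical file names for the intermediate and root CA certificates:

```python
def build_ca_bundle(cert_paths, out_path):
    # Concatenate PEM files (e.g. intermediate CA followed by root CA)
    # into a single bundle, so verifiers can build the complete chain.
    with open(out_path, "w") as out:
        for path in cert_paths:
            with open(path) as f:
                # Normalize trailing newlines so blocks don't run together.
                out.write(f.read().rstrip("\n") + "\n")

# Hypothetical usage:
# build_ca_bundle(["intermediate-ca.crt", "root-ca.crt"], "full-chain.crt")
```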