PMM Server deployed on k8s - failed to establish two-way communication channel

Hi guys,

Can the PMM Server be deployed on k8s?
I'm trying to deploy PMM Server on k8s, but I'm seeing many errors on the client side, and the server doesn't show the client as a source on the dashboard.

pmm-admin config --server-insecure-tls --server-url=https://:@:443/ --force --debug --trace

DEBUG 2023-06-09 15:11:30.920271064Z: github/percona/pmm/admin/commands.(*ConfigCommand).RunCmd() Running: pmm-agent --server-address=:443 --server-username=admin --server-password= --listen-port=7777 --server-insecure-tls --log-level=warn --debug --trace --log-lines-count=1024 setup --force --metrics-mode=auto
DEBUG 2023-06-09 15:11:34.517475882Z: /tmp/go/src/github/percona/pmm/admin/cli/cli.go:130 github/percona/pmm/admin/cli.printResponse() Result: &commands.configResult{Warning:"", Output:"Checking local pmm-agent status...\npmm-agent is running.\nRegistering pmm-agent on PMM Server...\nRegistered.\nConfiguration file /usr/local/percona/pmm2/config/pmm-agent.yaml updated.\nReloading pmm-agent configuration...\nConfiguration reloaded.\nChecking local pmm-agent status...\npmm-agent is running."}
DEBUG 2023-06-09 15:11:34.517518075Z: /tmp/go/src/github/percona/pmm/admin/cli/cli.go:131 github/percona/pmm/admin/cli.printResponse() Error:
Checking local pmm-agent status...
pmm-agent is running.
Registering pmm-agent on PMM Server...
Registered.
Configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml updated.
Reloading pmm-agent configuration...
Configuration reloaded.
Checking local pmm-agent status...
pmm-agent is running.

Client errors:

systemctl status pmm-agent

Jun 07 15:30:49 pmm-agent[99132]: ERRO[2023-06-07T15:30:49.875-03:00]client/client.go:789 client.dial Failed to establish two-way communication channel: context canceled. component=client
Jun 07 15:31:38 pmm-agent[68057]: TRAC[2023-06-07T15:31:38.437-03:00]config/logger.go:37 config.(*gRPCLogger).Infoln [core] [Channel #16] Channel Connectivity change to SHUTDOWN component=grpclog
Jun 07 15:31:38 pmm-agent[68057]: TRAC[2023-06-07T15:31:38.437-03:00]config/logger.go:37 config.(*gRPCLogger).Infoln [channelz] attempt to delete child with id 17 from a parent (id=16) that doesn’t currently exist component=grpclog

Client Info:

pmm-admin --version

ProjectName: pmm-admin
Version: 2.37.0
PMMVersion: 2.37.0
Timestamp: 2023-04-25 10:31:27 (UTC)
FullCommit: f85fa9ac545dc3ab2d14a523b6160ba5b01061bc

Server Info:

NAME  NAMESPACE  REVISION  UPDATED                                  STATUS    CHART      APP VERSION
pmm   percona    9         2023-05-25 21:21:44.631600302 -0300 -03  deployed  pmm-1.2.3  2.37.0

k8s version: 1.21.14.

curl -X GET --dump-header /dev/stdout https:///v1/readyz

HTTP/1.1 200 OK
Date: Fri, 09 Jun 2023 16:11:37 GMT
Content-Type: application/json
Content-Length: 2
Connection: keep-alive
Grpc-Metadata-Content-Type: application/grpc
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Cache-control: no-cache
Pragma: no-cache
Strict-Transport-Security: max-age=15724800; includeSubDomains

echo -n . | openssl s_client -connect :443 | openssl x509 -noout -dates -subject -issuer

depth=3 O = Digital Signature Trust Co., CN = DST Root CA X3
verify error:num=10:certificate has expired
notAfter=Sep 30 14:01:15 2021 GMT
DONE
notBefore=Jun 4 11:51:13 2023 GMT
notAfter=Sep 2 11:51:12 2023 GMT
subject= /CN=*.domain.cloud
issuer= /C=US/O=Let’s Encrypt/CN=R3

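The server certificate is current (valid until Sep 2, 2023). The "certificate has expired" error above refers to the old DST Root CA X3 root at depth 3 of the chain, not to the server certificate itself.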

There are no firewall blocks.
The client can telnet to the PMM Server on port 443.
The PMM Server can telnet to the client on ports 4200x.

Any idea what is going on?
Can the PMM Server really be deployed on k8s?

P.S. My tests on a virtual machine (WITHOUT k8s) worked successfully.

Hey @bruno_aleon - thanks for asking!
Yes, it can be deployed on k8s.
We recommend deploying PMM Server through a Helm chart.

Then it is business as usual.

The problem you are seeing can be caused by various network or proxy issues. Could you please share a diagram of the end state? Where is the client, how does it connect to the server, and so on? Are you using an ingress? Where do you terminate TLS?

Hi Sergey,

Thank you very much for the quick response and patience.

Yes, my deployment was through a helm chart and we are using INGRESS. Could this be a problem?

I ask because I found this ticket in Jira, but I don't know whether it is related to my problem:
https://jira.percona.com/browse/PMM-11872

My connection flow is basically POD → SVC → INGRESS.
TLS uses a Let's Encrypt certificate.

PMM Server runs in a single pod, and our databases are distributed across virtual machines in another network, which we have already opened on the firewall according to the prerequisites.

Telnet tests successfully pass through the firewall. For example, telnet from PMM Server (POD) to “27p” (client in a virtual machine):

There is no proxy between environments.

I guess it is due to the gRPC protocol.
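
A telnet test only confirms that a TCP connection can be opened, so it cannot rule out a gRPC problem at the ingress that sits between the agent and PMM Server. For reference, here is a minimal Python sketch of what such a check actually verifies. The function name tcp_reachable and the command-line usage are my own for illustration and are not part of any PMM tooling.

```python
import socket
import sys


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened.

    This is the same level of check as telnet: it says nothing about
    TLS, HTTP/2, or gRPC behaviour on top of the connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Usage: python check_tcp.py <host> <port>
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        print(host, port, tcp_reachable(host, port))
```

If this returns True for port 443 while pmm-agent still logs "Failed to establish two-way communication channel", the next place to look is how the ingress routes the gRPC calls rather than the firewall.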

@steve.hoffman @Roma_Novikov I remember you telling me about probable issues with gRPC.

I had the same issue and was able to work around it by adding an annotation to the Helm chart's ingress values:

  community:
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"

The nginx.ingress.kubernetes.io/use-regex: "true" annotation was needed so that the ingress controller uses regex location blocks for '/agent.', '/inventory.', etc. Without it, calls to these URLs do not match the location blocks in the ingress controller and instead hit the default location / and its non-gRPC upstream.
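
To illustrate why the use-regex annotation makes a difference, here is a minimal Python sketch of the two matching modes. It is a simplified model, not the ingress controller's actual code. It assumes that without the annotation the paths are handled as Kubernetes Prefix paths, which are compared element by element after splitting on "/". The request path /agent.Agent/Connect, the upstream labels, and the helper names are assumptions for illustration.

```python
import re


def prefix_match(rule: str, request_path: str) -> bool:
    """Prefix-style match: compare whole path elements split on '/'."""
    rule_parts = [p for p in rule.split("/") if p]
    path_parts = [p for p in request_path.split("/") if p]
    return path_parts[:len(rule_parts)] == rule_parts


def regex_match(rule: str, request_path: str) -> bool:
    """Regex-style match: the rule is a pattern anchored at the path start."""
    return re.match(rule, request_path) is not None


def route(request_path: str, grpc_rules: list, use_regex: bool) -> str:
    """Return which upstream (hypothetical labels) would serve the request."""
    matcher = regex_match if use_regex else prefix_match
    for rule in grpc_rules:
        if matcher(rule, request_path):
            return "grpc-upstream"
    # Nothing matched, so the catch-all location "/" handles the request.
    return "default-upstream"


if __name__ == "__main__":
    rules = ["/agent.", "/inventory."]
    request = "/agent.Agent/Connect"  # assumed gRPC path, for illustration
    print("use-regex off:", route(request, rules, use_regex=False))
    print("use-regex on: ", route(request, rules, use_regex=True))
```

In this model the first path element of the request is "agent.Agent", which is not equal to the rule element "agent.", so the prefix mode falls through to the default upstream, while the regex mode matches and reaches the gRPC upstream.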