pmm-agent cannot connect to pmm-server when using a reverse proxy

What I’m trying to achieve:

  • Run pmm server in docker container on "server", expose port 443 as 127.0.0.1:8443
  • Use nginx as reverse proxy for 127.0.0.1:8443. This way nginx can handle TLS using certbot and also serve other domains.
  • Connect pmm-agent running on "client" to "server"

What I’ve done:

deployed pmm-server image on docker like this:

docker run --detach --restart always --publish 127.0.0.1:8443:443 --volumes-from pmm-data --name pmm-server percona/pmm-server:2

configured nginx site:

server {
    server_name server.domain;
    location / {
        include /etc/nginx/proxy_params;
        proxy_pass https://127.0.0.1:8443;
    }

    listen [::]:443 ssl ipv6only=on; # managed by Certbot
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/server.domain/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/server.domain/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

}
server {
    if ($host = server.domain) {
        return 301 https://$host$request_uri;
    } # managed by Certbot
    listen 80;
    listen [::]:80;
    server_name server.domain;
    return 404; # managed by Certbot
}

proxy_params:

proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;

configured client:

pmm-admin config --server-url=https://user:pass@server.domain:443 server_ip generic server

What happens:

pmm-admin status:

"Failed to get PMM Agent status from local pmm-agent: pmm-agent is not connected to PMM Server."

pmm-agent --debug:

ERRO[2020-12-23T12:33:12.822+01:00] Failed to connect to server.domain:443: timeout.  component=client
INFO[2020-12-23T12:33:13.605+01:00] Connecting to https://user:***@server.domain:443/ ...  component=client

What I’ve tried:

If I disable nginx, publish the container on 0.0.0.0:443->443, and use --server-insecure-tls, pmm-agent connects without any problem.

Hi Andrej,

Can you telnet to “server.domain:443” to see if it actually connects? This does not look like a PMM problem but rather a configuration problem, meaning that connections to server.domain:443 fail. Nginx may not be configured properly, or there might be a firewall in between.


Have you configured your PMM server with a custom cert? You’ve got multiple layers of encryption going on in this double-proxy config (your external nginx and PMM’s own nginx). Assuming you’ve left the default certs in place, your external nginx has to trust the upstream cert or the connection will fail: it must decrypt the traffic from PMM’s nginx and then re-encrypt it with your Let’s Encrypt cert before serving it to the client. I think you will need to preinstall the cert/key on your nginx in the “location” stanza. Just be aware this IS the definition of a “man in the middle”, so make sure this machine is secured, as it’s a central point for sniffing creds.

Here are the docs covering the various options for an upstream service that is encrypted using HTTPS.

edit (forgot link)


Hi Peter,

When connecting to server.domain from my PC I can see Grafana, log in, and see pmm-server stats without any problem. Just to be sure that the client can connect to the server as well, I tested it with telnet and curl, and both can connect to server.domain. However, I noticed that I can’t access /prometheus but can access /graph and /qan without any problem. When I try to access /prometheus from my PC after logging into Grafana, I get this error message:

remoteAddr: "127.0.0.1:46560", X-Forwarded-For: "ip of my PC"; unsupported path requested: "/"

Also, to address Steve’s comment: nginx accepts self-signed certificates because proxy_ssl_verify defaults to off. I’m aware of what this means, and even though the pmm-server container runs on the same machine as my nginx reverse proxy and is only exposed on localhost, I’ll make sure to configure custom certs once I get this working.
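
As a rough sketch of what turning upstream verification on could look like once custom certs are in place (the CA bundle path below is a hypothetical placeholder, and the name passed to proxy_ssl_name is assumed to match whatever certificate ends up installed in pmm-server):

```nginx
location / {
    include /etc/nginx/proxy_params;
    proxy_pass https://127.0.0.1:8443;

    # Verify the certificate presented by the nginx inside the PMM container.
    # This is off by default, which is why a self-signed cert is accepted.
    proxy_ssl_verify on;

    # Hypothetical path: the CA bundle (or the self-signed cert itself) to trust.
    proxy_ssl_trusted_certificate /etc/nginx/pmm-upstream-ca.pem;

    # Name the upstream certificate is checked against; assumed to match
    # the CN/SAN of the certificate installed in pmm-server.
    proxy_ssl_name server.domain;
}
```

Locations that use grpc_pass have their own equivalents of these directives (grpc_ssl_verify, grpc_ssl_trusted_certificate, grpc_ssl_name).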


Are you running 2.12.0? In that version we replaced Prometheus with VictoriaMetrics, so the default /prometheus landing page no longer exists. Many of the endpoints still apply in this version (/prometheus/targets/, /prometheus/rules/); you can see all the details here. But that only explains why /prometheus doesn’t work.

I’m still trying to think about why the client wouldn’t work exactly the way your manual tests do… is there really nothing technically different between what you’ve done by hand and what the client does? My mind keeps going to timeout values between client <-> external nginx <-> PMM nginx. It feels like the connection should just re-establish itself, but I’ve seen stranger things with Apache’s reverse proxy, where all the configurations had to be in harmony.

Am I correct to assume the pmm-admin config command succeeds in registering the node, but as soon as you run pmm-admin status it shows as disconnected? Is that an instant failure, or does it happen after, say, 5, 15 or 100 (I think the nginx default) seconds?


Yes I’m running 2.12.0 so that explains that.

You are correct in assuming that pmm-admin config runs successfully:

Checking local pmm-agent status...
pmm-agent is running.
Registering pmm-agent on PMM Server...
Registered.
Configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml updated.
Reloading pmm-agent configuration...
Configuration reloaded.
Checking local pmm-agent status...
pmm-agent is running.

Running pmm-admin status shows this:

"Failed to get PMM Agent status from local pmm-agent: pmm-agent is not connected to PMM Server."

Running pmm-agent --debug --config-file=/usr/local/percona/pmm2/config/pmm-agent.yaml shows that the timeout happens exactly 5s after connection attempt:

INFO[2020-12-28T18:14:30.697+01:00] Connecting to https://user:***@server.domain:443/ ...  component=client
ERRO[2020-12-28T18:14:35.697+01:00] Failed to connect to server.domain:443: timeout.  component=client
INFO[2020-12-28T18:14:37.108+01:00] Connecting to https://user:***@server.domain:443/ ...  component=client
ERRO[2020-12-28T18:14:42.109+01:00] Failed to connect to server.domain:443: timeout.  component=client
INFO[2020-12-28T18:14:44.025+01:00] Connecting to https://user:***@server.domain:443/ ...  component=client
ERRO[2020-12-28T18:14:49.025+01:00] Failed to connect to server.domain:443: timeout.  component=client

Hi @Andrej

We use gRPC for communication between pmm-agent and PMM Server, and a plain reverse proxy may not work in that case.

You should probably pass gRPC requests through grpc_pass, like we do here: https://github.com/percona/pmm-server/blob/PMM-2.0/nginx.conf#L160

List of locations that use gRPC:

  • /agent.
  • /inventory.
  • /management.
  • /server.


Hi @nurlan,
Thanks for setting me on the right path.

First, I fixed the timeout issue by adding the http2 parameter to the listen directive, like this:

listen 443 ssl http2;

After this change the client started showing this error:

DEBU[2020-12-29T13:27:04.528+01:00] Sending message (4 bytes): id:1  ping:{}.     component=channel
DEBU[2020-12-29T13:27:09.348+01:00] Closing with error: rpc error: code = Canceled desc = context canceled
failed to receive message
github.com/percona/pmm-agent/client/channel.(*Channel).runReceiver
        /tmp/go/src/github.com/percona/pmm-agent/client/channel/channel.go:199
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1373  component=channel
DEBU[2020-12-29T13:27:09.348+01:00] Exiting receiver goroutine.                   component=channel
ERRO[2020-12-29T13:27:09.348+01:00] Failed to establish two-way communication channel: context canceled.  component=client
DEBU[2020-12-29T13:27:09.348+01:00] Connection closed.                            component=client

So next I added:

upstream managed-grpc {
  server 127.0.0.1:8443;
  keepalive 32;
}

and

location /agent. {
    grpc_pass grpc://managed-grpc;
    client_max_body_size 0;
}
location /inventory. {
    grpc_pass grpc://managed-grpc;
}
location /management. {
    grpc_pass grpc://managed-grpc;
}
location /server. {
    grpc_pass grpc://managed-grpc;
}

This resulted in:

DEBU[2020-12-29T13:34:42.453+01:00] Sending message (4 bytes): id:1  ping:{}.     component=channel
DEBU[2020-12-29T13:34:42.495+01:00] Closing with error: rpc error: code = Unavailable desc = Bad Gateway: HTTP status code 502; transport: received the unexpected content-type "text/html"
failed to receive message
github.com/percona/pmm-agent/client/channel.(*Channel).runReceiver
        /tmp/go/src/github.com/percona/pmm-agent/client/channel/channel.go:199
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1373  component=channel
DEBU[2020-12-29T13:34:42.495+01:00] Exiting receiver goroutine.                   component=channel
ERRO[2020-12-29T13:34:42.495+01:00] Failed to establish two-way communication channel: Bad Gateway: HTTP status code 502; transport: received the unexpected content-type "text/html".  component=client
DEBU[2020-12-29T13:34:42.496+01:00] Connection closed.                            component=client

So currently I’m trying to figure out why grpc_pass fails with 502 status code. If I figure it out I’ll update you. If you have any ideas please let me know.


Hi @Andrej,
could you try to replace grpc:// with grpcs://, please?

And if it doesn’t work, please check this post: Unable to setup proxy between pmm-client and server - #3 by jojojoseff


Hi @nurlan,
What I ended up doing is exposing the container on port 8443 and setting up a script that copies the Let’s Encrypt certificates into /srv/nginx in pmm-server. So I’m not using the proxy to connect agents, but I am using it to access Grafana. I do feel like using grpcs:// would solve the issue, but unfortunately I had to move on. This way I also don’t have to worry about updating the gRPC locations if they change in the future.
Thanks everyone for help!


The solution @Andrej provided is correct, with one exception: the nginx reverse proxy connects to the nginx inside the container over gRPC on a TLS-secured port, so you should use grpcs instead of grpc.

Here is the correct config:

location /agent. {
    grpc_pass grpcs://managed-grpc;
    client_max_body_size 0;
}
location /inventory. {
    grpc_pass grpcs://managed-grpc;
}
location /management. {
    grpc_pass grpcs://managed-grpc;
}
location /server. {
    grpc_pass grpcs://managed-grpc;
}
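
Putting the pieces from this thread together, a minimal sketch of the outer nginx config might look like the following. It assumes the container is still published on 127.0.0.1:8443 and reuses the server.domain placeholder and Certbot paths from the original post; the gRPC location list would need adjusting if the prefixes change in a later PMM version.

```nginx
upstream managed-grpc {
    server 127.0.0.1:8443;
    keepalive 32;
}

server {
    server_name server.domain;

    # gRPC runs over HTTP/2, so the listener needs http2 enabled;
    # without it pmm-agent timed out while connecting.
    listen 443 ssl http2;
    listen [::]:443 ssl http2 ipv6only=on;
    ssl_certificate /etc/letsencrypt/live/server.domain/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/server.domain/privkey.pem;
    include /etc/letsencrypt/options-ssl-nginx.conf;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;

    # gRPC services used by pmm-agent. The upstream port speaks TLS,
    # hence grpcs:// rather than grpc:// (plain grpc:// gave the 502).
    location /agent. {
        grpc_pass grpcs://managed-grpc;
        client_max_body_size 0;
    }
    location /inventory. {
        grpc_pass grpcs://managed-grpc;
    }
    location /management. {
        grpc_pass grpcs://managed-grpc;
    }
    location /server. {
        grpc_pass grpcs://managed-grpc;
    }

    # Everything else is ordinary HTTPS proxying.
    location / {
        include /etc/nginx/proxy_params;
        proxy_pass https://127.0.0.1:8443;
    }
}
```

Note that on nginx 1.25.1 and later the http2 parameter of listen is deprecated in favour of a separate "http2 on;" directive inside the server block.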

Yeah, I can confirm that this works. I had essentially the same issue as OP, and after setting up the standard minimal nginx proxy config I typically use, I just had to add the grpc_pass bits referenced here.

Thanks!


Although! Let me add that I seem to have broken the news feed and update panels in the UI.

News feed renders like this

and the update panel a) takes a long time to render, and b) always says I am up to date even though I am on 2.28.02.

Browser devtools also show 400s for things like wss://pmm-prod-01.delphi.cmu.edu/graph/api/live/ws.

Is it all related to proxying with Nginx? Am I missing further routes in the config?

Thanks,


@Yassine_Afnisse Do you have any insight into this by chance?


It’s possible the wss error is related… While WebSockets work over HTTP(S), your nginx needs a bit more configuration to allow the connection upgrades that WebSockets require:

      # WebSocket support
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade"; 
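
A slightly fuller sketch, following the pattern from the nginx WebSocket proxying documentation: the map sends "Connection: upgrade" only when the client actually asked for an upgrade. The $connection_upgrade variable name is just a convention, and the 127.0.0.1:8443 upstream is assumed from the setup earlier in this thread.

```nginx
# In the http {} context: derive the Connection header from the client's
# Upgrade header, so ordinary requests are not marked as upgrades.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

server {
    # listen / ssl_* directives as in the existing server block

    location / {
        include /etc/nginx/proxy_params;
        proxy_pass https://127.0.0.1:8443;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}
```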

I would like to think it’s that easy, but I have a sneaking suspicion that SSL is going to trip things up: you have to terminate the WebSocket at your proxy with certificate 1 and then re-encrypt over a second WebSocket connection with certificate 2 (the one installed on PMM Server). I’ll keep my fingers crossed for you there!

The other issue is on the outbound side rather than the inbound (all the configs above only come into play when the connection is initiated from your external proxy to PMM Server). News and version updates are requests generated by PMM Server to something external: check.percona.com (I think) and repo.percona.com, respectively.

It could be your proxy but I believe it’s a wholly different configuration you’d have to have done to force your PMM server to send internally initiated traffic to your proxy (but it can be done).

The easiest way to check this is to install telnet inside the docker/ami/ovf (yum install telnet) and run telnet check.percona.com 443. You won’t be able to send any real traffic, but if you get an “Escape character is ^]” message you’ll at least know the networking is correct.

Since you mention “edu” in the domain, I would assume that your network team has decided to restrict outbound internet access for whatever network zone your PMM server sits in. You could test that theory by telnetting to 1) something outside the campus network, then 2) something inside the campus network but in a different zone than your PMM server, and then 3) something in the same zone as your PMM server. I’m guessing 3 will work, 2 might, and 1 probably won’t. From there we might be able to help more.


Hi @steve.hoffman,

Thanks for the tips. The “proxy_*” directives helped with one of the specific errors I had in my proxy-to-proxy config.

To troubleshoot a bit further, I’ve taken out the extra Nginx proxy server and its config and am now just running the Docker container (I think I’ll end up giving the container a cert that works, exposing it on 443, and calling it a day; simpler, I think).

I have two outstanding issues:

401 on: https://pmm-prod-01.delphi.cmu.edu/v1/Platform/UserStatus

  • I figure that is checking to see if I am logged in/connected to a Percona account. I probably don’t care too much about this unless it affects the next one.

503 on: https://pmm-prod-01.delphi.cmu.edu/v1/Updates/Check

I am seeing some pmm-managed.log lines like this:

Loading "changelog" plugin
Loading "fastestmirror" plugin
Loading "ovl" plugin
Config time: 0.012
rpmdb time: 0.000
ovl: Copying up (0) files from OverlayFS lower layer
Yum version: 3.4.3
Building updates object
Setting up Package Sacks
Loading mirror speeds from cached hostfile
 * base: distro.ibiblio.org
 * epel: mirrors.wcupa.edu
 * extras: mirror.cogentco.com
 * updates: mirror.datto.com
. Error: signal: killed  component=supervisord/pmm-update-checker
WARN[2022-09-21T14:48:28.185+00:00] RPC /server.Server/CheckUpdates done in 30.001842808s with gRPC error: rpc error: code = Unavailable desc = failed to check for updates  request=6228e13a-39bc-11ed-af25-0242ac110002

Any thoughts here? I don’t think this is an Nginx config issue as I am now exposing and connecting to the container’s config.

Do I need to be logged in to a Percona account to get updates?

Thanks,


Hmm, interesting.

I poked around the forums a bit and saw some references to issues with yum.

I just did yum clean all, which didn’t seem to do anything, but then after a yum list the next browser refresh shows me that I have an update.

About to try the upgrade :crossed_fingers:


You are correct on the 401: it just means you’re not logged into the Percona Platform with a Percona account. As an “anonymous” user you’d get the basic payload of Advisors from our SaaS portal as well as the basic tier of Alerting templates. You could check either Advisors or Alerting (if enabled), and you should have about 10 of each, which confirms you are getting out to the internet.

The second is actually the call to check for PMM updates and a 503 is puzzling…it’s possible that a mirror mis-fired? You should be able to click the “check now” for updates from the main panel and if a new version is available you’ll see the “upgrade now” option. You do not need to be logged in to get software updates.

It’s actually pretty easy to use your own trusted certs. Here’s the webpage with the details. I just use letsencrypt certs and no more “untrusted browser” in chrome.


Thanks, Steve.

It was interesting, and I definitely blame yum to some degree, but I think other chaos was at play as well.

After provoking yum (I think it was ultimately clearing the cache via yum clean all, but it seems I also had to issue yum list) I was able to get the update panel to show that an update was available. I tried to update, but the playbook kept running and failing (I have a log if you are interested). So I just ran the install script again, which worked, and now I’m at 2.30.0. Will see how the next one goes.

And yep, I’ve decided to just let the PMM Nginx proxy do its thing, so I have started the container with the appropriate volume mounts to make the LE certs available to PMM’s Nginx. Works!

Thanks for engaging with my question :bowing_man:
