Nodes are unable to connect server: Failed to establish two-way communication channel

For a while now the client nodes but also pmm-agent on the server has been unable to properly connect to the server. Im not sure when the problem started but it has been a problem since at least Sep 27 since that is how far back the log goes. Both the clients and the server is up-to-date with 2.22.0 and the problem existed before that as far as I remember.
The server is deployed from the OVA into kvm and resides on the same private network through a secondary interface as the client so there is no proxy or similar between the clients and server.
This is the pmm-agent log from one of the nodes:
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.103+02:00] Loading configuration file /usr/local/percona/pmm2/config/pmm-agent.yaml. component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/node_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/mysqld_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/mongodb_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/postgres_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/proxysql_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/rds_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/azure_exporter component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Using /usr/local/percona/pmm2/exporters/vmagent component=main
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Starting… component=client
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Connecting to https://admin:***@192.168.0.221:443/ … component=client
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.105+02:00] Starting local API server on http://127.0.0.1:7777/ … component=local-server/JSON
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.106+02:00] Started. component=local-server/JSON
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.121+02:00] Connected to 192.168.0.221:443. component=client
Oct 06 08:06:22 node.example.com pmm-agent[12652]: INFO[2021-10-06T08:06:22.121+02:00] Establishing two-way communication channel … component=client
Oct 06 08:06:27 node.example.com pmm-agent[12652]: ERRO[2021-10-06T08:06:27.110+02:00] Failed to establish two-way communication channel: context canceled. component=client

There is a different error when starting up the server. Attached that part of the log pmm-agent.log (180.2 KB)

Any idea what the problem is and how to fix it?

Hello again @fredricj!

Thanks for posting your question here. I have added a comment in the Slack thread to help anyone else to find their way here. I will take a look at the attached log and the OVF version and see if anything stands out.

In the meantime, please could you check and confirm the following:

  1. Communication is possible from the PMM agent nodes to the server, e.g.
    curl -X GET --dump-header /dev/stdout https://<your PMM server address>/v1/readyz
    
  2. Some certificate information from the server
    echo -n . | openssl s_client -connect <your PMM server address>:443 | openssl x509 -noout -dates -subject -issuer
    
  3. The OS distro & version that you are using for the agent nodes

Cheers,

Ceri

1 Like
  1. Returns:
    curl: (60) Peer’s Certificate has expired.

depth=0 O = Main Org.
verify error:num=18:self signed certificate
verify return:1
depth=0 O = Main Org.
verify error:num=10:certificate has expired
notAfter=Feb 3 09:34:27 2021 GMT
verify return:1
depth=0 O = Main Org.
notAfter=Feb 3 09:34:27 2021 GMT
verify return:1
DONE
notBefore=Feb 4 09:34:27 2020 GMT
notAfter=Feb 3 09:34:27 2021 GMT
subject= /O=Main Org.
issuer= /O=Main Org.

  1. RHEL/CentOS 7

The node are setup to use --server-insecure-tls since we are using PMM with the self-signed certificate on the server

1 Like

@fredricj thanks for the extra information.

You have an expired certificate, you should remedy that first of all.

The log shows that there has been a successful connection:

Oct 05 10:47:09 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:09.513+02:00] Connecting to https://admin:***@192.168.x.x:443/ ...  component=client                                                
Oct 05 10:47:09 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:09.520+02:00] Connected to 192.168.x.x:443.               component=client                                                          
Oct 05 10:47:09 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:09.520+02:00] Establishing two-way communication channel ...  component=client                                                        
Oct 05 10:47:14 node.example.com pmm-agent[23548]: ERRO[2021-10-05T10:47:14.517+02:00] Failed to establish two-way communication channel: context canceled.  component=client                                  
Oct 05 10:47:28 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:28.632+02:00] Connecting to https://admin:***@192.168.x.x:443/ ...  component=client                                                
Oct 05 10:47:28 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:28.640+02:00] Connected to 192.168.x.x:443.               component=client                                                          
Oct 05 10:47:28 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:28.640+02:00] Establishing two-way communication channel ...  component=client                                                        
Oct 05 10:47:33 node.example.com pmm-agent[23548]: ERRO[2021-10-05T10:47:33.636+02:00] Failed to establish two-way communication channel: context canceled.  component=client                                  
Oct 05 10:47:50 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:50.924+02:00] Connecting to https://admin:***@192.168.x.x:443/ ...  component=client                                                
Oct 05 10:47:50 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:50.931+02:00] Connected to 192.168.x.x:443.               component=client                                                          
Oct 05 10:47:50 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:50.932+02:00] Establishing two-way communication channel ...  component=client                                                        
Oct 05 10:47:50 node.example.com pmm-agent[23548]: INFO[2021-10-05T10:47:50.940+02:00] Two-way communication channel established in 7.843349ms. Estimated clock drift: 3.533665ms.  component=client 

Have you checked the network performance, server host performance, etc?

1 Like

Updating the certificate helped. Since the initial connection succeeded I didn’t think the certificate would have been an issue. An improved error message would certainly help in this case, by telling you why it canceled.
Thanks for your help

2 Likes

I answered too fast. It actually only helped with the error message on some nodes but not on all.
Network performance shouldn’t be an issue since the main traffic goes on other interfaces and the server load is between 0.1 and 0.4 with most of the CPU is used by victoriametrics

1 Like

Just to be sure, have you tried restarting the agents that are still experiencing an issue?

1 Like

Yes, I restarted it and also tried to remove and setup pmm-agent again

1 Like

@fredricj thanks for informing us about this, we have logged [PMM-8971] Nodes are unable to connect to PMM Server deployed on OVA - Percona JIRA to understand and investigate the issue. It would be a great help for us if you could also provide some additional logs from your setup and attach them to the ticket.

1 Like

@fredricj there is information in the troubleshooting guide that should help you get information for the Jira ticket

1 Like

do you also have pmm logs (pmm-managed)? please attach them to the ticket.
Just complete zip logs.

1 Like

if need to anonymize it: pt-secure-collect — Percona Toolkit Documentation

1 Like