Description:
I have a PMM server running as a Docker container on an EC2 instance. This EC2 instance also serves as a replica of a master server running Percona Server for MySQL 8.0.32, managing approximately 3,000 databases with a total size of 2.4TB.
Issue Details:
The setup was functioning correctly until a recent upgrade of the replica failed. To recover, I restored the replica from an AMI backup.
Problem Encountered:
After the restoration, when starting the PMM agent on the master server, the following error occurs:
failed to get connection service info: timeout (context deadline exceeded)
After multiple attempts, I managed to run the PMM agent using the following command:
pmm-admin add mysql --query-source=slowlog --size-slow-logs=1GiB --skip-connection-check --username=pmmuser --password=Pmm~2024 db1_mysql_live localhost:3306
However, once the agent started, IO wait climbed above 50 and I had to stop it. When I tried running the agent again at night, it consumed all available memory, with mysqld_exporter using more than 45GB of RAM, and I had to stop it again to prevent a server crash.
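For context, I was checking the exporter's memory roughly like this (approximate commands, not an exact transcript):
ps -o pid,rss,etime,cmd -C mysqld_exporter    # resident memory (RSS, in KB) of the exporter process
pmm-admin list                                # which exporters/agents pmm-agent is running on this node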
Request for Assistance:
I seek advice on diagnosing and resolving the issues with the PMM agent and mysqld_exporter
to ensure stable operation without excessive resource consumption.
Thank you for your assistance.
Best regards,
Do you know whether the IO activity is heavier on reads or writes, and which process(es) are doing the bulk of it? My guess is it's both, but if one or the other were dominating the available IO, that could narrow down the culprit.

As a best practice, we don't recommend co-mingling a production database server with your PMM instance, even if PMM is only sharing hardware with a replica. While we build PMM to be the more submissive process in terms of resource consumption, high activity on the DB server (e.g. receiving a ton of replication data from the primary node) has a compounding effect, because PMM uses the same disk/CPU/RAM to capture and store all of that activity.
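If you haven't checked yet, a few standard tools (assuming the sysstat and iotop packages are installed on the host) will show whether reads or writes dominate and which processes are responsible:
iostat -x 5 3     # per-device read/write throughput and utilization, 3 samples 5 seconds apart
pidstat -d 5 3    # per-process disk read/write rates over the same window
iotop -o -b -n 3  # batch-mode snapshots showing only the processes actually doing IO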
A couple of things that jump to mind that can cause incredibly high reads: a giant slow log file left over after the upgrade (i.e. a ton of transactions waiting to be ingested by mysqld_exporter). It's also possible that table stats are hurting you, but we typically limit that to 2,000 tables and disable table stats above that (and you should have seen a message to that effect when you added the node to monitoring).
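A quick sanity check on both theories (the slow log path below is just an example; use whatever slow_query_log_file points to on your server):
ls -lh /var/log/mysql/mysql-slow.log    # example path -- how big did the slow log get after the upgrade?
SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('mysql','sys','performance_schema','information_schema');    # rough table count across your ~3,000 databases
If the table count turns out to be the problem, I believe pmm-admin add mysql also accepts a --disable-tablestats option you could use when re-adding the service.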
The original error message you posted indicated some sort of network/name-resolution error or a service being down. So while the PMM agent was finally able to start, is it able to communicate with the PMM server? If it can't, vmagent will start caching results locally to disk (in /tmp), up to 1GB, and you can see how quickly that can overload IO on a system with all of this read/write-intensive processing happening in one spot.
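An easy way to confirm the agent-to-server link from the database host:
pmm-admin status                    # shows whether pmm-agent is connected to the PMM server and lists the running agents
du -sh /tmp/vmagent* 2>/dev/null    # directory name is a guess on my part -- looking for data vmagent may have cached under /tmp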
I think the memory footprint is partly an after-effect of IO being backed up: pmm-agent is receiving information to process faster than it can get rid of it (or you're getting rid of the data on the same infrastructure that's ingesting it in the first place).
I hope this gives you a few places to look. A little more information about versions, type of storage, etc. might give some more clues about what changed post-upgrade.
I tried installing the PMM server from the AMI on AWS. After I finished the installation, I connected the replica server to the new PMM server, and I am facing high IO wait again, which is affecting our customers.
I use these slow query settings in MySQL:
log_output=file
slow_query_log=ON
long_query_time=0
log_slow_rate_limit=100
log_slow_rate_type=query
log_slow_verbosity=full
log_slow_admin_statements=ON
log_slow_slave_statements=ON
slow_query_log_always_write_time=1
slow_query_log_use_global_control=all
innodb_monitor_enable=all
userstat=1
How can I limit the resources used by mysqld_exporter, and what are the best slow query settings I can use?
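For example, would a more conservative variant along these lines be a reasonable middle ground (the values below are just my guess)?
long_query_time=0.1
log_slow_rate_limit=1000
log_slow_rate_type=query
log_slow_verbosity=full
slow_query_log_always_write_time=1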
Thanks in advance