Description:
I have a PMM server running as a Docker container on an EC2 instance. This EC2 instance also serves as a replica of a master server running Percona Server for MySQL 8.0.32, managing approximately 3,000 databases with a total size of 2.4TB.
Issue Details:
The setup was functioning correctly until a recent upgrade of the replica failed. To recover, I restored the replica from an AMI backup.
Problem Encountered:
After the restoration, when starting the PMM agent on the master server, the following error occurs:
failed to get connection service info: timeout (context deadline exceeded)
After multiple attempts, I managed to run the PMM agent using the following command:
pmm-admin add mysql --query-source=slowlog --size-slow-logs=1GiB --skip-connection-check --username=pmmuser --password=Pmm~2024 db1_mysql_live localhost:3306
However, once the agent started, IO wait climbed above 50 and I had to stop it. When I tried running the agent again at night, it consumed all available memory, with mysqld_exporter using more than 45GB of RAM, and I had to stop it again to prevent a server crash.
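For context, I was checking the exporter's memory roughly like this (approximate commands, not an exact transcript):
ps -o pid,rss,etime,cmd -C mysqld_exporter    # resident memory (RSS, in KB) of the exporter process
pmm-admin list                                # which exporters/agents pmm-agent is running on this node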
Request for Assistance:
I seek advice on diagnosing and resolving the issues with the PMM agent and mysqld_exporter
to ensure stable operation without excessive resource consumption.
Thank you for your assistance.
Best regards,
Do you know whether the IO activity is heavier on reads or writes, and which process(es) are doing the bulk of it? My guess is it's both, but if one or the other were dominating the available IO, that could narrow down the culprit.

As a best practice, we don't recommend co-mingling a production database server with your PMM instance, even if PMM is only sharing hardware with a replica. While we build PMM to be the more submissive process in terms of resource consumption, high activity on the DB server (e.g. receiving a ton of replication data from the primary node) has a compounding effect, because PMM uses the same disk/CPU/RAM to capture and store all of that activity.
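If you haven't checked yet, a few standard tools (assuming the sysstat and iotop packages are installed on the host) will show whether reads or writes dominate and which processes are responsible:
iostat -x 5 3     # per-device read/write throughput and utilization, 3 samples 5 seconds apart
pidstat -d 5 3    # per-process disk read/write rates over the same window
iotop -o -b -n 3  # batch-mode snapshots showing only the processes actually doing IO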
A couple of things that jump to mind that can cause incredibly high reads: a giant slow log file left over after the upgrade (i.e. a ton of transactions waiting to be ingested by mysqld_exporter). It's also possible that table stats are hurting you, but we typically limit that to 2,000 tables and disable table stats above that (and you should have seen a message to that effect when you added the node to monitoring).
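A quick sanity check on both theories (the slow log path below is just an example; use whatever slow_query_log_file points to on your server):
ls -lh /var/log/mysql/mysql-slow.log    # example path -- how big did the slow log get after the upgrade?
SELECT COUNT(*) FROM information_schema.tables WHERE table_schema NOT IN ('mysql','sys','performance_schema','information_schema');    # rough table count across your ~3,000 databases
If the table count turns out to be the problem, I believe pmm-admin add mysql also accepts a --disable-tablestats option you could use when re-adding the service.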
The original error message you posted indicated some sort of network/name-resolution error or a service being down. So while the PMM agent was finally able to start, is it able to communicate with the PMM server? If it can't, vmagent will start caching results locally to disk (in /tmp), up to 1GB, and you can see how quickly that can overload IO on a system with all of this read/write-intensive processing happening in one spot.
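An easy way to confirm the agent-to-server link from the database host:
pmm-admin status                    # shows whether pmm-agent is connected to the PMM server and lists the running agents
du -sh /tmp/vmagent* 2>/dev/null    # directory name is a guess on my part -- looking for data vmagent may have cached under /tmp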
I think the memory footprint is partly an after-effect of IO being backed up: pmm-agent is receiving information to process faster than it can get rid of it (or you're getting rid of the data on the same infrastructure that's ingesting it in the first place).
I hope this gives you a few places to look. A little more information about versions, type of storage, etc. might give some more clues about what changed post-upgrade.
I tried installing the PMM server from the AMI on AWS. After I finished the installation, I connected the replica server to the new PMM server, and I am facing high IO wait again, which is affecting our customers.
I use these slow query settings in MySQL:
log_output=file
slow_query_log=ON
long_query_time=0
log_slow_rate_limit=100
log_slow_rate_type=query
log_slow_verbosity=full
log_slow_admin_statements=ON
log_slow_slave_statements=ON
slow_query_log_always_write_time=1
slow_query_log_use_global_control=all
innodb_monitor_enable=all
userstat=1
How can I limit the resources used by mysqld_exporter, and what are the best slow query settings I can use?
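For example, would a more conservative variant along these lines be a reasonable middle ground (the values below are just my guess)?
long_query_time=0.1
log_slow_rate_limit=1000
log_slow_rate_type=query
log_slow_verbosity=full
slow_query_log_always_write_time=1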
Thanks in advance