PMM constantly disconnects/fails and requires reboot to reconnect

We are seeing PMM disconnect on multiple systems hosted in AWS. This results in the loss of our monitoring data (we are mostly concerned with the MySQL Amazon Aurora metrics) for as long as it is down. After a reboot, PMM stays connected and works as expected for approximately an hour, then fails and remains disconnected until the next reboot.

I submitted a JIRA ticket for this issue (https://jira.percona.com/browse/PMM-4772). However, it was ruled to be an issue with the settings/configuration rather than a bug, so it was not resolved.

Before the ticket was closed, Lalit from Percona took a look at the logs I provided at the time and pointed out a recurring error in the pmm-rds_exporter logs:


level=error msg="Failed to filter log events: AccessDeniedException: User: arn:aws:sts:

And another in the ‘exporter’ logs:


Error scraping for collect.perf_schema.eventswaits: dial tcp: i/o timeout

I am unable to find either of these errors in the logs today.

Since then, we have upgraded PMM to version 1.17.2 (from 1.17.1) and are still seeing this issue on a daily basis.

Coming back to look at this issue today, I only see the following error in our pmm-rds_exporter log - which I have attached to this thread:

time="2019-12-05T16:40:54Z" level=error msg="Failed to filter log events: ResourceNotFoundException: The specified log group does not exist.\n\tstatus code: 400, request id: 03d37e11-c396-483c-b532-a039827d6d8d." component=enhanced source="scraper.go:109"

Does anybody know what this error relates to and whether it is contributing to our issue?

Our full log directory can be downloaded here: https://www.dropbox.com/s/rpa3f7eadp…ectLogs.tar.gz
At the time this log directory was compressed, PMM had been connected for just under an hour.

Any help would be greatly appreciated.

Thanks

I don’t have much experience with connecting PMM to Aurora yet, but I would suggest trying PMM 2. From my perspective, it looks more stable and more self-configuring than PMM 1.

Hi Ealdridge,

Did you review your instance settings against the ones recommended for PMM monitoring?

https://www.percona.com/doc/percona-monitoring-and-management/amazon-rds.html

An error occurred (ResourceNotFoundException) when calling the GetLogEvents operation: The specified log group does not exist.
You may also want to look at the issue referenced below, which describes a similar problem with AWS (not PMM-specific):
https://stackoverflow.com/questions/55436251/aws-logs-the-specified-log-group-does-not-exist
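
One way to narrow down the ResourceNotFoundException is to check whether the expected CloudWatch Logs log group exists in the region and account that PMM's credentials use. The sketch below assumes the exporter's "enhanced" component reads RDS Enhanced Monitoring data from the RDSOSMetrics log group, which is where Enhanced Monitoring publishes its metrics. The function name missing_log_groups is made up for illustration, and the region in the usage comment is a placeholder. It is only a rough diagnostic, not something taken from PMM itself.

```python
def missing_log_groups(logs_client, names=("RDSOSMetrics",)):
    """Return the log group names from `names` that CloudWatch Logs does not list.

    `logs_client` is expected to behave like boto3.client("logs").
    "RDSOSMetrics" is the group RDS Enhanced Monitoring writes to; that the
    exporter reads from it is an assumption of this sketch.
    """
    missing = []
    for name in names:
        # describe_log_groups matches by prefix, so compare exact names afterwards.
        resp = logs_client.describe_log_groups(logGroupNamePrefix=name)
        found = {group["logGroupName"] for group in resp.get("logGroups", [])}
        if name not in found:
            missing.append(name)
    return missing


# Usage (requires the third-party boto3 package and AWS credentials;
# the region below is a placeholder, use the one your Aurora cluster runs in):
#
#   import boto3
#   client = boto3.client("logs", region_name="us-east-1")
#   print(missing_log_groups(client))
```

If the group is reported as missing, it is worth confirming that Enhanced Monitoring is enabled on the Aurora instances and that the check was run against the same region and with the same credentials that PMM uses, since log groups are per region.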