We are seeing PMM disconnect on multiple systems hosted in AWS. This results in the loss of our monitoring data (we are mostly concerned with the MySQL Amazon Aurora Metrics) for as long as it is down. After a reboot, PMM stays connected and works as expected for approximately an hour before it fails, and it then remains disconnected until the next reboot.
I submitted a JIRA ticket for this issue, which can be found here: https://jira.percona.com/browse/PMM-4772. However, it was ruled to be an issue with our settings/configuration rather than a bug, and so it was not resolved.
Before the ticket was closed, Lalit from Percona took a look at the logs I provided at the time and mentioned a recurring error found in the pmm-rds_exporter logs:
level=error msg="Failed to filter log events: AccessDeniedException: User: arn:aws:sts:
And another in the ‘exporter’ logs:
Error scraping for collect.perf_schema.eventswaits: dial tcp: i/o timeout
I am not able to find either of these errors in the logs today.
Since then, we have upgraded PMM to version 1.17.2 (from 1.17.1) and are still seeing this issue on a daily basis.
Coming back to look at this issue today, I only see the following error in our pmm-rds_exporter log - which I have attached to this thread:
time="2019-12-05T16:40:54Z" level=error msg="Failed to filter log events: ResourceNotFoundException: The specified log group does not exist.\n\tstatus code: 400, request id: 03d37e11-c396-483c-b532-a039827d6d8d." component=enhanced source="scraper.go:109"
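For anyone wanting to check which errors still occur in a log like this, here is a minimal sketch that tallies error lines by exception name. It assumes the log uses key=value lines in the same shape as the one above. The names `parse_line` and `tally_errors` are hypothetical helpers written for this illustration and are not part of PMM or rds_exporter.

```python
import re
from collections import Counter

# Matches key=value pairs; the value is either a double-quoted string
# (backslash escapes allowed) or a bare token without whitespace.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# Picks out the first name ending in "Exception" that follows a colon.
EXC_RE = re.compile(r':\s*(\w+Exception)\b')


def parse_line(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def tally_errors(lines):
    """Count level=error lines, grouped by exception name when one is present."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") != "error":
            continue
        match = EXC_RE.search(fields.get("msg", ""))
        counts[match.group(1) if match else "other"] += 1
    return counts
```

Passing an open file handle for the exporter log to `tally_errors` would give a count per exception type, which makes it easier to see whether the earlier AccessDeniedException still appears alongside the ResourceNotFoundException.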
Does anybody know what this error relates to and whether it is contributing to our issue?
Our full log directory was shared via Dropbox (the link now shows the file as deleted).
At the time of compressing this log directory, PMM had been connected for just under an hour.
Any help would be greatly appreciated.
Thanks