While upgrading my PMM Docker instance on the production server, I made a typo that probably caused (or at least contributed to) all the grief and confusion I’m suffering.
After stopping the v1.12.0 container, renaming it, and pulling the v1.13.0 image, I started the new container using ‘percona/pmm-server:1.1.0’ as the image name. There was no such image on the server, in which case, according to the manual, Docker pulls the missing image. The download progress didn’t show on the command line (as it does when you run ‘docker pull’), so I didn’t spot the typo until later, and I’m unsure whether this is the cause.
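For anyone wanting to catch this kind of typo before it triggers a silent pull, here is a minimal sketch of a pre-flight check. It assumes Python 3.7+ and the Docker CLI on the PATH; the helper names (parse_image_list, image_is_local, list_local_images) are made up for illustration and are not part of PMM or Docker.

```python
import subprocess


def parse_image_list(output):
    """Turn `docker images --format '{{.Repository}}:{{.Tag}}'` output into a set."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def image_is_local(image, local_images):
    """Return True if the exact repository:tag reference is in local_images."""
    # Docker treats a reference without a tag as ':latest'.
    # Only look at the part after the last '/', so a registry port is not
    # mistaken for a tag.
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        image += ":latest"
    return image in local_images


def list_local_images():
    """Ask the Docker CLI for locally stored images (needs docker installed)."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        check=True, capture_output=True, text=True,
    )
    return parse_image_list(result.stdout)


if __name__ == "__main__":
    wanted = "percona/pmm-server:1.13.0"  # the tag that was pulled beforehand
    if not image_is_local(wanted, list_local_images()):
        print(wanted + " is not present locally; 'docker run' would pull it")
```

Running the check with ‘percona/pmm-server:1.1.0’ as the wanted image would have flagged the mismatch before the container was started.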
On startup, the “CPU Busy” panels under “Environment overview” would time out and throw an error, while the server choked because its IOPS capacity was maxed out (all 4 cores reported >70% iowait). Later on, I noticed that one of the two prometheus processes would eventually get killed by the oom-killer just after its memory consumption rapidly reached 70% (of 8 GB). I then removed the ‘bad’ v1.1.0 container and started a new one specifying the correct image.
Strangely, by the next day the server and the PMM running on it had stabilized, but now some things were missing - the PMM app had somehow been disabled!
After enabling the PMM app/plugin, I noticed that in addition to the singlestat-panel problem, at least the following dashboards were missing: Network Overview, Overview NUMA metrics,… while the “Home Dashboard” displays as shown in the attached screenshot when loaded via http:///
Removing and re-creating the container doesn’t fix the missing dashboards and panel.
I suspect the v1.13.0 image somehow got corrupted, and I’m considering removing and re-downloading it, but because Docker is a tool I’m not too familiar with, I’d appreciate some guidance on how to fix this as painlessly as possible.
Could you provide the contents of the /var/log/dashboard-upgrade.log file, please? Or, if you don’t have that, could you please run the import dashboards script manually and share the output with us? Thanks.
Hi Lorraine, the best I can do is the ‘screen’ output from when I glanced at the log myself. See attached.
I’m afraid the log itself got nuked when I removed the v1.13.0 image and pulled it again from the repo.
A silly question: Where is the binary I would run so as to import the dashboards manually?
On a side note: I “fixed” the two missing dashboards (Overview NUMA metrics and Network Overview) by cheating: I dumped their templated JSON from my staging PMM and imported it into production. Two of the panels I could fix by changing their type in the JSON from ‘pmm-singlestat-panel’ to ‘singlestat’… but that’s just a quick fix that I’m not sure will survive the next upgrade, so I’m still looking for the cause of this.
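In case it helps anyone applying the same stopgap, the panel-type tweak can be sketched roughly as below. This is only an illustration: it assumes the plugin panel id is spelled ‘pmm-singlestat-panel’, that the dashboard has been exported to plain JSON, and it walks the whole document rather than relying on any particular dashboard layout. The function name retype_panels is hypothetical.

```python
import json


def retype_panels(node, old_type, new_type):
    """Recursively replace "type" values equal to old_type.

    Returns the number of objects changed. Any JSON object whose "type"
    matches is rewritten, which for a plugin-specific id should only be panels.
    """
    changed = 0
    if isinstance(node, dict):
        if node.get("type") == old_type:
            node["type"] = new_type
            changed += 1
        for value in node.values():
            changed += retype_panels(value, old_type, new_type)
    elif isinstance(node, list):
        for item in node:
            changed += retype_panels(item, old_type, new_type)
    return changed


if __name__ == "__main__":
    # Tiny made-up dashboard fragment, not taken from a real PMM export.
    dashboard = json.loads(
        '{"rows": [{"panels": ['
        '{"type": "pmm-singlestat-panel", "title": "CPU Busy"},'
        '{"type": "graph", "title": "Load"}]}]}'
    )
    count = retype_panels(dashboard, "pmm-singlestat-panel", "singlestat")
    print(count, "panel(s) retyped")
    print(json.dumps(dashboard, indent=2))
```

As noted above, this only papers over the symptom; a later dashboard re-import may well put the original panel type back.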
Navigating to Configuration → Plugins → PMM → Dashboards shows re-import options for each dashboard, but the two I imported from the staging instance don’t show up there.
Seems to me like the problem lies with the PMM app/plugin itself, as if it got corrupted and was consequently auto-disabled - ’cause I sure as hell didn’t disable it.