I am refining my backup process for pmm-server (a container with the pmm-data volume mounted at /srv). The current PMM2 directions say to stop pmm-server and then docker cp the files into a subdirectory. docker cp doesn’t preserve file uid:gid, so I’m testing an alternative: start a temporary busybox container, exec into it, and tar the /srv directory to STDOUT, redirected into a file on the host. (Basic testing worked, but so far it has not worked on an actual pmm-data container.) Maybe a better approach would be to just tar up the grafana and postgres directories and skip the metrics data.
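For anyone who wants to try the same idea, here is a minimal sketch of the tar-to-STDOUT approach. It is an illustration, not the documented PMM procedure. It uses a one-shot docker run instead of exec’ing into a long-lived container, and the function names, the read-only mount, and the subdirectory names are my own assumptions. tar records the numeric uid and gid in each entry header, which is the part docker cp loses.

```python
import subprocess
from pathlib import Path


def build_tar_command(volume="pmm-data", mount="/srv", image="busybox", subdirs=None):
    """Return the docker argv that streams a tar of the volume to stdout.

    subdirs is an optional list of paths relative to the mount point
    (for example ["grafana", "postgres"]; those names are assumptions).
    """
    paths = list(subdirs) if subdirs else ["."]
    return [
        "docker", "run", "--rm",
        "-v", f"{volume}:{mount}:ro",  # mount the named volume read-only
        image,
        "tar", "-C", mount, "-cf", "-", *paths,  # "-" writes the archive to stdout
    ]


def backup_volume(dest, **kwargs):
    """Run the temporary container and write its stdout to a file on the host."""
    dest = Path(dest)
    with dest.open("wb") as out:
        subprocess.run(build_tar_command(**kwargs), stdout=out, check=True)
    return dest
```

Calling backup_volume("pmm-data.tar") would archive all of /srv, while backup_volume("pmm-config.tar", subdirs=["grafana", "postgres"]) would skip the metrics data. To get the ownership back on restore, the archive has to be extracted as root.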
What makes me unhappy is the “stop pmm-server” portion. Has anybody tried just snapshotting the volume or tarring up /srv on a running instance? And has anyone actually used such a backup to recover an instance? I’d like to avoid shutting down pmm-server at $interval for a backup of both config and metrics data.
Any feedback or suggestions?
I have, and it worked, but my testing wasn’t thorough enough to call it safe. As the volume of data grows, the backup takes longer, so the data can be changing underneath you while the backup runs, which creates potential for corruption or inconsistencies.
The static files are actually the easy part: it is quite safe to just cp configs and the like without stopping PMM. The challenge is the databases, and getting a consistent copy that won’t hurt you later on (hence the recommendation to stop PMM and quiesce the DBs). In my case, I probably had some minor corrupt entries in one of the 3 DBs whose files I simply copied, because my copy happened while data was being written but not yet fully committed. They never resulted in app errors, though.
What I’ve actually been tinkering with is the individual backup mechanism for each of the 3 databases (Postgres, VictoriaMetrics and ClickHouse), and I have a basic pg_dump/import routine that gets a more stable export of the database without downtime. All well and good, but I have to run that routine inside the container and then copy the resulting file out somewhere safe.
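As a rough illustration of the Postgres part only, the dump can be streamed straight to the host, which avoids the separate copy-out step. This is a sketch under assumptions, not the routine described above: the container name, database user, and database name are placeholders to check against your own install, and the function names are hypothetical. VictoriaMetrics and ClickHouse are not covered here.

```python
import subprocess


def build_pg_dump_command(container="pmm-server", user="postgres", database="pmm-managed"):
    """Return the argv that runs pg_dump inside the container.

    The user and database defaults are assumptions; adjust them to your setup.
    """
    return [
        "docker", "exec", container,
        "pg_dump", "-U", user,
        "--format=custom",  # compressed archive format, restorable with pg_restore
        database,
    ]


def dump_postgres(dest, **kwargs):
    """Write the dump to a file on the host instead of inside the container."""
    with open(dest, "wb") as out:
        subprocess.run(build_pg_dump_command(**kwargs), stdout=out, check=True)
    return dest
```

pg_dump takes a consistent snapshot of a single database while it stays online, which is why it is safer than copying the data files of a running server.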
We’ve also got work going on to externalize the DBs (ClickHouse is testable now), and this is where it gets fun: now I can use native DB backup utilities to get good, stable copies of the database, or (well, and, I guess) use each DB’s native HA approach to replicate data and have a warm DR site at the ready with minimal RPO/RTO!
Last but not least, you can look at using a bind mount (instead of a container volume), and if that’s mounted to a SAN you can just use its built-in snapshotting to do the backup at the block level, which should be rock solid.
I realize this probably isn’t a ton of help right now, but hopefully it gives you some ideas of what you might be able to do.
Your penultimate paragraph actually spoke to me. I do have a separate EBS volume for /var/lib/docker on this machine. One approach I think I will use is to snapshot that volume at some interval. I can restore the snapshot and get to the pmm-data volume in an emergency. I care less about any missing metrics data than about the configs. I was not able to automate adding data sources, dashboards, users, alert rules, contact points, and notification policies, so being able to recover those quickly and easily works best for me.
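A minimal sketch of what that scheduled snapshot could look like, assuming the AWS CLI is installed and has credentials. The volume ID is a placeholder and the function names and description format are just illustrative. An EBS snapshot of a running instance is point-in-time and crash-consistent, which fits the case where the configs matter more than the latest metrics.

```python
import subprocess
from datetime import datetime, timezone


def build_snapshot_command(volume_id, now=None):
    """Return the AWS CLI argv that snapshots one EBS volume.

    volume_id is a placeholder such as "vol-0123456789abcdef0".
    """
    now = now or datetime.now(timezone.utc)
    description = f"pmm docker volume {now:%Y-%m-%dT%H:%MZ}"
    return [
        "aws", "ec2", "create-snapshot",
        "--volume-id", volume_id,
        "--description", description,
    ]


def snapshot_volume(volume_id):
    """Run the snapshot command; raises if the AWS CLI reports an error."""
    subprocess.run(build_snapshot_command(volume_id), check=True)
```

Something like snapshot_volume("vol-0123456789abcdef0") could then be run from cron at whatever interval suits you.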
I know it’s been a while, but your problem bothered me too (“stop pmm-server just to take a backup”), enough that I took a stab at a working prototype of a backup and restore solution for PMM.
I wrote it with a few things in mind:
- It should work no matter which PMM install method you used (Docker, AMI, OVF, K8s)
- It must run without taking down PMM
- It should stick to the “do no harm” mindset: don’t overload a server just to do a task
- It must be something you can schedule in cron or run right before an update
You may already have solved this another way, and I acknowledge that there are better backup solutions that are unique to your setup (e.g. VM snapshots, block-level snapshots, etc.). But if you want to look it over and maybe even give it a try, I put in some options that let you specify a storage location you may have mounted. If I can get enough feedback on it and it looks like a good general solution, I’ll work with the team to get the prerequisites installed by default and the utility included in PMM as well.
Personally, I’ve been using it for nightly backups, for restoring data to a test system that I can break things on, and even for figuring out how we can enable OVF/AMI users to upgrade PMM more easily if their instance doesn’t have public internet access (that’s still WIP).