PMM 2.30.0 backup questions

I am refining my backup process for pmm-server (a container with the pmm-data volume mounted at /srv). The current PMM 2 directions say to stop pmm-server and then docker cp the files into a subdirectory. Since docker cp doesn't preserve file uid:gid, I'm testing an alternative: start a temp busybox container sharing the volume, exec into it, and tar the /srv directory to STDOUT into a file on the host. (Basic testing worked, but so far it has not worked on an actual PMM data container.) Maybe a better approach would be to just tar up the grafana and postgres directories and skip the metrics data.
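The temp-container approach can be sketched roughly like this (the container name pmm-server and the output filename are assumptions; this variant streams the tar via docker run rather than exec'ing in, which avoids keeping a helper container around):

```shell
# Stop PMM so the databases under /srv are quiescent before copying.
docker stop pmm-server

# Throwaway busybox container sharing pmm-server's volumes; tar the /srv
# tree to STDOUT and capture it on the host. Unlike `docker cp`, tar
# preserves uid:gid and permissions.
docker run --rm --volumes-from pmm-server busybox \
  tar -czf - -C / srv > pmm-srv-backup.tar.gz

docker start pmm-server
```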

What makes me unhappy is the “stop pmm-server” portion. Has anybody tried just snapshotting the volume or tarring up /srv on a running instance, and actually used that backup to recover an instance? I’d like to avoid shutting down pmm-server at $interval just to back up the config and metrics data.

Any feedback or suggestions?

I have…and it worked, but I can’t say my testing was thorough enough to call it safe to do. As the volume of data grows, the backup takes longer, so the data on disk can be changing underneath you while the backup runs, creating potential for corruption or inconsistencies.

The static files part is actually easy and quite safe: you can just cp configs and the like without stopping PMM. The challenge is the databases and getting consistency that won’t hurt you later on (hence the recommendation to stop PMM and quiesce the DBs). In my case I probably had some minor corrupt entries in one of the three DBs I copied the files of, because my copy happened while data was being written but not yet fully committed, but they never resulted in app errors.
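Copying the static pieces from a running instance can look like this minimal sketch (/srv/grafana is used as an example path from the original post; verify against your own /srv layout, and remember the database directories need a consistent backup instead):

```shell
# Tar just the static/config content out of a running pmm-server,
# no downtime required. Output filename is timestamped per run.
docker exec pmm-server tar -czf - -C /srv grafana \
  > "pmm-grafana-$(date +%F).tar.gz"
```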

What I’ve actually been tinkering with is an individual backup mechanism for the three databases (PostgreSQL, VictoriaMetrics, and ClickHouse), and I have a basic pg_dump/import routine that gets a more stable export of the database without downtime. All well and good, but it’s inside the container that I have to run that routine, and then copy the resulting file out somewhere safe.
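A sketch of that kind of routine for the PostgreSQL side (the database name pmm-managed and connecting as the postgres user are assumptions based on a typical PMM 2 layout, not a documented procedure; check your own instance first):

```shell
# Take a consistent custom-format dump inside the running container...
docker exec pmm-server pg_dump -U postgres -Fc pmm-managed \
  -f /tmp/pmm-managed.dump

# ...then copy the dump file out to the host for safekeeping.
docker cp pmm-server:/tmp/pmm-managed.dump \
  "./pmm-managed-$(date +%F).dump"
```

Restoring would use pg_restore against the same database; pg_dump gives a transactionally consistent snapshot, which is what makes this safer than copying the data directory of a live server.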

We’ve also got work going on to externalize the DBs (ClickHouse is testable now), and this is where it gets fun: now I can use native DB backup utilities to get good, stable copies of the database, OR (well, I guess AND too) I can use each DB’s native HA approach to replicate the data and have a warm DR site at the ready with minimal RPO/RTO!
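With ClickHouse externalized, a native backup might be sketched like this (assumes ClickHouse 22.8+ syntax, a backup disk named 'backups' already configured on the server, and the PMM query analytics database being named pmm; all of these are assumptions about your setup):

```shell
# Ask the external ClickHouse server to take a consistent backup of the
# PMM database onto its configured backup disk.
clickhouse-client --host ch-external.example.com \
  --query "BACKUP DATABASE pmm TO Disk('backups', 'pmm-$(date +%F).zip')"
```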

Last but not least, you can look at using bind mounts (instead of a named container volume); if that path is mounted from a SAN, you can use its built-in snapshotting to back up at the block level, and that should be rock solid.
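The bind-mount variant can be sketched like so (the host path is illustrative; in practice you would also need to make sure the directory's ownership matches what PMM expects inside the container):

```shell
# Run pmm-server with /srv backed by a host directory (here on a
# SAN-backed filesystem) instead of a named Docker volume, so the
# storage layer's snapshotting can capture it at the block level.
docker run -d --name pmm-server \
  -v /mnt/san/pmm-data:/srv \
  -p 443:443 \
  percona/pmm-server:2
```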

I realize this probably isn’t a ton of help right now, but hopefully it gives you some ideas of what you might be able to do.

Your penultimate paragraph actually spoke to me. I do have a separate EBS volume for /var/lib/docker on this machine, so one approach I’ll use is to snapshot that volume at some interval. In an emergency I can restore the snapshot and get to the pmm-data volume. I care less about missing metrics data than about the configs: I was not able to automate adding data sources, dashboards, users, alert rules, contact points, and notification policies, so recovering those quickly and easily works best for me.
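That snapshot step could be as simple as the following sketch (the volume ID is a placeholder; for crash consistency you may want to briefly quiesce writes, e.g. with fsfreeze, before snapshotting):

```shell
# Snapshot the EBS volume that backs /var/lib/docker, which contains
# the pmm-data volume. Scheduling this (cron, or AWS Data Lifecycle
# Manager) gives point-in-time copies to restore from in an emergency.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "pmm-data backup $(date +%F)"
```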