Are there plans to raise the priority of this epic? Our k8s environment is a managed platform where nodes coming and going is a normal occurrence. MongoDB handles this gracefully, but on our larger clusters, where a backup job can take upwards of 10 hours, the whole job fails. I can file a Jira ticket with some logs I’ve captured when this happens. The gist is that a pod restarts while it happens to be hosting the pbm-agent running the backup for that rs. The other pbm-agents don’t detect that this happened, and the error message contains nil as the reason for the failure.
Maybe pbm-agent could catch the termination signal and record a message somewhere, so the backup could restart when the agent comes back? Or start a backup on both secondaries and pick a winner?