We hit the same issue: full and incremental schedules competing for the pgBackRest lock, failed PerconaPGBackup objects piling up, and eventually all scheduled backups stopping. This persists beyond v2.4.x; similar behavior is reported in "Can't start backup. Previous backup is still in progress", where a stuck PerconaPGBackup in "Starting" state blocks all future backups.
Our workaround was to bypass the operator’s native scheduling entirely and use external Kubernetes CronJobs that kubectl exec into the repo-host pod. The key difference: the script checks if a backup
is already running before attempting one, so it never creates a conflicting backup that would generate a stuck PerconaPGBackup object.
Here’s the core logic (same for both full and incremental, just change --type=full to --type=incr):
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgbackrest-full-weekly
  namespace: postgres-operator
spec:
  schedule: "30 2 * * 0"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          serviceAccountName: pgbackrest-cronjob-sa
          containers:
            - name: backup-trigger
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # Find the repo-host pod
                  REPO_HOST=$(kubectl get pod -n postgres-operator \
                    -l postgres-operator.crunchydata.com/data=pgbackrest \
                    --field-selector=status.phase=Running \
                    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
                  if [ -z "$REPO_HOST" ]; then
                    echo "SKIP: No running repo-host pod found"
                    exit 0
                  fi
                  # Check that the stanza is initialized
                  STANZA_INFO=$(kubectl exec -n postgres-operator "$REPO_HOST" \
                    -c pgbackrest -- pgbackrest info --stanza=db \
                    --output=json 2>/dev/null) || {
                    echo "SKIP: stanza not ready yet"
                    exit 0
                  }
                  # KEY: check whether another backup holds the lock
                  if echo "$STANZA_INFO" | grep -q '"held":true'; then
                    echo "SKIP: Another backup is already running"
                    exit 0
                  fi
                  # Safe to run
                  kubectl exec -n postgres-operator "$REPO_HOST" \
                    -c pgbackrest -- pgbackrest backup --stanza=db \
                    --type=full --log-level-console=info
          restartPolicy: OnFailure
```
You’ll need a ServiceAccount + Role + RoleBinding with pods: [get, list] and pods/exec: [create] in the namespace.
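For completeness, a minimal sketch of that RBAC setup. The resource names (`pgbackrest-cronjob-role`, `pgbackrest-cronjob-rb`) are placeholders of my choosing; only `pgbackrest-cronjob-sa` must match the `serviceAccountName` in the CronJob above:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pgbackrest-cronjob-sa
  namespace: postgres-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pgbackrest-cronjob-role   # placeholder name
  namespace: postgres-operator
rules:
  # Needed to find the repo-host pod
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  # Needed for kubectl exec into it
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pgbackrest-cronjob-rb     # placeholder name
  namespace: postgres-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pgbackrest-cronjob-role
subjects:
  - kind: ServiceAccount
    name: pgbackrest-cronjob-sa
    namespace: postgres-operator
```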
Why this works where native scheduling doesn’t:
- concurrencyPolicy: Forbid — Kubernetes prevents overlapping CronJob runs
- ttlSecondsAfterFinished: 86400 — completed Jobs cleaned up after 24h, no backlog
- successfulJobsHistoryLimit/failedJobsHistoryLimit: 3 — caps Job object history
- The `"held":true` lock check — gracefully skips instead of failing and creating stuck PerconaPGBackup objects
- Bypasses operator state tracking entirely — no pgv2.percona.com/backup-in-progress annotation issues
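The lock check is easy to verify in isolation. A minimal sketch, assuming the `status.lock.backup.held` field that recent pgBackRest releases include in `info --output=json`; the JSON here is a hand-made stand-in, not real cluster output:

```shell
# Standalone check of the lock-detection logic from the CronJob script.
# ASSUMPTION: pgbackrest info --output=json reports lock state under
# status.lock.backup.held (present in recent pgBackRest versions).
STANZA_INFO='[{"name":"db","status":{"code":0,"lock":{"backup":{"held":true}},"message":"ok"}}]'

if echo "$STANZA_INFO" | grep -q '"held":true'; then
  RESULT="SKIP: Another backup is already running"
else
  RESULT="RUN"
fi
echo "$RESULT"
```

A plain `grep` is deliberately loose — it matches any `"held":true` in the document, not just the backup lock. If you want a stricter check and have `jq` in the image, parse the path explicitly instead.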
When switching to this approach, remove the `schedules` block from your PerconaPGCluster CR so the operator and the CronJobs don't both trigger backups. We've been running this in production (weekly full + 4-hourly incremental to S3) for months with zero backlog issues.
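For reference, this is the kind of block to delete from the CR. The repo name and cron expressions below are illustrative placeholders; match them against your own spec:

```yaml
# PerconaPGCluster CR fragment: delete the schedules key so the operator
# stops creating its own backup jobs (repo name and crons are examples)
spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:            # <- remove this whole block
            full: "30 2 * * 0"
            incremental: "0 */4 * * *"
```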