Backup-agent container in MongoDB pod holds ghost disk usage until killed

Hi,

We received an alert yesterday for one of our MongoDB pods reporting its disk 90% full.

The df -h output for the mount point doesn’t match the du output:

df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  4.5G  548M  90% /data/db
du -schx /data/db
685M  /data/db
685M  total

At first we couldn’t figure out what was going on, but we noticed that if we delete the MongoDB replset pod, the df output drops back down to match du once the pod comes back up.

I tried killing the mongod process to see if it was holding onto some open file handles, but the discrepancy between df and du remained.
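For context, this is the classic way df and du diverge: a file that is unlinked while a process still holds it open keeps its blocks allocated until the last descriptor is closed. A minimal reproduction in plain shell (nothing here is PBM- or MongoDB-specific):

```shell
# A file unlinked while still open stays on disk: df counts it, du doesn't.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/ghost" bs=1M count=8 status=none
exec 3<"$tmpdir/ghost"          # hold the file open on fd 3
rm "$tmpdir/ghost"              # unlink it: du stops counting it...
link=$(readlink /proc/$$/fd/3)  # ...but the kernel still tracks the inode
echo "$link"                    # the target path ends in " (deleted)"
exec 3<&-                       # closing the fd is what actually frees the space
rmdir "$tmpdir"
```

That matches the symptom here: killing PID 1 of the backup-agent container closes its descriptors, and df immediately drops back to the du figure.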

Then I ran, for example:

% kubectl -n example exec example-mongodb-rs0-2 -c mongod -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  1.3G  3.7G  26% /data/db
% kubectl -n example exec example-mongodb-rs0-2 -c backup-agent -- kill 1  
% kubectl -n example exec example-mongodb-rs0-2 -c mongod -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  718M  4.3G  15% /data/db

and realised that it’s the backup-agent container that is holding the space.

Unfortunately, that space is genuinely consumed: as a test I tried to dd a 4 GB file and it hit ENOSPC very quickly.

I looked in /proc/1/fd etc. but couldn’t see anything concerning or obvious.
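In case it helps anyone reproducing this, here is the sweep I would run inside the backup-agent container: it checks every process visible in /proc (not just PID 1) for descriptors pointing at deleted files, and reports their sizes. Plain POSIX shell against standard Linux /proc paths; nothing PBM-specific is assumed.

```shell
# List open-but-deleted files for every process visible in /proc,
# with the size each one is still pinning on disk.
find_deleted_open() {
    for fd in /proc/[0-9]*/fd/*; do
        # Descriptors can vanish mid-scan; skip any we can't read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
        *' (deleted)')
            pid=${fd#/proc/}; pid=${pid%%/*}
            size=$(stat -Lc %s "$fd" 2>/dev/null || echo '?')
            echo "pid=$pid fd=${fd##*/} size=$size $target"
            ;;
        esac
    done
}
find_deleted_open
```

If that turns up nothing, deleted files can also be pinned by memory mappings rather than descriptors, in which case they show up in /proc/[0-9]*/maps with a "(deleted)" suffix instead.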

Our configuration is:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: "{{ mongodb_config.cluster.name }}"
  namespace: {{ app_namespace }} 
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
    - percona.com/delete-psmdb-pvc
spec:
  enableVolumeExpansion: true
  enableExternalVolumeAutoscaling: false
  crVersion: {{ mongodb_config.cluster.version }}
  image: percona/percona-server-mongodb:{{ mongodb_config.images.server }}
  imagePullPolicy: IfNotPresent
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: "{{ mongodb_config.cluster.name }}-secrets"
  pmm:
    enabled: false
    image: percona/pmm-client:{{ mongodb_config.images.pmm }}
    serverHost: monitoring-service
  replsets:
  - name: rs0
    configuration: |
      security:
        enableEncryption: false
    size: 3
    affinity:
      antiAffinityTopologyKey: "topology.kubernetes.io/zone"
    sidecars:
    - name: metrics
      image: percona/mongodb_exporter:{{ mongodb_config.images.exporter }}
      env:
      - name: EXPORTER_USER
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_USER
      - name: EXPORTER_PASS
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_PASSWORD
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: MONGODB_URI
        value: "mongodb://$(EXPORTER_USER):$(EXPORTER_PASS)@$(POD_IP):27017"
      args: ["--discovering-mode", "--compatible-mode", "--collect-all", "--log.level=warn", "--mongodb.uri=$(MONGODB_URI)"]
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
    resources:
      limits:
        cpu: "{{ mongodb_config.resources.limits.cpu }}"
        memory: "{{ mongodb_config.resources.limits.memory }}"
      requests:
        cpu: "{{ mongodb_config.resources.requests.cpu }}"
        memory: "{{ mongodb_config.resources.requests.memory }}"
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: "{{ mongodb_config.storage.className }}"
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: {{ mongodb_config.storage.capacity }}
    hidden:
      size: 1
      enabled: false
    nonvoting:
      size: 1
      enabled: false
    arbiter:
      size: 1
      enabled: false
  backup:
    enabled: true
    resources:
      requests:
        memory: "{{ mongodb_config.backup.resources.requests.memory }}"
        cpu: "{{ mongodb_config.backup.resources.requests.cpu }}"
      limits:
        memory: "{{ mongodb_config.backup.resources.limits.memory }}"
        cpu: "{{ mongodb_config.backup.resources.limits.cpu }}"
    image: percona/percona-backup-mongodb:{{ mongodb_config.images.backup }}
    startingDeadlineSeconds: 300
    storages:
      radosgw:
        type: s3
        s3:
          bucket: "{{ mongodb_bucket_name }}"
          credentialsSecret: mongodb-backup
          endpointUrl: "http://objectstore.objectstore.svc.cluster.local"
          prefix: ""
          region: ceph-objectstore
    pitr:
      enabled: true
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-backup
        enabled: true
        schedule: "0 1 * * *"
        type: physical
        retention:
          count: 30
          type: count
          deleteFromStorage: true
        storageName: radosgw
        compressionType: gzip
        compressionLevel: 6
  logcollector:
    enabled: false
  unsafeFlags:
    tls: true
  tls:
    mode: disabled

We are using the following:

  • Percona Operator for MongoDB 1.21.1
  • MongoDB 8.0.12-4
  • backup agent (PBM) 2.11.0

pbm status:

Cluster:
========
rs0:
  - example-mongodb-rs0-0.example-mongodb-rs0.example.svc.cluster.local:27017 [P]: pbm-agent [v2.11.0] OK
  - example-mongodb-rs0-1.example-mongodb-rs0.example.svc.cluster.local:27017 [S]: pbm-agent [v2.11.0] OK
  - example-mongodb-rs0-2.example-mongodb-rs0.example.svc.cluster.local:27017 [S]: pbm-agent [v2.11.0] OK


PITR incremental backup:
========================
Status [ON]
Running members: rs0/example-mongodb-rs0-2.example-mongodb-rs0.example.svc.cluster.local:27017; 

Currently running:
==================
(none)

Backups:
========
S3 ceph-objectstore http://objectstore.objectstore.svc.cluster.local:8080/mongodb-backup-8840c0f9-4186-4fc0-8517-97db01fcb950
  Snapshots:
    2026-01-20T01:00:00Z 48.00MB <physical> success [restore_to_time: 2026-01-20T01:00:02]
    2026-01-19T01:00:00Z 52.51MB <physical> success [restore_to_time: 2026-01-19T01:00:02]
    2026-01-18T01:00:00Z 54.08MB <physical> success [restore_to_time: 2026-01-18T01:00:01]
    2026-01-17T01:00:01Z 53.47MB <physical> success [restore_to_time: 2026-01-17T01:00:03]
    2026-01-16T01:00:00Z 51.36MB <physical> success [restore_to_time: 2026-01-16T01:00:01]
    2026-01-15T01:51:41Z 49.16MB <physical> success [restore_to_time: 2026-01-15T01:51:43]
    2026-01-15T01:51:05Z 48.76MB <physical> success [restore_to_time: 2026-01-15T01:51:08]
    2026-01-15T01:50:29Z 48.52MB <physical> success [restore_to_time: 2026-01-15T01:50:31]
    2026-01-15T01:49:59Z 48.30MB <physical> success [restore_to_time: 2026-01-15T01:50:00]
    2026-01-15T01:49:25Z 48.02MB <physical> success [restore_to_time: 2026-01-15T01:49:27]
    2026-01-15T01:48:55Z 47.80MB <physical> success [restore_to_time: 2026-01-15T01:48:57]
    2026-01-15T01:48:19Z 48.34MB <physical> success [restore_to_time: 2026-01-15T01:48:20]
    2026-01-15T01:47:43Z 48.11MB <physical> success [restore_to_time: 2026-01-15T01:47:45]
    2026-01-15T01:47:12Z 51.38MB <physical> success [restore_to_time: 2026-01-15T01:47:15]
    2026-01-15T01:46:42Z 47.81MB <physical> success [restore_to_time: 2026-01-15T01:46:44]
    2026-01-15T01:46:11Z 47.53MB <physical> success [restore_to_time: 2026-01-15T01:46:12]
    2026-01-15T01:44:33Z 57.57MB <physical> success [restore_to_time: 2026-01-15T01:44:34]
    2026-01-15T01:43:47Z 57.33MB <physical> success [restore_to_time: 2026-01-15T01:43:48]
    2026-01-15T01:43:11Z 57.10MB <physical> success [restore_to_time: 2026-01-15T01:43:12]
    2026-01-01T01:00:00Z 0.00B <physical> failed [ERROR: some of pbm-agents were lost during the backup] [2026-01-15T01:43:11]
    2025-12-31T01:00:00Z 56.64MB <physical> success [restore_to_time: 2025-12-31T01:00:02]
    2025-12-30T01:00:00Z 55.63MB <physical> success [restore_to_time: 2025-12-30T01:00:01]
    2025-12-29T01:00:00Z 55.00MB <physical> success [restore_to_time: 2025-12-29T01:00:02]
    2025-12-28T01:00:00Z 51.40MB <physical> success [restore_to_time: 2025-12-28T01:00:02]
    2025-12-27T01:00:00Z 50.88MB <physical> success [restore_to_time: 2025-12-27T01:00:02]
    2025-12-26T01:00:00Z 47.24MB <physical> success [restore_to_time: 2025-12-26T01:00:02]
    2025-12-25T01:00:00Z 46.98MB <physical> success [restore_to_time: 2025-12-25T01:00:01]
    2025-12-24T01:00:00Z 43.04MB <physical> success [restore_to_time: 2025-12-24T01:00:02]
    2025-12-23T01:00:00Z 42.65MB <physical> success [restore_to_time: 2025-12-23T01:00:02]
    2025-12-22T01:00:00Z 38.63MB <physical> success [restore_to_time: 2025-12-22T01:00:02]
    2025-12-21T01:00:00Z 37.97MB <physical> success [restore_to_time: 2025-12-21T01:00:02]
    2025-12-20T01:00:00Z 34.35MB <physical> success [restore_to_time: 2025-12-20T01:00:02]
  PITR chunks [1.48GB]:
    2025-12-20T01:00:03 - 2026-01-20T02:29:56
    2025-12-19T01:00:50 - 2025-12-20T01:00:02 (no base snapshot)

I couldn’t get pbm logs -x -s D -t 0 to run, but here is the kubectl logs output from a pod that is currently showing the discrepancy:

% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- df -h /data/db   
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  1.3G  3.7G  26% /data/db
% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- du -schx /data/db
648M	/data/db
648M	total

logs: GitHub gist 1d880a6bc36839e9a3ee205b3359fbee

Then, when I run kill 1 in the backup-agent container:

% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- kill 1
% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- df -h /data/db   
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  716M  4.3G  15% /data/db

Hi,

Any chance of getting help with this?