Backup-agent container in MongoDB pod holds ghost disk usage until killed

Hi,

We received an alert yesterday for one of our MongoDB pods whose disk was 90% full.

When we compare the df -h output, it doesn’t match the du output for the mountpoint:

df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  4.5G  548M  90% /data/db
du -schx /data/db
685M  /data/db
685M Total

At first we couldn’t figure out what was going on, but we noticed that if you delete the MongoDB replset pod, the df output drops back down to match the du output once the pod comes back up.

I tried killing the mongod process to see if it was holding onto open file handles, but the discrepancy between df and du remained.

Then I ran, for example:

% kubectl -n example exec example-mongodb-rs0-2 -c mongod -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  1.3G  3.7G  26% /data/db
% kubectl -n example exec example-mongodb-rs0-2 -c backup-agent -- kill 1  
% kubectl -n example exec example-mongodb-rs0-2 -c mongod -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  718M  4.3G  15% /data/db

and realised that it’s the backup-agent container which is holding something.

Unfortunately, that space is definitely being consumed: I tried to dd a 4GB file into the volume and it hit ENOSPC very quickly.

I looked in /proc/1/fd and the like but couldn’t see anything concerning or obvious.
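For reference, the check over every process in the container (not just PID 1) can be scripted like this; the pod name matches the examples in this post, and the inner script is plain POSIX sh:

```shell
# List open file descriptors whose target file has been unlinked, across
# every process visible inside the backup-agent container.
kubectl -n example exec example-mongodb-rs0-2 -c backup-agent -- sh -c '
  for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      *"(deleted)") ls -l "$fd" ;;   # unlinked but still held open
    esac
  done'
```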

Our configuration is:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: "{{ mongodb_config.cluster.name }}"
  namespace: {{ app_namespace }} 
  finalizers:
    - percona.com/delete-psmdb-pods-in-order
    - percona.com/delete-psmdb-pvc
spec:
  enableVolumeExpansion: true
  enableExternalVolumeAutoscaling: false
  crVersion: {{ mongodb_config.cluster.version }}
  image: percona/percona-server-mongodb:{{ mongodb_config.images.server }}
  imagePullPolicy: IfNotPresent
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com
    apply: disabled
    schedule: "0 2 * * *"
    setFCV: false
  secrets:
    users: "{{ mongodb_config.cluster.name }}-secrets"
  pmm:
    enabled: false
    image: percona/pmm-client:{{ mongodb_config.images.pmm }}
    serverHost: monitoring-service
  replsets:
  - name: rs0
    configuration: |
      security:
        enableEncryption: false
    size: 3
    affinity:
      antiAffinityTopologyKey: "topology.kubernetes.io/zone"
    sidecars:
    - name: metrics
      image: percona/mongodb_exporter:{{ mongodb_config.images.exporter }}
      env:
      - name: EXPORTER_USER
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_USER
      - name: EXPORTER_PASS
        valueFrom:
          secretKeyRef:
            name: "{{ mongodb_config.cluster.name }}-secrets"
            key: MONGODB_CLUSTER_MONITOR_PASSWORD
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: MONGODB_URI
        value: "mongodb://$(EXPORTER_USER):$(EXPORTER_PASS)@$(POD_IP):27017"
      args: ["--discovering-mode", "--compatible-mode", "--collect-all", "--log.level=warn", "--mongodb.uri=$(MONGODB_URI)"]
    podDisruptionBudget:
      maxUnavailable: 1
    expose:
      enabled: false
    resources:
      limits:
        cpu: "{{ mongodb_config.resources.limits.cpu }}"
        memory: "{{ mongodb_config.resources.limits.memory }}"
      requests:
        cpu: "{{ mongodb_config.resources.requests.cpu }}"
        memory: "{{ mongodb_config.resources.requests.memory }}"
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: "{{ mongodb_config.storage.className }}"
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: {{ mongodb_config.storage.capacity }}
    hidden:
      size: 1
      enabled: false
    nonvoting:
      size: 1
      enabled: false
    arbiter:
      size: 1
      enabled: false
  backup:
    enabled: true
    resources:
      requests:
        memory: "{{ mongodb_config.backup.resources.requests.memory }}"
        cpu: "{{ mongodb_config.backup.resources.requests.cpu }}"
      limits:
        memory: "{{ mongodb_config.backup.resources.limits.memory }}"
        cpu: "{{ mongodb_config.backup.resources.limits.cpu }}"
    image: percona/percona-backup-mongodb:{{ mongodb_config.images.backup }}
    startingDeadlineSeconds: 300
    storages:
      radosgw:
        type: s3
        s3:
          bucket: "{{ mongodb_bucket_name }}"
          credentialsSecret: mongodb-backup
          endpointUrl: "http://objectstore.objectstore.svc.cluster.local"
          prefix: ""
          region: ceph-objectstore
    pitr:
      enabled: true
      compressionType: gzip
      compressionLevel: 6
    tasks:
      - name: daily-backup
        enabled: true
        schedule: "0 1 * * *"
        type: physical
        retention:
          count: 30
          type: count
          deleteFromStorage: true
        storageName: radosgw
        compressionType: gzip
        compressionLevel: 6
  logcollector:
    enabled: false
  unsafeFlags:
    tls: true
  tls:
    mode: disabled

We are using the following:

  • Percona Operator for MongoDB 1.21.1,
  • MongoDB 8.0.12-4 and
  • backup agent 2.11.0

pbm status:

Cluster:
========
rs0:
  - example-mongodb-rs0-0.example-mongodb-rs0.example.svc.cluster.local:27017 [P]: pbm-agent [v2.11.0] OK
  - example-mongodb-rs0-1.example-mongodb-rs0.example.svc.cluster.local:27017 [S]: pbm-agent [v2.11.0] OK
  - example-mongodb-rs0-2.example-mongodb-rs0.example.svc.cluster.local:27017 [S]: pbm-agent [v2.11.0] OK


PITR incremental backup:
========================
Status [ON]
Running members: rs0/example-mongodb-rs0-2.example-mongodb-rs0.example.svc.cluster.local:27017; 

Currently running:
==================
(none)

Backups:
========
S3 ceph-objectstore http://objectstore.objectstore.svc.cluster.local:8080/mongodb-backup-8840c0f9-4186-4fc0-8517-97db01fcb950
  Snapshots:
    2026-01-20T01:00:00Z 48.00MB <physical> success [restore_to_time: 2026-01-20T01:00:02]
    2026-01-19T01:00:00Z 52.51MB <physical> success [restore_to_time: 2026-01-19T01:00:02]
    2026-01-18T01:00:00Z 54.08MB <physical> success [restore_to_time: 2026-01-18T01:00:01]
    2026-01-17T01:00:01Z 53.47MB <physical> success [restore_to_time: 2026-01-17T01:00:03]
    2026-01-16T01:00:00Z 51.36MB <physical> success [restore_to_time: 2026-01-16T01:00:01]
    2026-01-15T01:51:41Z 49.16MB <physical> success [restore_to_time: 2026-01-15T01:51:43]
    2026-01-15T01:51:05Z 48.76MB <physical> success [restore_to_time: 2026-01-15T01:51:08]
    2026-01-15T01:50:29Z 48.52MB <physical> success [restore_to_time: 2026-01-15T01:50:31]
    2026-01-15T01:49:59Z 48.30MB <physical> success [restore_to_time: 2026-01-15T01:50:00]
    2026-01-15T01:49:25Z 48.02MB <physical> success [restore_to_time: 2026-01-15T01:49:27]
    2026-01-15T01:48:55Z 47.80MB <physical> success [restore_to_time: 2026-01-15T01:48:57]
    2026-01-15T01:48:19Z 48.34MB <physical> success [restore_to_time: 2026-01-15T01:48:20]
    2026-01-15T01:47:43Z 48.11MB <physical> success [restore_to_time: 2026-01-15T01:47:45]
    2026-01-15T01:47:12Z 51.38MB <physical> success [restore_to_time: 2026-01-15T01:47:15]
    2026-01-15T01:46:42Z 47.81MB <physical> success [restore_to_time: 2026-01-15T01:46:44]
    2026-01-15T01:46:11Z 47.53MB <physical> success [restore_to_time: 2026-01-15T01:46:12]
    2026-01-15T01:44:33Z 57.57MB <physical> success [restore_to_time: 2026-01-15T01:44:34]
    2026-01-15T01:43:47Z 57.33MB <physical> success [restore_to_time: 2026-01-15T01:43:48]
    2026-01-15T01:43:11Z 57.10MB <physical> success [restore_to_time: 2026-01-15T01:43:12]
    2026-01-01T01:00:00Z 0.00B <physical> failed [ERROR: some of pbm-agents were lost during the backup] [2026-01-15T01:43:11]
    2025-12-31T01:00:00Z 56.64MB <physical> success [restore_to_time: 2025-12-31T01:00:02]
    2025-12-30T01:00:00Z 55.63MB <physical> success [restore_to_time: 2025-12-30T01:00:01]
    2025-12-29T01:00:00Z 55.00MB <physical> success [restore_to_time: 2025-12-29T01:00:02]
    2025-12-28T01:00:00Z 51.40MB <physical> success [restore_to_time: 2025-12-28T01:00:02]
    2025-12-27T01:00:00Z 50.88MB <physical> success [restore_to_time: 2025-12-27T01:00:02]
    2025-12-26T01:00:00Z 47.24MB <physical> success [restore_to_time: 2025-12-26T01:00:02]
    2025-12-25T01:00:00Z 46.98MB <physical> success [restore_to_time: 2025-12-25T01:00:01]
    2025-12-24T01:00:00Z 43.04MB <physical> success [restore_to_time: 2025-12-24T01:00:02]
    2025-12-23T01:00:00Z 42.65MB <physical> success [restore_to_time: 2025-12-23T01:00:02]
    2025-12-22T01:00:00Z 38.63MB <physical> success [restore_to_time: 2025-12-22T01:00:02]
    2025-12-21T01:00:00Z 37.97MB <physical> success [restore_to_time: 2025-12-21T01:00:02]
    2025-12-20T01:00:00Z 34.35MB <physical> success [restore_to_time: 2025-12-20T01:00:02]
  PITR chunks [1.48GB]:
    2025-12-20T01:00:03 - 2026-01-20T02:29:56
    2025-12-19T01:00:50 - 2025-12-20T01:00:02 (no base snapshot)

I couldn’t get pbm logs -x -s D -t 0 to run, but here is the output of kubectl logs from a pod which is currently exhibiting the discrepancy:

% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- df -h /data/db   
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  1.3G  3.7G  26% /data/db
% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- du -schx /data/db
648M	/data/db
648M	total

logs: gist:1d880a6bc36839e9a3ee205b3359fbee · GitHub

Then, when I run kill 1:

% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- kill 1
% kubectl exec -n example example-mongodb-rs0-1 -c backup-agent -- df -h /data/db   
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  716M  4.3G  15% /data/db

Hi,

Any chance of getting help with this?

The discrepancy between df (disk free) and du (disk usage) occurs because du only walks the file tree for visible files, while df reports the actual state of the filesystem. When a process holds a file handle open but the file is “deleted” from the directory structure, the space isn’t reclaimed until that process closes the handle or exits.

Before killing the agent next time, run this to see exactly which files are haunting your disk:

kubectl -n example exec <pod-name> -c backup-agent -- lsof +L1

or

kubectl -n example exec <pod-name> -c backup-agent -- ls -la /proc/1/fd | grep 'deleted'

This will likely show large temporary files in the /data/db path (or a sub-path used by PBM) that are marked as (deleted).
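If it helps, the underlying mechanics are easy to reproduce locally; this is purely illustrative and has no connection to PBM:

```shell
# Reproduce the df/du mismatch: a file that is unlinked while a process
# still holds it open keeps consuming space until the handle is closed.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=10 status=none  # allocate ~10 MiB
exec 8<"$tmp"          # keep a read handle open on fd 8
rm "$tmp"              # du no longer counts it; df still does
ls -l /proc/$$/fd/8    # the symlink target ends in "(deleted)"
exec 8<&-              # closing the handle lets the kernel reclaim the space
```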

Also, please check your RadosGW logs for 408 (Request Timeout) or 500 errors during the window when the disk usage starts to climb. And, if possible, update to the latest version of the operator.
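For the RadosGW check, something along these lines should surface them; note that the namespace, the deployment name (`deploy/rgw`), and the access-log format are assumptions about your setup:

```shell
# Scan the last 24h of RadosGW logs for request timeouts (408) and
# server errors (500) in access-log style lines.
kubectl -n objectstore logs deploy/rgw --since=24h \
  | grep -E '" (408|500) ' || echo "no 408/500 responses found"
```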

Hi @radoslaw.szulgo

I had the same suspicion, but I don’t believe the hypothesis is correct.

Please see this example from one of our clusters; there aren’t any (deleted) file handles open.

We don’t have any 408 or 500 errors in our RGW logs.

We are running the latest operator.

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- df -h /data/db         
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  3.3G  1.7G  66% /data/db

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- ls /proc | grep -E '^[0-9]+'
1
19
41

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- ls -la /proc/1/fd
total 0
dr-x------ 2 mongodb root  7 Jan 30 02:42 .
dr-xr-xr-x 9 mongodb root  0 Jan 30 02:42 ..
lr-x------ 1 mongodb root 64 Feb 23 02:30 0 -> pipe:[1478]
l-wx------ 1 mongodb root 64 Feb 23 02:30 1 -> pipe:[1479]
l-wx------ 1 mongodb root 64 Feb 23 02:30 2 -> pipe:[1480]
lr-x------ 1 mongodb root 64 Feb 23 02:30 3 -> /sys/fs/cgroup/cpu.max
lrwx------ 1 mongodb root 64 Feb 23 02:30 5 -> anon_inode:[eventpoll]
lrwx------ 1 mongodb root 64 Feb 23 02:30 6 -> anon_inode:[eventfd]
lrwx------ 1 mongodb root 64 Feb 23 02:30 9 -> anon_inode:[pidfd]

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- ls -la /proc/19/fd
total 0
dr-x------ 2 mongodb root 21 Feb 23 02:31 .
dr-xr-xr-x 9 mongodb root  0 Jan 30 02:43 ..
lr-x------ 1 mongodb root 64 Feb 23 02:31 0 -> pipe:[1478]
l-wx------ 1 mongodb root 64 Feb 23 02:31 1 -> /dev/null
lrwx------ 1 mongodb root 64 Feb 23 02:31 10 -> socket:[4170]
lrwx------ 1 mongodb root 64 Feb 23 02:31 11 -> socket:[4174]
lrwx------ 1 mongodb root 64 Feb 23 02:31 12 -> socket:[3229]
lrwx------ 1 mongodb root 64 Feb 23 02:31 13 -> socket:[4175]
lrwx------ 1 mongodb root 64 Feb 23 02:31 14 -> socket:[2230]
lrwx------ 1 mongodb root 64 Feb 23 02:31 15 -> socket:[4178]
lrwx------ 1 mongodb root 64 Feb 23 02:31 16 -> socket:[3232]
lrwx------ 1 mongodb root 64 Feb 23 02:31 17 -> socket:[3233]
lrwx------ 1 mongodb root 64 Feb 23 02:31 18 -> socket:[3236]
l-wx------ 1 mongodb root 64 Feb 23 02:31 2 -> pipe:[1480]
lrwx------ 1 mongodb root 64 Feb 23 02:31 20 -> socket:[240547]
lr-x------ 1 mongodb root 64 Feb 23 02:31 3 -> /sys/fs/cgroup/cpu.max
lrwx------ 1 mongodb root 64 Feb 23 02:31 4 -> socket:[4172]
lrwx------ 1 mongodb root 64 Feb 23 02:31 48 -> socket:[25589586]
lrwx------ 1 mongodb root 64 Feb 23 02:31 5 -> anon_inode:[eventpoll]
lrwx------ 1 mongodb root 64 Feb 23 02:31 6 -> anon_inode:[eventfd]
lrwx------ 1 mongodb root 64 Feb 23 02:31 7 -> socket:[4173]
lrwx------ 1 mongodb root 64 Feb 23 02:31 8 -> socket:[15151455]
lrwx------ 1 mongodb root 64 Feb 23 02:31 9 -> socket:[15151457]

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- kill 1

% kubectl -n example-dev exec example-dev-mongodb-rs0-2 -c backup-agent -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  762M  4.3G  15% /data/db

Can you check also this before killing?

kubectl exec -n example-dev <pod> -c backup-agent -- sh -c "find /proc/*/fd -ls | grep '(deleted)'"

and

kubectl exec -n example-dev <pod> -c backup-agent -- du -sh /tmp

What you can also do is either increase the storage (5GB might be too low) or try to add an emptyDir mount to the backup-agent container to offload the primary /data/db volume.

We’ll soon release operator 1.22.0, which better handles S3-compatible storages; please also give it a try via the dedicated minio storage type.

Thanks @radoslaw.szulgo

% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- df -h /data/db   
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  2.2G  2.9G  44% /data/db
% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- du -sh /tmp
4.0K  /tmp
% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- sh -c "find /proc/*/fd -ls | wc -l"
58
% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- sh -c "find /proc/*/fd -ls | grep del"
command terminated with exit code 1
% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- kill 1
% kubectl -n example exec example-mongodb-rs0-0 -c backup-agent -- df -h /data/db
Filesystem      Size  Used Avail Use% Mounted on
none            5.0G  817M  4.2G  17% /data/db

(I also examined the output of the find command manually to confirm it wasn’t an issue with my grep or find; I can confirm there are no open deleted file handles.)

What you can also do is either increase the storage (5GB might be too low)

These are just nonprod instances I am using to show the issue; the same problem is present regardless of disk size, and eventually the disk does fill up even on a much larger PVC.
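For what it’s worth, this is how I can sample the growth over time to correlate it with the backup/PITR schedule; the pod name and interval are just examples:

```shell
# Append a timestamped "used KB" sample for /data/db every 5 minutes.
while true; do
  used=$(kubectl -n example exec example-mongodb-rs0-1 -c mongod -- \
           df -P /data/db | awk 'NR==2 {print $3}')
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$used" >> df-samples.log
  sleep 300
done
```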

or try to add an emptyDir mount to the backup-agent container to offload the primary /data/db volume.

Not sure I understand what you mean here; can you clarify? If we mount an emptyDir at /data/db, how would the agent access the data?

We’ll soon release operator 1.22.0, which better handles S3-compatible storages; please also give it a try via the dedicated minio storage type.

We will of course upgrade as soon as it’s available. In the meantime, is there anything else I can do to debug this issue?