PerconaServerMongoDBBackup | psmdb operator | server selection error

Hello,
psmdb-backup-status.txt (3.1 KB)

We have a single-shard MongoDB cluster deployed in Kubernetes with psmdb-operator-1.16.2. Backups are enabled, and they complete successfully for small data sizes of 5-10 GB. However, when a backup runs longer, for a database larger than 50 GB, we start getting “server selection error” followed by “socket was unexpectedly closed” for the cfg replset.

We face this issue with both backup types, logical as well as physical.

Below is the backup config YAML:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  finalizers:
  - percona.com/delete-backup
  name: manual-bkp-test1-20241122
spec:
  clusterName: test1
  storageName: gcs
  type: physical

Attached is the psmdb-backup status, where you will find the error details. On checking the backup-agent logs for the cfg and rs0 replsets, we see that the backup completes successfully; it is only the operator that shows the backup status as error.

$ kubectl get psmdb-backup manual-bkp-test1-20241122
NAME                        CLUSTER    STORAGE   DESTINATION                                                   TYPE       STATUS   COMPLETED   AGE
manual-bkp-test1-20241122   test1      gcs       s3://test1-mdb-backups/test1-test1-mdb/2024-11-22T18:23:53Z   physical   error                97m
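
For reference, here is a rough sketch of how the recorded error and the agent logs can be pulled side by side. The helper name `show_backup_error` is hypothetical, and the pod name `test1-cfg-0` and container name `backup-agent` are assumptions about this deployment rather than values taken from the attached files.

```shell
# Sketch only: prints the backup object's full status and recent agent logs.
# show_backup_error is a hypothetical helper, not part of the operator or PBM.
show_backup_error() {
  local backup="$1"   # name of the psmdb-backup object
  local pod="$2"      # pod to read agent logs from (assumed naming, e.g. test1-cfg-0)

  # Full object, including the status/error fields set by the operator
  kubectl get psmdb-backup "$backup" -o yaml

  # Last 200 log lines from the agent sidecar (container name is an assumption)
  kubectl logs "$pod" -c backup-agent --tail=200
}
```

Usage would be along the lines of `show_backup_error manual-bkp-test1-20241122 test1-cfg-0`.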

Is there any backup configuration where we can increase the socket timeout to overcome this error?

Thanks

Harsh

@hparikh

Did you check the health of the backend DB nodes around that time, to confirm there were no network or other hiccups/blockers?

Maybe you can share the mongod pod logs and description for more insight?

kubectl logs pod_name
kubectl describe pod pod_name

Is there any specific time period when this issue happens?

Can you please share the information below as well, so we can check further on the PBM side?

sudo pbm status > pbm.status;
sudo pbm list > list.out;
sudo pbm logs -t0 > logs.out;
sudo pbm version > version.out;
sudo pbm config --list > conf.out;
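
On an operator-managed cluster PBM runs in a sidecar container, so these commands would typically go through `kubectl exec` rather than `sudo`. A minimal sketch, assuming the sidecar container is named `backup-agent` and the pod follows a name like `test1-rs0-0` (both are assumptions, and `collect_pbm_diagnostics` is a hypothetical helper name):

```shell
# Sketch only: runs the PBM commands listed above inside the agent container
# and writes each output to a file in the current directory.
collect_pbm_diagnostics() {
  local pod="$1"                        # e.g. test1-rs0-0 (assumed pod name)
  local container="${2:-backup-agent}"  # assumed sidecar container name

  kubectl exec "$pod" -c "$container" -- pbm status        > pbm.status
  kubectl exec "$pod" -c "$container" -- pbm list          > list.out
  kubectl exec "$pod" -c "$container" -- pbm logs -t0      > logs.out
  kubectl exec "$pod" -c "$container" -- pbm version       > version.out
  kubectl exec "$pod" -c "$container" -- pbm config --list > conf.out
}
```

For example, `collect_pbm_diagnostics test1-rs0-0` would leave the five output files ready to attach.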

How much memory and CPU are allocated?

Hi @anil.joshi

Thank you for your kind feedback. We checked the health of the MongoDB cluster for cfg, rs0, and mongos; they are all healthy. We don’t see any errors in the mongod pods (containers) that could cause the backup agent to hit a socket timeout.

This issue happens every time we run the backups.

Please find attached all the requested output. If you think anything else is required, kindly let me know.

pbm_queries.log (24.0 KB)
pod-describe.log (12.1 KB)
psmdb-yaml.log (10.4 KB)

Best,

Harsh