While creating a normal logical backup of a 1.3TB replica, its state changes to error after about 16 hours of running, with the following message:
check for concurrent jobs: getting pbm object: create PBM connection to mongo-rs0-2.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-0.mongo-rs0.mongo.svc.cluster.local:27017,mongo-rs0-1.mong-rs0.mongo.svc.cluster.local:27017: create mongo connection: create mongo client: failed to find CERTIFICATE
The backup job, however, still progresses and completes, but I’m unsure whether the backup it creates is valid (and I doubt it’s possible to restore from it without manually changing its status to success).
I think it may be related to this, and as @Sergey_Pronin says, the backups are good but are reported with an incorrect failed status. I haven’t tried restoring from those failed ones though.
Same problem here, and in my case it’s not related to this. I think it is a problem with how the certificates are deployed. I have two environments and I upgraded the operator from 1.9.0 to 1.14.0 in both. In development it works as expected and I get backups in S3, but in production I get this error. Comparing both environments, the only difference I’ve seen is that the certificates weren’t upgraded properly in production the way they were in development. I say this because in ArgoCD the objects of each app are different. In development I see two secret objects for certificates, but they don’t exist in production, so I’m guessing something is happening there.
@Semantic I’m talking about psmdb-backup. In other words - if you try a manual backup by creating a psmdb-backup resource - does it always error out, or is there a chance that it goes through?
The problem persists with operator 1.15.0 and pbm 2.3.0 - at least for manual backups.
The backup container completed the backup, but the backup object is in a failed state.
Hi all, I also got this error on the psmdb-backup resource, but judging by the mongod backup process, the backup itself seems to be OK. I run operator 1.15 and MongoDB backup 2.3.0.
Hi,
Same issue in my case: after two hours the backup was successfully created and saved to S3 storage, but the Backup resource had already switched to the error state after ~30 minutes.
As mentioned previously, those backups actually succeed but are marked with a false-positive error state. A real solution would involve codebase changes to prevent this certificate error from surfacing in the first place, so here’s a not-so-ideal workaround, but one you can use right now:
Install the edit-status kubectl plugin and edit the state of whichever backup you’d like to use.
kubectl edit-status psmdb-backup NAME
Update the status.state field from error to ready, which would allow you to use it for a restore.
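If you’d rather not edit the status interactively, the same change can probably be scripted. Below is a minimal Python sketch that only builds a kubectl patch command for the status subresource. It assumes your kubectl supports the --subresource flag (1.24 or newer) and that the psmdb-backup CRD exposes a status subresource, which the edit-status approach above suggests. The function name and the namespace and backup name in the usage comment are hypothetical.

```python
import json
import subprocess


def build_status_patch(backup_name, namespace, state="ready"):
    """Build the kubectl argv that sets status.state on a psmdb-backup object.

    Assumption: the CRD has a status subresource and kubectl >= 1.24 is used.
    """
    payload = json.dumps({"status": {"state": state}})
    return [
        "kubectl", "patch", "psmdb-backup", backup_name,
        "-n", namespace,
        "--subresource=status",
        "--type=merge",
        "-p", payload,
    ]


def apply_status_patch(backup_name, namespace, state="ready"):
    """Run the patch; raises CalledProcessError if kubectl fails."""
    subprocess.run(build_status_patch(backup_name, namespace, state), check=True)


# Hypothetical usage, only once you have confirmed the backup really completed:
# apply_status_patch("backup1", "mongo")
```

Only flip the state for backups you have verified as complete in the backup container, since this bypasses the operator’s own status tracking.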
I had to write a Python service that extracts backup statuses directly from the pbm container via the command pbm status -o json and, based on that, updates the states of the Backup resources in Kubernetes. Tricky, but it has been working for half a year.
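For anyone wanting to try something similar, here is a rough Python sketch of the parsing half of such a service, not the poster’s actual code. It assumes the JSON from pbm status -o json lists snapshots under backups.snapshot with name and status fields, that a PBM status of done corresponds to the operator’s ready state, and that the backup container is called backup-agent. Please check all three against your own deployment.

```python
import json
import subprocess

# Assumed mapping from PBM snapshot status to the Backup resource state.
PBM_TO_K8S_STATE = {"done": "ready", "error": "error"}


def fetch_pbm_status(pod, namespace, container="backup-agent"):
    """Run `pbm status -o json` in the backup container and return the raw JSON.

    The container name is an assumption; adjust it for your pods.
    """
    argv = ["kubectl", "exec", pod, "-n", namespace, "-c", container,
            "--", "pbm", "status", "-o", "json"]
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def snapshot_states(status_json):
    """Map each PBM snapshot name to the state its Backup resource should have.

    Snapshots with an unmapped status (e.g. still running) are skipped.
    """
    doc = json.loads(status_json)
    snapshots = (doc.get("backups") or {}).get("snapshot") or []
    states = {}
    for snap in snapshots:
        mapped = PBM_TO_K8S_STATE.get(snap.get("status"))
        if mapped and snap.get("name"):
            states[snap["name"]] = mapped
    return states
```

The resulting mapping still has to be matched to the psmdb-backup objects (by whatever field records the PBM backup name in your setup) before their status is updated.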