Psmdb-backup irregularly fails to complete due to various reasons

Michael_Schmid · September 3, 2024, 8:28am

This topic is also cross-posted on GitHub #1637. I’m happy to delete it on either platform if it fits better on the other.

Description:

We’re running several psmdb clusters in multiple kubernetes clusters. For most of them the psmdb-backup failed at least once during their lifetime. We have three psmdb clusters where the backup irregularly fails around once a week.

We found a multitude of seemingly underlying issues that we can’t track down to infrastructure or faulty configuration.

We can’t find indicators for when a backup will fail and can’t see differences when a backup fails versus it working for multiple days back to back again.

The affected clusters differ is storage size and throughput, from some 10MB to a few GB. But nothing in the size of > 100GB.

We first noticed this problem in October 2023.

Symptoms we noticed

pbm-agent hangs

To debug the faulty pods, we open a shell to the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and needs to be killed from inside the container with

kill -2 $(grep -E "^Pid:" $(grep -l pbm-agent /proc/*[0-9]/status) | rev | cut -f1 | rev)

Election winner does nothing

Logs of the backup-agent container state that an election has stated for which agent is responsible for creating a backup. One gets elected and the others respect that, thus not starting a backup. Unfortunately sometimes the elected container also does not start the backup.

Stale lock in database

Mostly in combination with the above issue, sometime backup processes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.

Starting deadline exceeded

Other times the backup-agent logs Error: starting deadline exceeded also often creating a stale lock in the database.

disk error read/write on closed pipe / multipart upload failed

Please find logs of tonight’s backup below.

Steps to Reproduce:

Create psmdb-cluster (3 nodes for our use-case) with psmdb-operator
Configure daily backups to be written to cloud provider (e.g. AWS S3)
Wait some time
psmdb-backup will fail

Version:

Kubernetes v1.28.11
Operator chart version 1.16.2 / 1.16.3
Database 1.16.3

Logs:

Tonights backup failed with

[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......]  <database>.<collection>  136762220/190760640  (71.7%)
2024-09-02T10:07:35.760+0000	Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000	Mux finish
2024-09-02T10:07:35.760+0000	archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock

Expected Result:

psmdb-backup being created consistently.

Actual Result:

psmbd-backup works for some days and then unexpectedly fails. After clearing stale locks (if they are present) and triggering a backup manually, it works again as expected.

Additional Information:

[Include any additional information that could be helpful to diagnose the issue, such as browser or device information]

Topic		Replies	Views
PSMDB Backup Failing [check cluster for dump done: convergeCluster: backup on shard rs2 failed with: %!s(<nil>)] Percona Backup for MongoDB	2	100	April 23, 2025
PSMDB 1.14.0. PBM (backup) exit code -1 Percona Operator for MongoDB	1	960	July 20, 2023
PSMDB Cannot backup cluster: "Waiting for backup metadata" Percona Backup for MongoDB mongodb	2	913	November 26, 2021
Backup fails with spec psmdbCluster field is empty Percona Server for MongoDB	2	638	July 22, 2024
All Backups Error Percona Operator for MongoDB	4	1379	March 21, 2022