Psmdb-backup irregularly fails to complete due to various reasons

This topic is also cross-posted on GitHub #1637. I’m happy to delete it on either platform if it fits better on the other.

Description:

We’re running several psmdb clusters in multiple kubernetes clusters. For most of them the psmdb-backup failed at least once during their lifetime. We have three psmdb clusters where the backup irregularly fails around once a week.

We found a multitude of seemingly underlying issues that we can’t track down to infrastructure or faulty configuration.

We can’t find indicators for when a backup will fail and can’t see differences when a backup fails versus it working for multiple days back to back again.

The affected clusters differ is storage size and throughput, from some 10MB to a few GB. But nothing in the size of > 100GB.

We first noticed this problem in October 2023.

Symptoms we noticed

pbm-agent hangs

To debug the faulty pods, we open a shell to the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and needs to be killed from inside the container with

kill -2 $(grep -E "^Pid:" $(grep -l pbm-agent /proc/*[0-9]/status) | rev | cut -f1 | rev)

Election winner does nothing

Logs of the backup-agent container state that an election has stated for which agent is responsible for creating a backup. One gets elected and the others respect that, thus not starting a backup. Unfortunately sometimes the elected container also does not start the backup.

Stale lock in database

Mostly in combination with the above issue, sometime backup processes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.

Starting deadline exceeded

Other times the backup-agent logs Error: starting deadline exceeded also often creating a stale lock in the database.

disk error read/write on closed pipe / multipart upload failed

Please find logs of tonight’s backup below.

Steps to Reproduce:

  1. Create psmdb-cluster (3 nodes for our use-case) with psmdb-operator
  2. Configure daily backups to be written to cloud provider (e.g. AWS S3)
  3. Wait some time
  4. psmdb-backup will fail

Version:

  1. Kubernetes v1.28.11
  2. Operator chart version 1.16.2 / 1.16.3
  3. Database 1.16.3

Logs:

Tonights backup failed with

[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......]  <database>.<collection>  136762220/190760640  (71.7%)
2024-09-02T10:07:35.760+0000	Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000	Mux finish
2024-09-02T10:07:35.760+0000	archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
	upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock

Expected Result:

psmdb-backup being created consistently.

Actual Result:

psmbd-backup works for some days and then unexpectedly fails. After clearing stale locks (if they are present) and triggering a backup manually, it works again as expected.

Additional Information:

[Include any additional information that could be helpful to diagnose the issue, such as browser or device information]