This topic is also cross-posted on GitHub #1637. I’m happy to delete it on either platform if it fits better on the other.
Description:
We’re running several psmdb clusters in multiple Kubernetes clusters. For most of them, psmdb-backup has failed at least once during their lifetime. We have three psmdb clusters where the backup fails irregularly, roughly once a week.
We found several seemingly related underlying issues that we cannot trace back to infrastructure problems or faulty configuration.
We cannot find indicators for when a backup will fail, and we see no difference between a run that fails and one that then works for days back to back.
The affected clusters differ in storage size and throughput, ranging from around 10 MB to a few GB; none exceeds 100 GB.
We first noticed this problem in October 2023.
Symptoms we noticed
pbm-agent hangs
To debug the faulty pods, we open a shell in the pbm-agent container and issue pbm commands. Sometimes the pbm-agent is unresponsive and has to be killed from inside the container with
kill -2 $(grep -E "^Pid:" $(grep -l pbm-agent /proc/*[0-9]/status) | rev | cut -f1 | rev)
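The one-liner above is hard to read at a glance; the same PID lookup can be sketched more explicitly as follows (the helper name `pid_from_status` is our own, and this assumes a standard /proc layout inside the container):

```shell
# Sketch of the PID lookup used in the one-liner above.
# Reads the "Pid:" field from a /proc/<n>/status file.
pid_from_status() {
  awk '/^Pid:/ {print $2}' "$1"
}

# Find the status file of the pbm-agent process and send it SIGINT (kill -2):
#   status_file=$(grep -l pbm-agent /proc/[0-9]*/status | head -n 1)
#   kill -2 "$(pid_from_status "$status_file")"
```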
Election winner does nothing
The logs of the backup-agent container show that an election has started to determine which agent is responsible for creating the backup. One agent gets elected and the others respect that, so they do not start a backup. Unfortunately, sometimes the elected agent does not start the backup either.
Stale lock in database
Mostly in combination with the issue above, backup processes sometimes stop during execution and leave a stale lock in the database, preventing subsequent backup jobs from creating new backups.
Starting deadline exceeded
Other times the backup-agent logs Error: starting deadline exceeded, often also leaving a stale lock in the database.
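We have not verified whether raising PBM’s starting-status timeout helps here; for completeness, this is the knob we believe controls that deadline (setting name per PBM’s configuration reference, value is an arbitrary example):

```yaml
# Sketch: PBM config fragment, applied with `pbm config --file <file>`.
# backup.timeouts.startingStatus is, to our understanding, the deadline
# behind "starting deadline exceeded"; 120 is an arbitrary example value.
backup:
  timeouts:
    startingStatus: 120   # seconds
```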
Disk error: read/write on closed pipe / multipart upload failed
Please find logs of tonight’s backup below.
Steps to Reproduce:
- Create psmdb-cluster (3 nodes for our use-case) with psmdb-operator
- Configure daily backups to be written to cloud provider (e.g. AWS S3)
- Wait some time
- psmdb-backup will fail
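Our backup configuration follows the standard operator CR layout; a minimal sketch of the relevant section (bucket name, storage name, schedule, and region here are placeholders, not our real values):

```yaml
# Minimal sketch of the CR backup section (placeholder names and values).
backup:
  enabled: true
  storages:
    s3-backup:
      type: s3
      s3:
        bucket: my-backup-bucket
        region: eu-central-1
        credentialsSecret: my-cluster-backup-s3
  tasks:
    - name: daily-backup
      enabled: true
      schedule: "0 3 * * *"   # daily at 03:00 UTC
      storageName: s3-backup
```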
Version:
- Kubernetes v1.28.11
- Operator chart version 1.16.2 / 1.16.3
- Database 1.16.3
Logs:
Tonight’s backup failed with:
[...]
2024-09-02T10:07:28.000+0000 I [pitr] oplog slicer is paused for lock [Snapshot backup, opid: 66d58445d58df8df2b89d091]
2024/09/02 10:07:35 [#################.......] <database>.<collection> 136762220/190760640 (71.7%)
2024-09-02T10:07:35.760+0000 Mux close namespace <database>.<collection>
2024-09-02T10:07:35.760+0000 Mux finish
2024-09-02T10:07:35.760+0000 archive writer: error writing data for collection `<database>.<collection>` to disk: error writing to file: short write / io: read/write on closed pipe
2024-09-02T10:07:35.000+0000 I [backup/2024-09-02T09:24:21Z] dropping tmp collections
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] created chunk 2024-09-02T10:04:25 - 2024-09-02T10:07:35
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark RS as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 I [backup/2024-09-02T09:24:21Z] mark backup as error `mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )`: <nil>
2024-09-02T10:07:36.000+0000 E [backup/2024-09-02T09:24:21Z] backup: mongodump: decompose: archive parser: corruption found in archive; ParserConsumer.BodyBSON() ( "<database>.<collection>": write: upload: "<database>.<collection>": upload to S3: MultipartUpload: upload multipart failed
upload id: <upload-id>
caused by: RequestError: send request failed
caused by: Put "<upload-url>": write tcp <upload-source>-><upload-target>: write: connection timed out )
2024-09-02T10:07:36.000+0000 D [backup/2024-09-02T09:24:21Z] releasing lock
Expected Result:
psmdb-backups are created consistently.
Actual Result:
psmdb-backup works for some days and then fails unexpectedly. After clearing stale locks (if present) and triggering a backup manually, it works again as expected.