Note: I was unable to create a Jira ticket directly (permission error on create, even when logged in). If a team member could file this under the PSMDB Jira project or enable me to do so, that would be appreciated.
Hi,
We’re hitting a regression in createBackup when streaming to Google Cloud Storage via the S3-compatible API. Every CompleteMultipartUpload request is rejected with HTTP 400.
Affected versions: 8.0.20-8, 8.0.21-9, 8.0.23-10, 7.0.31-17, 7.0.32-18, 7.0.34-19
Last working versions: 8.0.19-7 and 7.0.30-16
The error from the mongod log:
[ERROR] TransferManager: Transfer handle [...] Failed to complete multi-part upload.
Bucket: [our-gcs-bucket]
Key: [backup-path/local/collection/14-3454594855776759712.wt]
HTTP response code: 400
Exception name: MalformedCompleteMultipartUploadRequest
Error message: The complete multipart upload request XML you provided
was not well-formed or did not validate against our published schema.
Single-part uploads (PutObject) succeed. Multipart part uploads (UploadPart) also succeed. Only the CompleteMultipartUpload XML finalization is rejected by GCS. After one file fails, all other in-progress transfers are cancelled and the entire backup aborts.
How we confirmed the regression:
A control instance on 8.0.19-7 backs up successfully to the same GCS bucket with the same credentials and the same endpoint. Other instances that pulled 8.0-latest (resolved to 8.0.23-10 after a pod restart on June 9) started failing immediately. The same happened with 7.0-latest, which resolved to 7.0.34-19.
Approximately 133 of our MongoDB instances across multiple environments are affected. Different files fail on different instances (collection files, index files, journal files), which points to the XML serialization rather than anything file-specific.
Root cause (source code analysis):
PSMDB-1892 upgraded the vendored aws-sdk-cpp from 1.9.379 to 1.11.471 for SBOM compliance. This landed in 8.0.20-8 and 7.0.31-17 — exactly matching the first broken versions.
The new SDK’s TransferManagerConfiguration introduces a checksumAlgorithm field that defaults to CRC32:
// aws-sdk-cpp 1.11.471 — TransferManager.h line 148
Aws::S3::Model::ChecksumAlgorithm checksumAlgorithm = S3::Model::ChecksumAlgorithm::CRC32;
PSMDB’s backup code in wiredtiger_kv_engine.cpp sets computeContentMD5 = true but does not override checksumAlgorithm. During UploadPart, computeContentMD5 correctly overrides the checksum to NOT_SET - so parts upload without requesting CRC32 from the server. GCS doesn’t return any CRC32 header.
However, the part completion callback still checks checksumAlgorithm (still CRC32) and calls GetChecksumCRC32() on the response - getting an empty string "". When assembling the CompleteMultipartUpload XML, SetChecksumCRC32("") is called on each CompletedPart, which unconditionally sets m_checksumCRC32HasBeenSet = true. The XML serializer then emits:
<CompleteMultipartUpload xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Part>
<ETag>"abc123..."</ETag>
<ChecksumCRC32></ChecksumCRC32> <!-- spurious empty element -->
<PartNumber>1</PartNumber>
</Part>
</CompleteMultipartUpload>
GCS rejects this because <ChecksumCRC32> is not part of its CompleteMultipartUpload schema. AWS S3 likely accepts it because it’s more lenient with additional XML elements.
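To make the mechanism easier to follow, here is a small self-contained model of the behaviour described above. It is only a sketch, not the SDK's code: CompletedPartModel and BuildPartXml are hypothetical names invented for illustration, and the serializer is reduced to string concatenation. The one thing it is meant to mirror is that the setter flips the "has been set" flag even when given an empty string.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the SDK's CompletedPart model (not the real class).
class CompletedPartModel {
public:
    void SetETag(const std::string& value) { etag_ = value; }

    void SetPartNumber(int number) {
        assert(number >= 1);  // S3 part numbers start at 1
        partNumber_ = number;
    }

    // Mirrors the behaviour described in the post: the flag is set
    // unconditionally, even if the value is an empty string.
    void SetChecksumCRC32(const std::string& value) {
        checksumCRC32_ = value;
        checksumCRC32HasBeenSet_ = true;
    }

    // Simplified serializer: emits the checksum element whenever the flag is set.
    std::string ToXml() const {
        std::string xml = "<Part><ETag>" + etag_ + "</ETag>";
        if (checksumCRC32HasBeenSet_) {
            xml += "<ChecksumCRC32>" + checksumCRC32_ + "</ChecksumCRC32>";
        }
        xml += "<PartNumber>" + std::to_string(partNumber_) + "</PartNumber></Part>";
        return xml;
    }

private:
    std::string etag_;
    std::string checksumCRC32_;
    bool checksumCRC32HasBeenSet_ = false;
    int partNumber_ = 1;
};

// Hypothetical helper: builds one part, optionally calling the checksum setter
// with whatever the UploadPart response contained (empty for GCS).
std::string BuildPartXml(const std::string& responseCrc32, bool callChecksumSetter) {
    CompletedPartModel part;
    part.SetETag("\"abc123\"");
    part.SetPartNumber(1);
    if (callChecksumSetter) {
        part.SetChecksumCRC32(responseCrc32);
    }
    return part.ToXml();
}
```

In this model, BuildPartXml("", true) produces a Part containing an empty ChecksumCRC32 element, while BuildPartXml("", false) omits the element entirely. The second shape is what GCS appears to expect.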
Suggested fix:
TransferManagerConfiguration trManConf(executor.get());
trManConf.s3Client = s3_client;
trManConf.computeContentMD5 = true;
trManConf.checksumAlgorithm = S3::Model::ChecksumAlgorithm::NOT_SET; // add this line
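Why the one-line change should help can be sketched with a toy model of the two decision points described in the analysis. This is an assumption-laden illustration, not the SDK's implementation: TransferConfigModel, RequestChecksum and CompletionCallsCrc32Setter are hypothetical names, and the logic only encodes what the post states (computeContentMD5 overrides the request checksum, while the completion callback looks at the configured checksumAlgorithm).

```cpp
#include <cassert>

// Hypothetical model of the relevant enum values.
enum class ChecksumAlgorithmModel { NOT_SET, CRC32 };

// Hypothetical model of the two configuration fields discussed in the post.
struct TransferConfigModel {
    bool computeContentMD5 = false;
    // Default mirrors the CRC32 default quoted from TransferManager.h.
    ChecksumAlgorithmModel checksumAlgorithm = ChecksumAlgorithmModel::CRC32;
};

// Checksum requested on UploadPart: per the post, computeContentMD5
// overrides the configured algorithm to NOT_SET.
ChecksumAlgorithmModel RequestChecksum(const TransferConfigModel& config) {
    return config.computeContentMD5 ? ChecksumAlgorithmModel::NOT_SET
                                    : config.checksumAlgorithm;
}

// Part-completion callback: per the post, it consults the configured
// algorithm rather than the one actually used for the request.
bool CompletionCallsCrc32Setter(const TransferConfigModel& config) {
    return config.checksumAlgorithm == ChecksumAlgorithmModel::CRC32;
}

// True when the callback would read a CRC32 that was never requested,
// which is the mismatch that leads to the empty XML element.
bool HasChecksumMismatch(const TransferConfigModel& config) {
    return RequestChecksum(config) == ChecksumAlgorithmModel::NOT_SET &&
           CompletionCallsCrc32Setter(config);
}
```

With computeContentMD5 = true and the default algorithm, HasChecksumMismatch returns true. After setting checksumAlgorithm to NOT_SET, as in the suggested fix, it returns false because the callback no longer touches the CRC32 field.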
Steps to reproduce:
- Deploy PSMDB 8.0.20-8 or later (or 7.0.31-17 or later)
- Configure createBackup with S3 parameters pointing to a GCS bucket via storage.googleapis.com
- Run createBackup on any database with files larger than the multipart threshold (~5 MB)
- Observe MalformedCompleteMultipartUploadRequest in the mongod logs
Related Jira tickets:
- PSMDB-1892 — the aws-sdk-cpp upgrade that introduced this regression
- PSMDB-731 — added multipart upload support for GCS in createBackup
- PSMDB-715 — original EntityTooLarge issue that motivated multipart uploads
Has anyone else hit this with GCS or other S3-compatible backends (MinIO, Ceph RGW)?
Disclosure: The source code analysis was performed with AI assistance and verified by a human engineer. The problem is real and actively impacts our production environment.