You’re using the hot backup feature of Percona Server for MongoDB, I see. I’ve moved this post from the Percona Backup for MongoDB category - that is an external backup tool capable of cluster-consistent backups with point-in-time recovery - to this forum channel for PSMDB.
Thanks for the log snippet, that’s helpful. I’ll take a look.
I printed out the AWS error messages from the JSON mongod log, which was already filtered to that time window and to “c”: “STORAGE” messages only, using:
jq -r .attr.statement [json mongodb log file] > aws_msg_only.txt
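(For anyone following along: the same extraction can be done in one step against the unfiltered log by doing the “STORAGE” selection inside jq itself. The log file name below is just a placeholder.)
jq -r 'select(.c == "STORAGE") | .attr.statement // empty' mongod.log > aws_msg_only.txt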
[ERROR] 2021-02-12 13:59:19 CurlHttpClient [235235678932566] Curl returned error code 55 - Failed sending data to the peer
[ERROR] 2021-02-12 13:59:19 AWSClient [235235678932566] HTTP response code: 0
Exception name:
Error message: Unable to connect to endpoint
0 response headers:
[WARN] 2021-02-12 13:59:19 AWSClient [235235678932566] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
[WARN] 2021-02-12 13:59:19 AWSClient [235235678932566] Request failed, now waiting 0 ms before attempting again.
[INFO] 2021-02-12 13:59:19 CurlHandleContainer [235235678932566] Connection has been released. Continuing.
The backup command seems to have started at 13:49:21 and was fine until 13:59:19.
The AWS library did start the connection and seems to have proceeded OK for 9.5 minutes. And you mentioned that some files were already copied. So, to start throwing some ideas around: it looks like a transient error, or maybe a timeout enforced by something in the OS or network that applies strict firewall rules?
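To help rule out the basic-connectivity side of that theory, it may be worth hitting the same S3 endpoint from the same host, outside of mongod, with curl. The bucket and region below are placeholders; even an AccessDenied response would prove TCP and TLS reach the endpoint, whereas no response at all would point at the network.
curl -sv -o /dev/null https://my-bucket.s3.eu-west-1.amazonaws.com/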
After web-searching around for a bit I didn’t find an exact match for the symptoms, but the AWS configuration (the region, whether SSE is being used, etc.) may also be a factor, as could a genuine clock-skew issue, as one of the log lines suggests.
At any rate, the createBackup command of Percona Server for MongoDB has moved into a function where it iterates through the WiredTiger files to copy, writes them to the S3 endpoint, and at some point that (network) IO destination fails. So focusing on the AWS error messages is the right direction, I think.
max_attempts = 20
retry_mode = standard
tcp_keepalive = true
More files were uploaded and the error changed:
{"t":{"$date":"2021-02-15T16:53:02.915+02:00"},"s":"I", "c":"STORAGE", "id":29011, "ctx":"conn200","msg":"{statement}","attr":{"statement":"[ERROR] 2021-02-15 14:53:02 AWSClient [139775339403008] HTTP response code: 400\nException name: \nError message: No response body.\n4 response headers:\nconnection : close\ncontent-length : 0\ndate : Mon, 15 Feb 21 14:53:02 GMT\nx-amz-request-id : 5E4D23340605093F\n"}}
{"t":{"$date":"2021-02-15T16:53:02.915+02:00"},"s":"I", "c":"STORAGE", "id":29011, "ctx":"conn200","msg":"{statement}","attr":{"statement":"[WARN] 2021-02-15 14:53:02 AWSClient [139775339403008] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.\n"}}
{"t":{"$date":"2021-02-15T16:53:02.915+02:00"},"s":"I", "c":"STORAGE", "id":29011, "ctx":"conn200","msg":"{statement}","attr":{"statement":"[INFO] 2021-02-15 14:53:02 CurlHandleContainer [139775339403008] Cleaning up CurlHandleContainer.\n"}}
The retry and tcp_keepalive setting change attempts were a good idea.
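(For reference, I’m assuming those retry settings were put into the shared AWS config file, something like the snippet below under the default profile. Whether the AWS C++ SDK embedded in mongod honours every one of those keys is not something I have verified.)
[default]
max_attempts = 20
retry_mode = standard
tcp_keepalive = true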
The status as a MongoDB issue is still the same though - the createBackup command is presumably sending another chunk of a file to the AWS bucket via an HTTP request, and that HTTP request fails.
The first log sample showed: Curl returned error code 55 - Failed sending data to the peer, followed by “HTTP response code: 0” (which is an N/A code). I read this as: TCP was blocked or interrupted before the HTTP daemon on the other side (in AWS) was even reached.
The second log sample showed:
HTTP response code: 400
Exception name:
Error message: No response body
This shows TCP connectivity occurred but the other side had an error, or the request was deliberately rejected by some security policy (giving no reason when rejecting a request is a standard security tactic in lots of server software).
So it’s a mixed bag still.
Originally I was focused on “If the signature check failed. This could be because of a time skew” because I have had experience (unrelated to MongoDB) with time-based tokens etc. that were rejected because of clock skew. But the “If” at the start of that message deserves attention - we haven’t seen any other error message saying clearly that a signature check failed.
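If you want to rule the clock-skew theory out completely, one quick check is to compare the host’s UTC time against the Date header that S3 returns. The regional endpoint below is a placeholder; a skew of more than a few minutes would be worth fixing regardless.
date -u
curl -sI https://s3.eu-west-1.amazonaws.com/ | grep -i '^date:'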
Q. The copy of multiple files begins and some succeed at first, right? I ask this to confirm that it is not a rejection of the very first connection.
Basically I think the AWS S3 HTTP requests are being lost due to the network, and that if you iterated through the files in the dbPath directory and sent them one by one using “aws s3 cp …”(*) you would encounter the same error.
The AWS library being used in PSMDB is the C++ SDK from AWS rather than the Python one used by the AWS CLI, so they’re not identical. But I have faith that they’d both be doing this most common type of S3 operation correctly, even if not in exactly the same way.
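If you want to try reproducing it outside of mongod, a rough sketch of that file-by-file upload is below. The dbPath and bucket name are placeholders, and this only exercises the network/upload path - it is not a consistent backup.
# placeholders: /var/lib/mongodb = dbPath, my-backup-bucket = target bucket
find /var/lib/mongodb -type f | while read -r f; do
  aws s3 cp "$f" "s3://my-backup-bucket/nettest${f}" || { echo "failed on $f"; break; }
done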
You could attempt to increase the information put in the log whilst it runs by raising the log level to 2. This will produce a lot more messages, because there are many debug log calls at level 2, even if we restrict ourselves to just the “STORAGE” component. (You can search for “LOGV2” in the code of hotBackup() to see examples of the messages I mean.)
db.setLogLevel(2, "storage")
db.adminCommand({ createBackup: 1, .... });
db.setLogLevel(-1, "storage"); /* Revert storage to global log level again. That is 0 by default */
// I strongly suggest running the 'db.setLogLevel(-1, "storage");' command from
// another shell instance after 20 mins at most, rather than waiting for
// createBackup to finish normally. If it takes an hour, for example, the log file may be tens of GB! Log level 2 is a danger of its own.
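To confirm the verbosity really did go back to inheriting the global level afterwards, checking the per-component settings from the shell should do it (a small sketch, assuming the legacy mongo shell and whatever connection/auth options you normally use):
mongo --quiet --eval 'printjson(db.getLogComponents().storage)'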
But I’m not hopeful this will provide any new perspective. The error messages you identified at the start (e.g. CurlHttpClient … Curl returned error code 55 - Failed sending data to the peer) will, I believe, only appear again next to otherwise-normal AWS commands. The curl library is telling us that the HTTP request was abruptly terminated outside of the mongod process.
* Or “aws s3api put-object …”, or “aws s3api create-multipart-upload”+“aws s3api upload-part …”
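For example, the single-request form of that would be roughly the following; bucket, key and file name are placeholders.
aws s3api put-object --bucket my-backup-bucket --key nettest/collection-0.wt --body ./collection-0.wt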