Hi,
I am seeing the messages below in the logs while restoring backups to a new cluster on version 2.2.1.
Also, the full cluster restore took ~24 hours, whereas it took ~18 hours on the older version (2.0.5).
2023-11-13T14:57:27Z W [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] retryChunk got copy: context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 0s
2023-11-13T14:57:27Z I [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] session recreated, resuming download
[mongod@ip-10-80-11-188 ~]$ pbm status
Cluster:
========
shard3ReplSet:
- shard3ReplSet/10.80.11.188:27038 [P]: pbm-agent v2.2.1 OK
configReplSet:
- configReplSet/10.80.11.0:27039 [P]: pbm-agent v2.2.1 OK
shard1ReplSet:
- shard1ReplSet/10.80.11.0:27038 [P]: pbm-agent v2.2.1 OK
shard2ReplSet:
- shard2ReplSet/10.80.11.40:27038 [P]: pbm-agent v2.2.1 OK
PITR incremental backup:
========================
Status [OFF]
Currently running:
==================
(none)
Backups:
========
S3 us-east-1 s3://cm-mongo-db-shared-prod-va/percona/backup/
Snapshots:
2023-11-11T01:00:02Z 2.24TB <logical> [restore_to_time: 2023-11-11T12:53:21Z]
Is the issue repeating every time? Did you try running the restore again?
2023-11-13T14:57:27Z W [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] retryChunk got copy: context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 0s
2023-11-13T14:57:27Z I [shard1ReplSet/10.80.11.0:27038] [restore/2023-11-13T09:29:21.900315695Z] session recreated, resuming download
Was the network stable, and was the target cluster healthy during the activity? Did you observe anything unusual in the MongoDB or system/kernel logs?
Still, to expedite the restore you can tune the parallel download settings depending on your hardware resources and database load. To do so, edit the PBM configuration as below:
- numDownloadWorkers - the number of workers that download data from the storage. By default, it equals the number of CPU cores.
- maxDownloadBufferMb - the maximum size of the memory buffer that stores downloaded data chunks for decompression and ordering. It is calculated as numDownloadWorkers * downloadChunkMb * 16.
- downloadChunkMb - the size of the data chunk to download (32 MB by default).
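For example, the settings above can be applied with `pbm config`. The values here are illustrative only (assuming an 8-core host), not recommendations; tune them to your own hardware and load:

```shell
# Hypothetical values for an 8-core host -- adjust to your environment.
# numDownloadWorkers defaults to the number of CPU cores.
pbm config --set restore.numDownloadWorkers=8

# Buffer sized as numDownloadWorkers * downloadChunkMb * 16 = 8 * 32 * 16
pbm config --set restore.maxDownloadBufferMb=4096

# Chunk size; 32 MB is the default.
pbm config --set restore.downloadChunkMb=32

# Verify the resulting configuration.
pbm config
```

Alternatively, you can put these under the `restore:` section of a YAML config file and apply it with `pbm config --file <file>`.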
I also get these errors from time to time while I’m doing test restores:
pbm-agent[57935]: 2025-06-12T10:05:22.000+0000 W [restore/2025-06-12T09:51:51.921935763Z] failed to download chunk 1090519040-1098907647
pbm-agent[57935]: 2025-06-12T10:05:22.000+0000 W [restore/2025-06-12T09:51:51.921935763Z] retryChunk got failed to download chunk 1090519040-1098907647 (of 6006484807) after 2 retries: copy: context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
[…]
pbm-agent[57935]: 2025-06-12T11:56:28.000+0000 W [restore/2025-06-12T11:41:04.460428581Z] failed to download chunk 6006243328-6014631935
I’m wondering whether that means some data from the backup will not be restored, or whether the downloads are simply retried and there’s no risk that the restored DB will be inconsistent?