Pbm mongo restore on aks

Hello,
we have installed percona mongodb with pbm agents on aks.
3 nodes/pods, 1 primary 2 secondary, each node/pod has its pbm-agent installed
mongo version 4.4.6-8
nodes: 3
sharding: no
replica: yes
pitr: no
we did succesfully a backup of the entire cluster and saved it on Minio.
the dump backup file size is 15gb, gz compressed. Now we want to restore it using the manifest

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
name: percona-store-mongo-restore-20211213-d-v1
namespace: percona-mongo-dev
spec:
backupName: cron-percona-store-mo-20211213000013-9gvsg
clusterName: percona-store-mongo

but we got always errors like
2021-12-13T11:18:32.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-13T11:18:32.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
2021-12-13T11:18:33.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download

from aks resource manifest
check cluster for restore dump done: convergeCluster: restore on the shard
rs0 failed with: restore mongo dump (successes: 5767909 / fails: 0):
xxxx: error restoring from archive on stdin: reading
bson input: error demultiplexing archive; archive io error

can someone help us understanding where could be the problem?
Could be a problem of download time from minio?

1 Like

Hi @salvo,

Could you please share the operator logs and pbm-agent logs from each pod?

$ kubectl logs <operator-pod>
$ kubectl logs <cluster>-rs0-0 pbm-agent # repeat for each rs0 pod
1 Like

Hello @egegunes ,
here they are
operator

{“level”:“info”,“ts”:1639468601.797091,“logger”:“controller_perconaservermongodbrestore”,“msg”:“waiting for mongos termination”}
{“level”:“info”,“ts”:1639468601.8832114,“logger”:“controller_perconaservermongodbrestore”,“msg”:“waiting for mongos termination”}
{“level”:“info”,“ts”:1639468603.1849604,“logger”:“controller_psmdb”,“msg”:“balancer disabled”}
{“level”:“info”,“ts”:1639468605.5126054,“logger”:“controller_perconaservermongodbrestore”,“msg”:“Waiting for restore metadata”,“PBM name”:“2021-12-14T07:56:45.349643054Z”,“restore”:“percona-store-mongo-restore-20211213-d-v1”,“backup”:“cron-percona-store-mo-20211213000013-9gvsg”}

rs0-0

2021-12-14T07:56:45.000+0000 I got command restore [name: 2021-12-14T07:56:45.349643054Z, backup name: 2021-12-13T00:00:23Z] <ts: 1639468605>
2021-12-14T07:56:45.000+0000 I got epoch {1639468604 2}
2021-12-14T07:56:45.000+0000 I [restore/2021-12-13T00:00:23Z] node is not primary so it unsuitable to do restore

rs0-1

2021-12-14T07:56:45.000+0000 I got command restore [name: 2021-12-14T07:56:45.349643054Z, backup name: 2021-12-13T00:00:23Z] <ts: 1639468605>
2021-12-14T07:56:45.000+0000 I got epoch {1639468604 2}
2021-12-14T07:56:45.000+0000 I [restore/2021-12-13T00:00:23Z] node is not primary so it unsuitable to do restore

rs0-2

2021-12-14T00:00:21.000+0000 I got command backup [name: 2021-12-14T00:00:20Z, compression: gzip] <ts: 1639440020>
2021-12-14T00:00:21.000+0000 I got epoch {1639440020 1}
2021-12-14T00:00:38.000+0000 D [backup/2021-12-14T00:00:20Z] skip after nomination, probably started by another node
2021-12-14T07:56:45.000+0000 I got command restore [name: 2021-12-14T07:56:45.349643054Z, backup name: 2021-12-13T00:00:23Z] <ts: 1639468605>
2021-12-14T07:56:45.000+0000 I got epoch {1639468604 2}
2021-12-14T07:56:45.000+0000 I [restore/2021-12-13T00:00:23Z] restore started
2021-12-14T07:56:48.761+0000 connected to node type: replset
2021-12-14T07:56:48.781+0000 archive prelude semantic.Semantic.MRSTRTPLMR
2021-12-14T07:56:48.781+0000 archive prelude semantic.Semantic.BYND

|2021-12-14T07:56:48.859+0000|archive prelude catalog.Attribute.Device|
|—|—|
|2021-12-14T07:56:48.859+0000|archive prelude admin.system.users|
|2021-12-14T07:56:48.859+0000|archive prelude admin.system.roles|
|2021-12-14T07:56:48.859+0000|archive prelude admin.system.version|
|2021-12-14T07:56:48.859+0000|archive format version 0.1|
|2021-12-14T07:56:48.859+0000|archive server version 4.4.6-8|
|2021-12-14T07:56:48.859+0000|archive tool version 0.0.1|
|2021-12-14T07:56:48.859+0000|setting number of parallel collections to number of parallel collections in archive (2)|
|2021-12-14T07:56:48.864+0000|preparing collections to restore from|
|2021-12-14T07:56:48.864+0000|found collection semantic.Semantic.MRSTRTPLMR bson to restore to semantic.Semantic.MRSTRTPLMR|
2021-12-14T07:56:48.873+0000 found collection metadata from semantic.Semantic.MTN1 to restore to semantic.Semantic.MTN1
2021-12-14T07:56:48.873+0000 found collection admin.pbmRUsers bson to restore to admin.pbmRUsers
2021-12-14T07:56:48.873+0000 found collection metadata from admin.pbmRUsers to restore to admin.pbmRUsers
2021-12-14T07:56:48.873+0000 skipping restore of system.profile collection in admin
2021-12-14T07:56:48.873+0000 skipping restore of system.profile metadata
2021-12-14T07:56:48.873+0000 found collection admin.pbmRRoles bson to restore to admin.pbmRRoles
2021-12-14T07:56:48.873+0000 found collection metadata from admin.pbmRRoles to restore to admin.pbmRRoles
2021-12-14T07:56:48.873+0000 found collection admin.system.users bson to restore to admin.system.users
2021-12-14T07:56:48.873+0000 found collection metadata from admin.system.users to restore to admin.system.users

2021-12-14T07:56:48.876+0000 found collection catalog.EntityType bson to restore to catalog.EntityType
2021-12-14T07:56:48.876+0000 found collection metadata from catalog.EntityType to restore to catalog.EntityType
2021-12-14T07:56:48.876+0000 found collection catalog.Datasource bson to restore to catalog.Datasource
2021-12-14T07:56:48.876+0000 found collection metadata from catalog.Datasource to restore to catalog.Datasource
2021-12-14T07:56:48.876+0000 found collection catalog.Attribute.Device bson to restore to catalog.Attribute.Device
2021-12-14T07:56:48.876+0000 found collection metadata from catalog.Attribute.Device to restore to catalog.Attribute.Device
2021-12-14T07:56:48.876+0000 found collection benchmark.test bson to restore to benchmark.test
2021-12-14T07:56:48.876+0000 found collection metadata from benchmark.test to restore to benchmark.test
2021-12-14T07:56:48.876+0000 found collection estimation.test bson to restore to estimation.test
2021-12-14T07:56:48.876+0000 found collection metadata from estimation.test to restore to estimation.test
2021-12-14T07:56:48.964+0000 first non special collection semantic.Semantic.BRKHG_B found. The demultiplexer will handle it and the remainder
2021-12-14T07:56:48.964+0000 restoring up to 2 collections in parallel
2021-12-14T07:56:48.978+0000 dropping collection semantic.Semantic.BRKHG_B before restoring
2021-12-14T07:56:48.984+0000 reading metadata for semantic.Semantic.BRKHG_B from archive on stdin
2021-12-14T07:56:48.985+0000 creating collection semantic.Semantic.BRKHG_B using options from metadata
2021-12-14T07:56:49.057+0000 restoring semantic.Semantic.BRKHG_B from archive on stdin
2021-12-14T07:56:49.062+0000 using 10 insertion workers
2021-12-14T07:56:51.557+0000 dropping collection semantic.Semantic.TDRNG_5 before restoring
2021-12-14T07:56:51.564+0000 reading metadata for semantic.Semantic.TDRNG_5 from archive on stdin
2021-12-14T07:56:51.564+0000 creating collection semantic.Semantic.TDRNG_5 using options from metadata
2021-12-14T07:56:51.589+0000 restoring semantic.Semantic.TDRNG_5 from archive on stdin
2021-12-14T07:56:51.595+0000 using 10 insertion workers
2021-12-14T08:03:49.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:03:49.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
2021-12-14T08:03:50.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:05:12.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:05:12.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
2021-12-14T08:05:13.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:06:39.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:06:39.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 1s
2021-12-14T08:06:40.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:07:11.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:07:11.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 2s
2021-12-14T08:07:13.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:07:43.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:07:43.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 3s
2021-12-14T08:07:46.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:08:17.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:08:17.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 4s
2021-12-14T08:08:21.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:08:52.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:08:52.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 5s
2021-12-14T08:08:57.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:09:27.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:09:27.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 6s
2021-12-14T08:09:33.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:10:03.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:10:03.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 7s
2021-12-14T08:10:10.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:10:41.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:10:41.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 8s
2021-12-14T08:10:49.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:11:19.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:11:19.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 9s
2021-12-14T08:11:28.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:11:59.000+0000 W [restore/2021-12-13T00:00:23Z] errReadObj Err: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:11:59.000+0000 W [restore/2021-12-13T00:00:23Z] got context deadline exceeded (Client.Timeout or context cancellation while reading body), try to reconnect in 10s
2021-12-14T08:12:09.000+0000 I [restore/2021-12-13T00:00:23Z] session recreated, resuming download
2021-12-14T08:12:09.000+0000 E [restore/2021-12-13T00:00:23Z] download ‘bk-mongo-dev/2021-12-13T00:00:23Z_rs0.dump.gz’ file from S3: context deadline exceeded (Client.Timeout or context cancellation while reading body)
2021-12-14T08:12:09.374+0000 finished restoring semantic.Semantic.BRKHG_B (2839546 documents, 0 failures)
2021-12-14T08:12:09.558+0000 finished restoring semantic.Semantic.TDRNG_5 (2878599 documents, 0 failures)
2021-12-14T08:12:09.559+0000 demux finishing when there are still outs (7)
2021-12-14T08:12:09.559+0000 demux finishing (err:corruption found in archive; ParserConsumer.BodyBSON() ( corruption found in archive; read bson ( unexpected EOF ) ))
2021-12-14T08:12:09.000+0000 E [restore/2021-12-13T00:00:23Z] restore: restore mongo dump (successes: 5718145 / fails: 0): semantic.Semantic.BRKHG_B: error restoring from archive on stdin: reading bson input: error demultiplexing archive; archive io error
2021-12-14T08:12:09.000+0000 D [restore/2021-12-13T00:00:23Z] releasing lock

any news? this is blocking our restore

From the logs it seems rs0-2 cannot read the backup files from the backup location. Try manually accessing them from the host to see what happens

well, the backup worked.
do i have to use a different manifest for the restore?
a few days ago i was checking the logs and realized that the restore has dropped the collection xxx (don’t remember which one) and its indexes. then recreated it and copied the bsondocuments, but failed when trying to restore indexes.
in our db we have collections with 100/150 millions of documents. maybe we have to change some mongo setting?