Backup failing after update to 2.0.1

After updating pbm-agent on our cluster from 1.6.x to 2.0.1, backups started failing.
So far I have found and fixed one permissions issue:

  • The user pbm-agent runs as changed from pbm to mongod. However, I use Debian with the MongoDB repositories, where the default user is mongodb, so I had to adjust the systemd service file and some permissions (roughly as sketched below).
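
For anyone else on Debian, a minimal sketch of that change with a systemd drop-in (assuming the unit is named pbm-agent; the PBM config and any key files must of course stay readable by the new user):

  sudo systemctl edit pbm-agent      # opens a drop-in override for the unit
  # contents of the override:
  #   [Service]
  #   User=mongodb
  #   Group=mongodb
  sudo systemctl daemon-reload
  sudo systemctl restart pbm-agent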

There are other issues I still haven't been able to fix:

  • Getting these messages in the pbm status output:
  Snapshots:
    2022-10-26T20:17:56Z 0.00B <logical> [ERROR: get file 2022-10-26T20:17:56Z/rs0/metadata.json: no such file] [2022-10-26T20:17:59Z]
    2022-10-24T15:32:24Z 0.00B <logical> [ERROR: get file 2022-10-24T15:32:24Z_rs0.oplog.s2: file is empty] [2022-10-25T02:50:18Z]
  • /var/log/pbm-agent.log is empty, even after changing its owner to mongodb and removing the file (see the note below).
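
(Side note: depending on how the unit is set up, agent output may end up in the systemd journal rather than that file, and PBM also keeps its own log collection in the database that the CLI can query. Roughly, assuming the unit is named pbm-agent:)

  journalctl -u pbm-agent -e
  pbm logs -t0 -sD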

I’m using an S3-compatible API at Backblaze, which was working fine before the update. The storage config follows (how it gets applied is sketched after it):

pitr:
  enabled: true
  oplogSpanMin: 0
  compression: s2
storage:
  type: s3
  s3:
    provider: aws
    region: us-********
    endpointUrl: https://s3.us-********.backblazeb2.com
    bucket: ******-mongo-rs0
    prefix: data/pbm/backup
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    insecureSkipTLSVerify: false
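
For reference, a storage config like this gets loaded into the cluster with the pbm CLI; a minimal sketch (the file path is just an example):

  pbm config --file /etc/pbm/pbm_config.yaml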

When I run the “pbm backup” command, the job seems to start, but the “no such file” error appears immediately in the “pbm status” output; the backup does seem to run and uses resources.

1 Like

After trying again for some time, the logical snapshot backup completed successfully. However, the oplog file for the PITR backup was zero-sized, with the following error message:

S3 us-******** s3://https://s3.us-********.backblazeb2.com/***-mongo-rs0/data/pbm/backup
  Snapshots:
    2022-10-28T19:29:44Z 0.00B <logical> [ERROR: get file 2022-10-28T19:29:44Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-10-28T23:36:24Z]
    2022-10-27T15:48:46Z 0.00B <logical> [ERROR: get file 2022-10-27T15:48:46Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-10-28T03:04:08Z]
1 Like

Hi @robertomurta !
I can confirm such errors appear while the backup is still in progress, e.g.:

Currently running:
==================
Snapshot backup "2022-11-08T10:45:14Z", started at 2022-11-08T10:45:15Z. Status: snapshot backup. [op id: 636a333af5b7f77353a36b1a]

Backups:
========
S3 us-east-1 s3://s3.us-west-004.backblazeb2.com/pbmtest
  Snapshots:
    2022-11-08T10:45:14Z 0.00B <logical> [ERROR: get file 2022-11-08T10:45:14Z/rs/local.oplog.rs.bson.s2: no such file] [2022-11-08T10:45:18Z]

But as soon as it has finished, the status is OK:

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://s3.us-west-004.backblazeb2.com/pbmtest
  Snapshots:
    2022-11-08T10:45:14Z 100.05MB <logical> [restore_to_time: 2022-11-08T10:45:41Z]

You can try to perform the backup with the --wait flag so the pbm CLI waits for the overall process to finish:

pbm backup --wait
1 Like

We made a ticket for that: [PBM-995] Status errors for logical backups while running - Percona JIRA

1 Like

Yes, I did observe that the first error I described goes away after the backup completes. However, I still get errors after it finishes, with oplogs missing, as described in my second post and also on Discord.
Increasing replSet did not help.
I’m frustrated trying to find the problem in the logs; I found no apparent error in the remote storage logs or in the backup itself, just the missing oplog / insufficient range message below. What am I missing?

[backup/2022-11-02T23:39:49Z] mark RS as error oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run pbm backup to create a valid starting point for the PITR.: <nil>

1 Like

Looks like the oplog window is too small on the replica set, meaning that by the time PBM finished the snapshot dump, the oldest event still in the oplog was already newer than the backup's start point, so it could not cover the backup.
Can you post:

  • full pbm status output
  • pbm logs -sD -t0 -e backup/2022-11-02T23:39:49Z output
  • and output of rs.printReplicationInfo() from the mongo shell of the replica set that failed

And what do you mean by “Increasing replSet did not help”?

2 Likes

Sorry, I meant increasing the configured oplog size.
But now I found that the replica selected for the backup had a small oplog; somehow I had failed to resize it on all the replicas. I'm fixing it on all replicas and trying again.
I’m sending all the logs as requested:

pbm status

S3 xxxxxxx s3://https://s3.xxxxxxx.backblazeb2.com/*****rs0/data/pbm/backup
  Snapshots:
    2022-11-03T18:43:51Z 0.00B <logical> [ERROR: get file 2022-11-03T18:43:51Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-11-04T00:44:50Z]
    2022-11-02T23:39:49Z 0.00B <logical> [ERROR: get file 2022-11-02T23:39:49Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-11-03T05:07:15Z]
    2022-08-15T17:10:07Z 408.83GB <logical> [restore_to_time: 2022-08-15T20:21:45Z]
    2022-08-01T17:10:02Z 371.76GB <logical> [restore_to_time: 2022-08-01T19:54:47Z]
    2022-07-15T17:10:01Z 334.11GB <logical> [restore_to_time: 2022-07-15T20:25:41Z]
    2022-07-01T17:10:02Z 303.76GB <logical> [restore_to_time: 2022-07-01T19:29:22Z]
    2022-06-24T17:51:31Z 290.30GB <logical> [restore_to_time: 2022-06-24T19:50:54Z]
    2022-06-08T20:36:42Z 264.01GB <logical> [restore_to_time: 2022-06-08T23:07:23Z]
    2022-05-18T15:36:30Z 230.80GB <logical> [restore_to_time: 2022-05-18T17:45:40Z]
    2022-04-11T20:32:18Z 171.57GB <logical> [restore_to_time: 2022-04-11T22:19:17Z]
    2022-03-15T17:10:02Z 78.50GB <logical> [restore_to_time: 2022-03-15T22:33:28Z]
    2022-02-16T14:40:59Z 434.35GB <logical> [restore_to_time: 2022-02-16T17:39:21Z]
    2022-02-04T19:19:16Z 522.32GB <logical> [restore_to_time: 2022-02-04T22:08:28Z]
    2021-12-16T14:32:14Z 438.61GB <logical> [restore_to_time: 2021-12-16T18:03:33Z]
    2021-12-02T19:19:28Z 418.04GB <logical> [restore_to_time: 2021-12-02T21:28:10Z]
  PITR chunks [1.50TB]:
    2022-06-24T19:50:55Z - 2022-09-01T17:10:03Z
    2022-06-08T23:07:24Z - 2022-06-23T18:02:19Z
    2022-05-18T17:45:41Z - 2022-06-01T17:10:05Z
    2022-04-11T22:19:18Z - 2022-04-15T17:10:11Z
    2022-03-15T22:33:29Z - 2022-04-01T17:10:05Z
    2022-02-16T17:39:22Z - 2022-03-01T17:10:04Z
    2022-02-04T22:08:29Z - 2022-02-15T17:10:04Z
    2021-12-16T18:03:34Z - 2022-01-01T17:10:03Z
    2021-12-02T21:28:11Z - 2021-12-15T17:10:03Z

pbm logs -sD -t0 -e backup/2022-11-02T23:39:49Z:

2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] init backup meta
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] nomination list for rs0: [[db1.xxxx.xxx:27017 db3.xxxx.xxx:27017 db4.xxxx.xxx:27017] [db2.xxxx.xxx:27017]]
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] nomination rs0, set candidates [db1.xxxx.xxx:27017 db3.xxxx.xxx:27017 db4.xxxx.xxx:27017]
2022-11-02T23:39:50Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] backup started
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:51Z D [rs0/db3.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:51Z D [rs0/db1.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:54Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] wait for tmp users {1667432394 20}
2022-11-02T23:39:55Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] bcp nomination: rs0 won by db4.xxxx.xxx:27017
2022-11-03T05:07:11Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mongodump finished, waiting for the oplog
2022-11-03T05:07:14Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] set oplog span to {1667432391 51} / {1667452031 10}
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] dropping tmp collections
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mark RS as error `oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.`: <nil>
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mark backup as error `oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.`: <nil>
2022-11-03T05:07:16Z E [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] backup: oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.
2022-11-03T05:07:16Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] releasing lock
rs0:SECONDARY> rs.printReplicationInfo()
configured oplog size:   2262.2998046875MB
log length start to end: 1188secs (0.33hrs)
oplog first event time:  Wed Nov 09 2022 09:33:53 GMT-0400 (-04)
oplog last event time:   Wed Nov 09 2022 09:53:41 GMT-0400 (-04)
now:                     Wed Nov 09 2022 09:53:41 GMT-0400 (-04)
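
Putting the numbers together (a rough back-of-the-envelope check):

  dump duration : 2022-11-02T23:39:50Z -> 2022-11-03T05:07:11Z, roughly 5.5 h
  oplog window  : 1188 s ≈ 0.33 h at the 2262 MB oplog size
  => by the time mongodump finished, the entries since the backup start ts {1667432391 51} had long since rolled off the oplog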
2 Likes

Thank you, Andrew_Pogrebnoi!
Updating the situation: I had to increase the oplog size further, as the 35 GB I first tried was not enough to cover the ~6 hours the backup takes to complete. I went all the way to 60 GB (which I think is huge), but it can now hold over 11 hours of oplog:

rs0:SECONDARY> rs.printReplicationInfo()
configured oplog size:   60000MB
log length start to end: 42953secs (11.93hrs)
oplog first event time:  Wed Nov 09 2022 22:47:53 GMT-0400 (-04)
oplog last event time:   Thu Nov 10 2022 10:43:46 GMT-0400 (-04)
now:                     Thu Nov 10 2022 10:43:47 GMT-0400 (-04)

For newbies like me: it is important to set the oplog size on each replica, since the oplog lives in each node's local database (a sketch follows below).
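
A minimal sketch of that resize, run from the mongo shell against each replica set member in turn (replSetResizeOplog needs WiredTiger and MongoDB 3.6+; the 60000 MB just mirrors what I ended up using):

  // run on EACH member, primary and secondaries; size is in MB
  db.adminCommand({ replSetResizeOplog: 1, size: 60000 })
  rs.printReplicationInfo()   // verify the new oplog window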

2 Likes