Backup failing after update to 2.0.1

After updating pbm-agent on our cluster from 1.6.x to 2.0.1, backups started failing.
So far I have found and fixed one permissions issue:

  • The user pbm-agent runs as changed from pbm to mongod. However, I use Debian with the MongoDB repositories, where the default user is mongodb, so I had to adjust the systemd service file and some permissions (roughly as sketched below).
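
For anyone else on Debian, a minimal sketch of that change with a systemd drop-in (assuming the unit is named pbm-agent; the PBM config and any key files must of course stay readable by the new user):

  sudo systemctl edit pbm-agent      # opens a drop-in override for the unit
  # contents of the override:
  #   [Service]
  #   User=mongodb
  #   Group=mongodb
  sudo systemctl daemon-reload
  sudo systemctl restart pbm-agent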

There are other issues I still haven't been able to fix:

  • Getting these messages in the pbm status output:
  Snapshots:
    2022-10-26T20:17:56Z 0.00B <logical> [ERROR: get file 2022-10-26T20:17:56Z/rs0/metadata.json: no such file] [2022-10-26T20:17:59Z]
    2022-10-24T15:32:24Z 0.00B <logical> [ERROR: get file 2022-10-24T15:32:24Z_rs0.oplog.s2: file is empty] [2022-10-25T02:50:18Z]
  • /var/log/pbm-agent.log is empty, even after changing its owner to mongodb and removing the file (see the note below).
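
(Side note: depending on how the unit is set up, agent output may end up in the systemd journal rather than that file, and PBM also keeps its own log collection in the database that the CLI can query. Roughly, assuming the unit is named pbm-agent:)

  journalctl -u pbm-agent -e
  pbm logs -t0 -sD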

I’m using an S3-compatible API at Backblaze, which was working fine before the update. The storage config follows (how it gets applied is sketched after it):

pitr:
  enabled: true
  oplogSpanMin: 0
  compression: s2
storage:
  type: s3
  s3:
    provider: aws
    region: us-********
    endpointUrl: https://s3.us-********.backblazeb2.com
    bucket: ******-mongo-rs0
    prefix: data/pbm/backup
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    insecureSkipTLSVerify: false
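
For reference, a storage config like this gets loaded into the cluster with the pbm CLI; a minimal sketch (the file path is just an example):

  pbm config --file /etc/pbm/pbm_config.yaml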

When I run the “pbm backup” command, the job seems to start, but the “no such file” error appears immediately in the “pbm status” output; the backup does seem to run and uses resources.

1 Like

After trying again for some time, the logical snapshot backup completed successfully. However, the oplog file for the PITR backup was zero-sized, with the following error message:

S3 us-******** s3://https://s3.us-********.backblazeb2.com/***-mongo-rs0/data/pbm/backup
  Snapshots:
    2022-10-28T19:29:44Z 0.00B <logical> [ERROR: get file 2022-10-28T19:29:44Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-10-28T23:36:24Z]
    2022-10-27T15:48:46Z 0.00B <logical> [ERROR: get file 2022-10-27T15:48:46Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-10-28T03:04:08Z]
1 Like

Hi @robertomurta !
I can confirm such errors appear while the backup is still in progress, e.g.:

Currently running:
==================
Snapshot backup "2022-11-08T10:45:14Z", started at 2022-11-08T10:45:15Z. Status: snapshot backup. [op id: 636a333af5b7f77353a36b1a]

Backups:
========
S3 us-east-1 s3://s3.us-west-004.backblazeb2.com/pbmtest
  Snapshots:
    2022-11-08T10:45:14Z 0.00B <logical> [ERROR: get file 2022-11-08T10:45:14Z/rs/local.oplog.rs.bson.s2: no such file] [2022-11-08T10:45:18Z]

But as soon as it has finished, the status is OK:

Currently running:
==================
(none)

Backups:
========
S3 us-east-1 s3://s3.us-west-004.backblazeb2.com/pbmtest
  Snapshots:
    2022-11-08T10:45:14Z 100.05MB <logical> [restore_to_time: 2022-11-08T10:45:41Z]

You can try to perform the backup with the --wait flag so the pbm CLI waits for the overall process to finish:

pbm backup --wait
1 Like

We made a ticket for that: [PBM-995] Status errors for logical backups while running - Percona JIRA

1 Like

Yes, I did observe that the first error I described goes away after the backup completes. However, I still get errors after it finishes, with oplogs missing, as described in my second post and also on Discord.
Increasing replSet did not help.
I’m frustrated trying to find the problem in the logs; I found no apparent error in the remote storage logs or in the backup itself, just the missing oplog / insufficient range message below. What am I missing?

[backup/2022-11-02T23:39:49Z] mark RS as error oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run pbm backup to create a valid starting point for the PITR.: <nil>

1 Like

Looks like the oplog window is too small on the replica set, meaning that by the time PBM finished the snapshot dump, the oldest event still in the oplog was already newer than the backup's start point, so it could not cover the backup.
Can you post:

  • full pbm status output
  • pbm logs -sD -t0 -e backup/2022-11-02T23:39:49Z output
  • and output of rs.printReplicationInfo() from the mongo shell of the replica set that failed

And what do you mean by “Increasing replSet did not help”?

2 Likes

Sorry, I meant increasing the configured oplog size.
But now I found that the replica selected for the backup had a small oplog; somehow I had failed to resize it on all the replicas. I'm fixing it on all replicas and trying again.
I’m sending all the logs as requested:

pbm status

S3 xxxxxxx s3://https://s3.xxxxxxx.backblazeb2.com/*****rs0/data/pbm/backup
  Snapshots:
    2022-11-03T18:43:51Z 0.00B <logical> [ERROR: get file 2022-11-03T18:43:51Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-11-04T00:44:50Z]
    2022-11-02T23:39:49Z 0.00B <logical> [ERROR: get file 2022-11-02T23:39:49Z/rs0/local.oplog.rs.bson.s2: file is empty] [2022-11-03T05:07:15Z]
    2022-08-15T17:10:07Z 408.83GB <logical> [restore_to_time: 2022-08-15T20:21:45Z]
    2022-08-01T17:10:02Z 371.76GB <logical> [restore_to_time: 2022-08-01T19:54:47Z]
    2022-07-15T17:10:01Z 334.11GB <logical> [restore_to_time: 2022-07-15T20:25:41Z]
    2022-07-01T17:10:02Z 303.76GB <logical> [restore_to_time: 2022-07-01T19:29:22Z]
    2022-06-24T17:51:31Z 290.30GB <logical> [restore_to_time: 2022-06-24T19:50:54Z]
    2022-06-08T20:36:42Z 264.01GB <logical> [restore_to_time: 2022-06-08T23:07:23Z]
    2022-05-18T15:36:30Z 230.80GB <logical> [restore_to_time: 2022-05-18T17:45:40Z]
    2022-04-11T20:32:18Z 171.57GB <logical> [restore_to_time: 2022-04-11T22:19:17Z]
    2022-03-15T17:10:02Z 78.50GB <logical> [restore_to_time: 2022-03-15T22:33:28Z]
    2022-02-16T14:40:59Z 434.35GB <logical> [restore_to_time: 2022-02-16T17:39:21Z]
    2022-02-04T19:19:16Z 522.32GB <logical> [restore_to_time: 2022-02-04T22:08:28Z]
    2021-12-16T14:32:14Z 438.61GB <logical> [restore_to_time: 2021-12-16T18:03:33Z]
    2021-12-02T19:19:28Z 418.04GB <logical> [restore_to_time: 2021-12-02T21:28:10Z]
  PITR chunks [1.50TB]:
    2022-06-24T19:50:55Z - 2022-09-01T17:10:03Z
    2022-06-08T23:07:24Z - 2022-06-23T18:02:19Z
    2022-05-18T17:45:41Z - 2022-06-01T17:10:05Z
    2022-04-11T22:19:18Z - 2022-04-15T17:10:11Z
    2022-03-15T22:33:29Z - 2022-04-01T17:10:05Z
    2022-02-16T17:39:22Z - 2022-03-01T17:10:04Z
    2022-02-04T22:08:29Z - 2022-02-15T17:10:04Z
    2021-12-16T18:03:34Z - 2022-01-01T17:10:03Z
    2021-12-02T21:28:11Z - 2021-12-15T17:10:03Z

pbm logs -sD -t0 -e backup/2022-11-02T23:39:49Z:

2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] init backup meta
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] nomination list for rs0: [[db1.xxxx.xxx:27017 db3.xxxx.xxx:27017 db4.xxxx.xxx:27017] [db2.xxxx.xxx:27017]]
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] nomination rs0, set candidates [db1.xxxx.xxx:27017 db3.xxxx.xxx:27017 db4.xxxx.xxx:27017]
2022-11-02T23:39:50Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] backup started
2022-11-02T23:39:50Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:51Z D [rs0/db3.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:51Z D [rs0/db1.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] skip after nomination, probably started by another node
2022-11-02T23:39:54Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] wait for tmp users {1667432394 20}
2022-11-02T23:39:55Z D [rs0/db2.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] bcp nomination: rs0 won by db4.xxxx.xxx:27017
2022-11-03T05:07:11Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mongodump finished, waiting for the oplog
2022-11-03T05:07:14Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] set oplog span to {1667432391 51} / {1667452031 10}
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] dropping tmp collections
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mark RS as error `oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.`: <nil>
2022-11-03T05:07:15Z I [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] mark backup as error `oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.`: <nil>
2022-11-03T05:07:16Z E [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] backup: oplog: read data: oplog has insufficient range, some records since the last saved ts {1667432391 51} are missing. Run `pbm backup` to create a valid starting point for the PITR.
2022-11-03T05:07:16Z D [rs0/db4.xxxx.xxx:27017] [backup/2022-11-02T23:39:49Z] releasing lock
rs0:SECONDARY> rs.printReplicationInfo()
configured oplog size:   2262.2998046875MB
log length start to end: 1188secs (0.33hrs)
oplog first event time:  Wed Nov 09 2022 09:33:53 GMT-0400 (-04)
oplog last event time:   Wed Nov 09 2022 09:53:41 GMT-0400 (-04)
now:                     Wed Nov 09 2022 09:53:41 GMT-0400 (-04)
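
Putting the numbers together (a rough back-of-the-envelope check):

  dump duration : 2022-11-02T23:39:50Z -> 2022-11-03T05:07:11Z, roughly 5.5 h
  oplog window  : 1188 s ≈ 0.33 h at the 2262 MB oplog size
  => by the time mongodump finished, the entries since the backup start ts {1667432391 51} had long since rolled off the oplog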
2 Likes

Thank you, Andrew_Pogrebnoi!
Updating the situation: I had to increase the oplog size further, as the 35 GB I first tried was not enough to cover the ~6 hours the backup takes to complete. I went all the way to 60 GB (which I think is huge), but it can now hold over 11 hours of oplog:

rs0:SECONDARY> rs.printReplicationInfo()
configured oplog size:   60000MB
log length start to end: 42953secs (11.93hrs)
oplog first event time:  Wed Nov 09 2022 22:47:53 GMT-0400 (-04)
oplog last event time:   Thu Nov 10 2022 10:43:46 GMT-0400 (-04)
now:                     Thu Nov 10 2022 10:43:47 GMT-0400 (-04)

For newbies like me: it is important to set the oplog size on each replica, since the oplog lives in each node's local database (a sketch follows below).
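
A minimal sketch of that resize, run from the mongo shell against each replica set member in turn (replSetResizeOplog needs WiredTiger and MongoDB 3.6+; the 60000 MB just mirrors what I ended up using):

  // run on EACH member, primary and secondaries; size is in MB
  db.adminCommand({ replSetResizeOplog: 1, size: 60000 })
  rs.printReplicationInfo()   // verify the new oplog window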

2 Likes