Issue with Backups After Running Cleanup on MongoDB with PBM

Hello,

I’ve been testing repeatedly over the past few weeks and consistently run into the same issue. When I execute a “cleanup”, once it completes, no backups are visible under “pbm status” and the oplog backups also become unavailable, even though the backup files are confirmed to still exist on the remote storage.
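For example, listing the bucket through the custom endpoint still shows the snapshot prefixes (a sketch using the AWS CLI; the bucket name is the redacted one from the config further down):

# aws s3 ls s3://xxxxxx01/ --endpoint-url https://s3.gra.perf.cloud.ovh.net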

Node details:

  • mongo-011: primary
  • mongo-012: secondary
  • mongo-013: secondary
  • mongo-014: hidden
  • mongo-015: hidden

To provide further insight, here’s the test I ran yesterday:

# pbm status
Cluster:
========
cluster01:
  - cluster01/mongo-013.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-014.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-015.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-012.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-011.vrack:27017 [P]: pbm-agent v2.3.0 OK


PITR incremental backup:
========================
Status [ON]

Currently running:
==================
(none)

Backups:
========
S3 gra s3://https://s3.gra.perf.cloud.ovh.net/xxxxxxx01
  Snapshots:
    2023-10-25T02:00:01Z 37.73GB <logical> [restore_to_time: 2023-10-25T02:27:24Z]
    2023-10-24T02:00:01Z 39.12GB <logical> [restore_to_time: 2023-10-24T04:33:05Z]
    2023-10-23T02:00:01Z 39.10GB <logical> [restore_to_time: 2023-10-23T04:33:57Z]
    2023-10-22T02:00:02Z 37.57GB <logical> [restore_to_time: 2023-10-22T02:35:38Z]
    2023-10-21T02:00:01Z 38.94GB <logical> [restore_to_time: 2023-10-21T04:25:50Z]
    2023-10-20T02:00:01Z 37.45GB <logical> [restore_to_time: 2023-10-20T02:28:51Z]
    2023-10-19T02:00:01Z 37.41GB <logical> [restore_to_time: 2023-10-19T02:31:45Z]
    2023-10-18T02:00:01Z 37.36GB <logical> [restore_to_time: 2023-10-18T02:27:48Z]
    2023-10-17T02:00:02Z 38.79GB <logical> [restore_to_time: 2023-10-17T04:32:50Z]
    2023-10-16T02:00:01Z 38.01GB <logical> [restore_to_time: 2023-10-16T02:30:04Z]
    2023-10-15T02:00:01Z 37.95GB <logical> [restore_to_time: 2023-10-15T02:19:33Z]
    2023-10-14T02:00:02Z 38.61GB <logical> [restore_to_time: 2023-10-14T04:33:08Z]
    2023-10-13T02:00:01Z 38.47GB <logical> [restore_to_time: 2023-10-13T04:31:40Z]
    2023-10-12T08:40:53Z 39.20GB <logical> [restore_to_time: 2023-10-12T11:56:11Z]
  PITR chunks [94.93GB]:
    2023-10-12T11:56:12Z - 2023-10-25T22:57:34Z
# pbm cleanup -y --older-than 5d
(after waiting a few hours)
# pbm status
Cluster:
========
cluster01:
  - cluster01/mongo-013.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-014.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-015.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-012.vrack:27017 [S]: pbm-agent v2.3.0 OK
  - cluster01/mongo-011.vrack:27017 [P]: pbm-agent v2.3.0 OK


PITR incremental backup:
========================
Status [OFF]

Currently running:
==================
(none)

Backups:
========
S3 gra s3://https://s3.gra.perf.cloud.ovh.net/xxxxxxx01
  Snapshots:
  PITR chunks [1.13GB]:
    2023-10-25T23:07:34Z - 2023-10-26T01:57:34Z (no base snapshot)

The log displays the following error:

2023-10-26T07:50:09Z E [cluster01/mongo-014.vrack:27017] [pitr] init: catchup: no backup found. full backup is required to start PITR

Additionally, something that did not happen in prior tests: when the cleanup command ran, the primary node, mongo-011, spiked to 100% load, which forced me to stop the pbm-agent service on that node.
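For reference, the agents run under systemd here, so stopping the one on the primary was simply:

# systemctl stop pbm-agent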

pbm-agent.service logs for mongo-011:

Oct 25 23:02:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:02:11.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:02:11.000+0000 I got epoch {1698200847 8}
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:15.000+0000 E [cleanup] delete backup files "2023-10-19T02:00:01Z": get file list: "2023-10-19T02:00:01Z/": RequestError: send request failed
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: caused by: Get "https://s3.gra.perf.cloud.ovh.net/xxxxxx01?marker=2023-10-19T02%3A00%3A01Z%2Fcluster01%2Fxxxxxxx.paymentstoragemovements.s2&prefix=2023-10-19T02%3A00%3A01Z%2F": dial tcp 141.95.161.76:443: connect:>
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:15.000+0000 D [cleanup] got physical restores list: 0
Oct 25 23:10:55 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:55.000+0000 D [cleanup] got backups list: 14
Oct 25 23:10:55 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:55.000+0000 D [cleanup] bcp: 2023-10-12T08:40:53Z.pbm.json
Oct 25 23:11:04 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:04.000+0000 W [cleanup] skip snapshot 2023-10-12T08:40:53Z: file "2023-10-12T08:40:53Z/cluster01/xxxx01.partners.s2": no such file
Oct 25 23:11:04 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:04.000+0000 D [cleanup] bcp: 2023-10-13T02:00:01Z.pbm.json
Oct 25 23:11:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:11.000+0000 W [cleanup] skip snapshot 2023-10-13T02:00:01Z: file "2023-10-13T02:00:01Z/cluster01/xxxx01.plannings.s2": no such file
Oct 25 23:11:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:11.000+0000 D [cleanup] bcp: 2023-10-14T02:00:02Z.pbm.json
Oct 25 23:11:20 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:20.000+0000 W [cleanup] skip snapshot 2023-10-14T02:00:02Z: file "2023-10-14T02:00:02Z/cluster01/local.oplog.rs.bson.s2": no such file
Oct 25 23:11:20 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:20.000+0000 D [cleanup] bcp: 2023-10-15T02:00:01Z.pbm.json

pbm-agent.service logs for mongo-012:

Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired
Oct 26 02:00:02 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:02.000+0000 I got command backup [name: 2023-10-26T02:00:01Z, compression: s2 (level: default)] <ts: 1698285601>
Oct 26 02:00:02 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:02.000+0000 I got epoch {1698200847 8}
Oct 26 02:00:17 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:17.000+0000 D [backup/2023-10-26T02:00:01Z] nomination timeout
Oct 26 02:00:17 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:17.000+0000 D [backup/2023-10-26T02:00:01Z] skip after nomination, probably started by another node
Oct 26 02:00:37 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:37.000+0000 D [pitr] start_catchup
Oct 26 02:00:37 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:37.000+0000 E [pitr] init: catchup: no backup found. full backup is required to start PITR
Oct 26 02:01:07 mongo-012 pbm-agent[333095]: 2023-10-26T02:01:07.000+0000 D [pitr] start_catchup
Oct 26 02:01:07 mongo-012 pbm-agent[333095]: 2023-10-26T02:01:07.000+0000 E [pitr] init: catchup: no backup found. full backup is required to start PITR

pbm-agent.service logs for mongo-013:

Oct 25 22:57:34 mongo-013 pbm-agent[328438]: 2023-10-25T22:57:34.000+0000 I [pitr] created chunk 2023-10-25T22:52:34 - 2023-10-25T22:57:34. Next chunk creation scheduled to begin at ~2023-10-25T23:02:34
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired
Oct 25 23:03:38 mongo-013 pbm-agent[328438]: 2023-10-25T23:03:38.000+0000 I [pitr] created chunk 2023-10-25T22:57:34 - 2023-10-25T23:02:34. Next chunk creation scheduled to begin at ~2023-10-25T23:07:34
Oct 25 23:07:50 mongo-013 pbm-agent[328438]: 2023-10-25T23:07:50.000+0000 I [pitr] created chunk 2023-10-25T23:02:34 - 2023-10-25T23:07:34. Next chunk creation scheduled to begin at ~2023-10-25T23:12:34
Oct 25 23:12:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:12:34.000+0000 I [pitr] created chunk 2023-10-25T23:07:34 - 2023-10-25T23:12:34. Next chunk creation scheduled to begin at ~2023-10-25T23:17:34
Oct 25 23:17:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:17:34.000+0000 I [pitr] created chunk 2023-10-25T23:12:34 - 2023-10-25T23:17:34. Next chunk creation scheduled to begin at ~2023-10-25T23:22:34
Oct 25 23:22:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:22:34.000+0000 I [pitr] created chunk 2023-10-25T23:17:34 - 2023-10-25T23:22:34. Next chunk creation scheduled to begin at ~2023-10-25T23:27:34
Oct 25 23:27:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:27:34.000+0000 I [pitr] created chunk 2023-10-25T23:22:34 - 2023-10-25T23:27:34. Next chunk creation scheduled to begin at ~2023-10-25T23:32:34
Oct 25 23:32:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:32:34.000+0000 I [pitr] created chunk 2023-10-25T23:27:34 - 2023-10-25T23:32:34. Next chunk creation scheduled to begin at ~2023-10-25T23:37:34
Oct 25 23:37:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:37:34.000+0000 I [pitr] created chunk 2023-10-25T23:32:34 - 2023-10-25T23:37:34. Next chunk creation scheduled to begin at ~2023-10-25T23:42:34
Oct 25 23:42:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:42:34.000+0000 I [pitr] created chunk 2023-10-25T23:37:34 - 2023-10-25T23:42:34. Next chunk creation scheduled to begin at ~2023-10-25T23:47:34
Oct 25 23:47:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:47:34.000+0000 I [pitr] created chunk 2023-10-25T23:42:34 - 2023-10-25T23:47:34. Next chunk creation scheduled to begin at ~2023-10-25T23:52:34
Oct 25 23:52:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:52:34.000+0000 I [pitr] created chunk 2023-10-25T23:47:34 - 2023-10-25T23:52:34. Next chunk creation scheduled to begin at ~2023-10-25T23:57:34
Oct 25 23:57:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:57:34.000+0000 I [pitr] created chunk 2023-10-25T23:52:34 - 2023-10-25T23:57:34. Next chunk creation scheduled to begin at ~2023-10-26T00:02:34

pbm-agent.service logs for mongo-014:

Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired

pbm-agent.service logs for mongo-015:

(nothing logged during this period)

Additional info:

cluster01:PRIMARY> db.pbmLockOp.find({})
{ "_id" : ObjectId("65399e73b2580fa52a2e0248"), "type" : "cleanup", "replset" : "cluster01", "node" : "mongo-011.vrack:27017", "opid" : "65399e73787d0cd92f116498", "epoch" : Timestamp(1698200847, 8), "hb" : Timestamp(1698275486, 2) }
# pbm config
pitr:
  enabled: false
  oplogSpanMin: 5
  compression: s2
storage:
  type: s3
  s3:
    provider: aws
    region: gra
    endpointUrl: https://s3.gra.perf.cloud.ovh.net/
    bucket: xxxxxx01
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
backup:
  compression: s2

Any guidance on this would be greatly appreciated.

Thank you.

Hi @Florian,

It looks like you hit a bug while the “2023-10-19T02:00:01Z” backup was being deleted.

In short, because of the S3 request failure, not every backup file was deleted, and the metadata was not re-synced from storage (which is essentially what the pbm config --force-resync command does).

The metadata is still in storage, but PBM does not see it in the database and therefore asks you to make a new backup.
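To see the mismatch, you can inspect the backup metadata PBM keeps in the database (a sketch assuming the standard PBM metadata collection admin.pbmBackups; exact field names can vary by PBM version) and compare it against what is on storage:

cluster01:PRIMARY> use admin
cluster01:PRIMARY> db.pbmBackups.find({}, { name: 1, status: 1 })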

Please do the following (a concrete command sketch follows below):

  1. Stop all agents.

  2. Delete all documents in the admin.pbmLock and admin.pbmLockOp collections.

  3. Start all agents.

  4. Run pbm config --force-resync.

  5. Wait for it to finish (look for “[resync] succeed” in pbm logs).

Then, you can try to rerun the cleanup.
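For reference, here is that sequence as concrete commands (a sketch assuming systemd-managed agents, as your logs suggest, and a mongo shell session on the primary):

on every node:
# systemctl stop pbm-agent

on the primary, in the mongo shell:
cluster01:PRIMARY> use admin
cluster01:PRIMARY> db.pbmLock.deleteMany({})
cluster01:PRIMARY> db.pbmLockOp.deleteMany({})

on every node:
# systemctl start pbm-agent

then resync and watch for “[resync] succeed”:
# pbm config --force-resync
# pbm logs --tail 50

once the resync has succeeded, retry the cleanup:
# pbm cleanup -y --older-than 5d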

Some backups may show errors (files missing because of the partially completed deletion).

Could you please create a JIRA ticket with all these details?

We will investigate how the failure of the S3 request impacts the rest of the cleanup process.

Thank you
