Hello,
I’ve been testing repeatedly over the past few weeks and consistently hit the same issue: after a “cleanup” completes, no backups are visible in “pbm status” and the oplog (PITR) chunks also become unavailable, even though the backup files still exist on the remote storage.
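For context, this is roughly how I confirm the backup files are still present on the remote storage (bucket name masked; the aws CLI pointed at our OVH endpoint is just the client I happen to use for this check):
# aws s3 ls "s3://xxxxxx01/2023-10-19T02:00:01Z/" --recursive --endpoint-url https://s3.gra.perf.cloud.ovh.net/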
Node details:
- mongo-011: primary
- mongo-012: secondary
- mongo-013: secondary
- mongo-014: hidden
- mongo-015: hidden
To provide further insight, here’s the test I ran yesterday:
# pbm status
Cluster:
========
cluster01:
- cluster01/mongo-013.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-014.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-015.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-012.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-011.vrack:27017 [P]: pbm-agent v2.3.0 OK
PITR incremental backup:
========================
Status [ON]
Currently running:
==================
(none)
Backups:
========
S3 gra s3://https://s3.gra.perf.cloud.ovh.net/xxxxxxx01
Snapshots:
2023-10-25T02:00:01Z 37.73GB <logical> [restore_to_time: 2023-10-25T02:27:24Z]
2023-10-24T02:00:01Z 39.12GB <logical> [restore_to_time: 2023-10-24T04:33:05Z]
2023-10-23T02:00:01Z 39.10GB <logical> [restore_to_time: 2023-10-23T04:33:57Z]
2023-10-22T02:00:02Z 37.57GB <logical> [restore_to_time: 2023-10-22T02:35:38Z]
2023-10-21T02:00:01Z 38.94GB <logical> [restore_to_time: 2023-10-21T04:25:50Z]
2023-10-20T02:00:01Z 37.45GB <logical> [restore_to_time: 2023-10-20T02:28:51Z]
2023-10-19T02:00:01Z 37.41GB <logical> [restore_to_time: 2023-10-19T02:31:45Z]
2023-10-18T02:00:01Z 37.36GB <logical> [restore_to_time: 2023-10-18T02:27:48Z]
2023-10-17T02:00:02Z 38.79GB <logical> [restore_to_time: 2023-10-17T04:32:50Z]
2023-10-16T02:00:01Z 38.01GB <logical> [restore_to_time: 2023-10-16T02:30:04Z]
2023-10-15T02:00:01Z 37.95GB <logical> [restore_to_time: 2023-10-15T02:19:33Z]
2023-10-14T02:00:02Z 38.61GB <logical> [restore_to_time: 2023-10-14T04:33:08Z]
2023-10-13T02:00:01Z 38.47GB <logical> [restore_to_time: 2023-10-13T04:31:40Z]
2023-10-12T08:40:53Z 39.20GB <logical> [restore_to_time: 2023-10-12T11:56:11Z]
PITR chunks [94.93GB]:
2023-10-12T11:56:12Z - 2023-10-25T22:57:34Z
# pbm cleanup -y --older-than 5d
(waited a few hours)
# pbm status
Cluster:
========
cluster01:
- cluster01/mongo-013.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-014.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-015.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-012.vrack:27017 [S]: pbm-agent v2.3.0 OK
- cluster01/mongo-011.vrack:27017 [P]: pbm-agent v2.3.0 OK
PITR incremental backup:
========================
Status [OFF]
Currently running:
==================
(none)
Backups:
========
S3 gra s3://https://s3.gra.perf.cloud.ovh.net/xxxxxxx01
Snapshots:
PITR chunks [1.13GB]:
2023-10-25T23:07:34Z - 2023-10-26T01:57:34Z (no base snapshot)
The log displays the following error:
2023-10-26T07:50:09Z E [cluster01/mongo-014.vrack:27017] [pitr] init: catchup: no backup found. full backup is required to start PITR
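As far as I understand, PITR cannot resume until a new base snapshot exists, so at this point the only way I know to bring it back would be roughly:
# pbm backup
# pbm config --set pitr.enabled=true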
Additionally, something that did not happen in prior tests: when the cleanup command ran, the primary node, mongo-011, experienced a load spike to 100%, which forced me to stop the pbm-agent service on it.
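To be precise, I stopped the agent via its systemd unit:
# systemctl stop pbm-agent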
pbm-agent.service logs for mongo-011:
Oct 25 23:02:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:02:11.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:02:11.000+0000 I got epoch {1698200847 8}
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:15.000+0000 E [cleanup] delete backup files "2023-10-19T02:00:01Z": get file list: "2023-10-19T02:00:01Z/": RequestError: send request failed
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: caused by: Get "https://s3.gra.perf.cloud.ovh.net/xxxxxx01?marker=2023-10-19T02%3A00%3A01Z%2Fcluster01%2Fxxxxxxx.paymentstoragemovements.s2&prefix=2023-10-19T02%3A00%3A01Z%2F": dial tcp 141.95.161.76:443: connect:>
Oct 25 23:10:15 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:15.000+0000 D [cleanup] got physical restores list: 0
Oct 25 23:10:55 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:55.000+0000 D [cleanup] got backups list: 14
Oct 25 23:10:55 mongo-011 pbm-agent[315203]: 2023-10-25T23:10:55.000+0000 D [cleanup] bcp: 2023-10-12T08:40:53Z.pbm.json
Oct 25 23:11:04 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:04.000+0000 W [cleanup] skip snapshot 2023-10-12T08:40:53Z: file "2023-10-12T08:40:53Z/cluster01/xxxx01.partners.s2": no such file
Oct 25 23:11:04 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:04.000+0000 D [cleanup] bcp: 2023-10-13T02:00:01Z.pbm.json
Oct 25 23:11:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:11.000+0000 W [cleanup] skip snapshot 2023-10-13T02:00:01Z: file "2023-10-13T02:00:01Z/cluster01/xxxx01.plannings.s2": no such file
Oct 25 23:11:11 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:11.000+0000 D [cleanup] bcp: 2023-10-14T02:00:02Z.pbm.json
Oct 25 23:11:20 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:20.000+0000 W [cleanup] skip snapshot 2023-10-14T02:00:02Z: file "2023-10-14T02:00:02Z/cluster01/local.oplog.rs.bson.s2": no such file
Oct 25 23:11:20 mongo-011 pbm-agent[315203]: 2023-10-25T23:11:20.000+0000 D [cleanup] bcp: 2023-10-15T02:00:01Z.pbm.json
pbm-agent.service logs for mongo-012:
Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-012 pbm-agent[333095]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired
Oct 26 02:00:02 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:02.000+0000 I got command backup [name: 2023-10-26T02:00:01Z, compression: s2 (level: default)] <ts: 1698285601>
Oct 26 02:00:02 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:02.000+0000 I got epoch {1698200847 8}
Oct 26 02:00:17 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:17.000+0000 D [backup/2023-10-26T02:00:01Z] nomination timeout
Oct 26 02:00:17 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:17.000+0000 D [backup/2023-10-26T02:00:01Z] skip after nomination, probably started by another node
Oct 26 02:00:37 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:37.000+0000 D [pitr] start_catchup
Oct 26 02:00:37 mongo-012 pbm-agent[333095]: 2023-10-26T02:00:37.000+0000 E [pitr] init: catchup: no backup found. full backup is required to start PITR
Oct 26 02:01:07 mongo-012 pbm-agent[333095]: 2023-10-26T02:01:07.000+0000 D [pitr] start_catchup
Oct 26 02:01:07 mongo-012 pbm-agent[333095]: 2023-10-26T02:01:07.000+0000 E [pitr] init: catchup: no backup found. full backup is required to start PITR
pbm-agent.service logs for mongo-013:
Oct 25 22:57:34 mongo-013 pbm-agent[328438]: 2023-10-25T22:57:34.000+0000 I [pitr] created chunk 2023-10-25T22:52:34 - 2023-10-25T22:57:34. Next chunk creation scheduled to begin at ~2023-10-25T23:02:34
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-013 pbm-agent[328438]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired
Oct 25 23:03:38 mongo-013 pbm-agent[328438]: 2023-10-25T23:03:38.000+0000 I [pitr] created chunk 2023-10-25T22:57:34 - 2023-10-25T23:02:34. Next chunk creation scheduled to begin at ~2023-10-25T23:07:34
Oct 25 23:07:50 mongo-013 pbm-agent[328438]: 2023-10-25T23:07:50.000+0000 I [pitr] created chunk 2023-10-25T23:02:34 - 2023-10-25T23:07:34. Next chunk creation scheduled to begin at ~2023-10-25T23:12:34
Oct 25 23:12:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:12:34.000+0000 I [pitr] created chunk 2023-10-25T23:07:34 - 2023-10-25T23:12:34. Next chunk creation scheduled to begin at ~2023-10-25T23:17:34
Oct 25 23:17:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:17:34.000+0000 I [pitr] created chunk 2023-10-25T23:12:34 - 2023-10-25T23:17:34. Next chunk creation scheduled to begin at ~2023-10-25T23:22:34
Oct 25 23:22:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:22:34.000+0000 I [pitr] created chunk 2023-10-25T23:17:34 - 2023-10-25T23:22:34. Next chunk creation scheduled to begin at ~2023-10-25T23:27:34
Oct 25 23:27:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:27:34.000+0000 I [pitr] created chunk 2023-10-25T23:22:34 - 2023-10-25T23:27:34. Next chunk creation scheduled to begin at ~2023-10-25T23:32:34
Oct 25 23:32:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:32:34.000+0000 I [pitr] created chunk 2023-10-25T23:27:34 - 2023-10-25T23:32:34. Next chunk creation scheduled to begin at ~2023-10-25T23:37:34
Oct 25 23:37:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:37:34.000+0000 I [pitr] created chunk 2023-10-25T23:32:34 - 2023-10-25T23:37:34. Next chunk creation scheduled to begin at ~2023-10-25T23:42:34
Oct 25 23:42:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:42:34.000+0000 I [pitr] created chunk 2023-10-25T23:37:34 - 2023-10-25T23:42:34. Next chunk creation scheduled to begin at ~2023-10-25T23:47:34
Oct 25 23:47:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:47:34.000+0000 I [pitr] created chunk 2023-10-25T23:42:34 - 2023-10-25T23:47:34. Next chunk creation scheduled to begin at ~2023-10-25T23:52:34
Oct 25 23:52:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:52:34.000+0000 I [pitr] created chunk 2023-10-25T23:47:34 - 2023-10-25T23:52:34. Next chunk creation scheduled to begin at ~2023-10-25T23:57:34
Oct 25 23:57:34 mongo-013 pbm-agent[328438]: 2023-10-25T23:57:34.000+0000 I [pitr] created chunk 2023-10-25T23:52:34 - 2023-10-25T23:57:34. Next chunk creation scheduled to begin at ~2023-10-26T00:02:34
pbm-agent.service logs for mongo-014:
Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 I got command cleanup <ts: 1698274931>
Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 I got epoch {1698200847 8}
Oct 25 23:02:12 mongo-014 pbm-agent[305773]: 2023-10-25T23:02:12.000+0000 D [cleanup] skip: lock not acquired
pbm-agent.service logs for mongo-015:
# Nothing during this period
Additional info:
cluster01:PRIMARY> db.pbmLockOp.find({})
{ "_id" : ObjectId("65399e73b2580fa52a2e0248"), "type" : "cleanup", "replset" : "cluster01", "node" : "mongo-011.vrack:27017", "opid" : "65399e73787d0cd92f116498", "epoch" : Timestamp(1698200847, 8), "hb" : Timestamp(1698275486, 2) }
# pbm config
pitr:
  enabled: false
  oplogSpanMin: 5
  compression: s2
storage:
  type: s3
  s3:
    provider: aws
    region: gra
    endpointUrl: https://s3.gra.perf.cloud.ovh.net/
    bucket: xxxxxx01
    credentials:
      access-key-id: '***'
      secret-access-key: '***'
    maxUploadParts: 10000
    storageClass: STANDARD
    insecureSkipTLSVerify: false
backup:
  compression: s2
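For reference, PITR was enabled before the test with the usual settings, roughly:
# pbm config --set pitr.enabled=true
# pbm config --set pitr.oplogSpanMin=5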
Any guidance on this would be greatly appreciated.
Thank you.