Hi there,
I have a MongoDB collection that has around 92 MM of documents, I need to delete around 24 MM of those documents. I have the “_id” of the documents I want to delete.
I am looking the better/fastest way to delete those records. I could have a Java application that iterate for the 24 MM of “_id” and delete one by one, OR I can have a JSON file with the “_id” and do using Mongo CLI or usingmongoimport --mode=delete --file=data.json.
Any idea what could be fastest?
Thanks in advance.
1 Like
For the record: I end-up generating 6 text files of 4 MM of lines each, “a statement per line”. e.g: deleteOne (Object of primary key _id)…
The line contains the primaryKey of a record I want to delete. Then of +95 MM I want to delete around 24 MM records.
Loading the files like:
mongo mongodb://%2Ftmp%2Fmongodb-27017.sock/myCollectionName < /tmp/deletes1.json &
…
mongo mongodb://%2Ftmp%2Fmongodb-27017.sock/myCollectionName < /tmp/deletes6.json &
of course putting each command in background. Then loading the file 1 and 2, took like 70 minutes, then loading 2 other files at the same time (file 3 and 4), and finally loading the last 2 files (5 and 6). In total a bit over 3 hours and half. Which is was fastest than Java one delete at the time, then around 7 and half hours, of course Java is doing one record at the time without using threads.
On another day, I loaded file 1, 2 and 3 at the same time then when those 3 finished, I loaded file 4, 5 and 6, that took 2.5 hours, and on another day I loaded the 6 files at the same time “6 processes” it also took 2.5 hours.
The I think if I load the file 1 ,2 and 3 and once those are finish then I loaded the files 4, 5, 6 because that doesn’t stress the system as much as 6 files at the same time, and the outcome is the same.
Cheers,
1 Like
Thank you for sharing your results and solution.
@jbiancot , I have similar requirement to delete about 140M records out of 250M, by filtering timestamp field. timestamp field have index. My mongo version is 3.6
My approach :
I will run below queries 1 by 1 from mongo shell ,
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-07-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-08-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-09-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-10-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-11-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2018-12-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-01-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-02-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-03-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-04-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-05-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-06-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-07-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-08-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-09-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-10-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-11-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2019-12-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2020-01-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2020-02-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2020-03-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2020-04-01”)}});
db.myLargeCollection.remove({timestamp:{“$lt”:ISODate(“2020-05-01”)}});
I tested delete operation by loading prod data on test environment, each delete took approx. 5-8 mins to delete 6-8 million records.
I have below questions:
- Does this approach is ok? or any better other approach.
- Does this delete operation, impact write operation on same collection?
- should i stop all operation during this deletion, required downtime?
- I have master - slave - slave setup, what things need to take care.
Thanks
Prabhat