I haven’t used MongoDB time-series collections yet, but I’m wondering whether they might be a good fit for this use case.
The application is an e-commerce statistics product where we want to track a couple of metrics for every product, such as:
- Sales total (we’ve sold this product for e.g. $1234 in total)
- Total profit margin (sales total minus acquisition cost)
- Average profit margin (e.g. 61.4%)
- Number of units sold
…etc
We want these numbers aggregated on a daily, monthly and overall level.
On a daily level, we want to retain data for the last 365 days.
On a monthly level, we want to retain data for the last 48 months.
This is just to make sure that the collections don’t get too big over time.
Calculating these numbers is a solved problem, so that’s not the challenge. The challenge is how to store it in an efficient and scalable way.
Expected write frequency would be up to a few hundred times a week per product and a few hundred times per day for the whole collection.
Expected read frequency of the whole collection would be in the tens to hundreds of reads per minute during usage spikes.
My first idea was to have a plain old collection (i.e. not a time-series collection) where each document looks like this:
```js
{
  _id: "ProductId",
  salesTotal: 14352.00,  // just $inc every time we sell
  totalSoldUnits: 435,   // just $inc every time we sell
  averageMarginLast48Months: 0.614, // can be recalculated from "months.$.averageProfitMargin.margin"
  months: [
    // cap: 48
    {
      key: { year: 2022, month: 1 },
      salesTotal: 14352.00,
      averageProfitMargin: {
        margin: 61.08,
        /* In order to recalculate this margin whenever an order is shipped or returned,
           we'll need to keep all the numbers required to compute a new average, hence
           this "basedOn" array. For example, when a new order is shipped we'll push a
           new document to the "basedOn" subarray and then update the margin to
           (sum of basedOn.$.marginAmount / sum of basedOn.$.total). The "basedOn" array
           should hopefully never grow too large, since it'll only contain a month's
           worth of orders. */
        basedOn: [
          { orderLine: "{ordernumber}_{lineIdentifier}", quantity: 2, total: 234.34, marginAmount: 143.00 },
          ...
        ]
      }
    },
    ...
  ],
  days: [
    // cap: 365
    {
      date: "2022-01-01",
      soldUnits: 23,
      returnedUnits: 0
    },
    ...
  ]
}
```
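For the margin recalculation described in the `basedOn` comment, I picture something roughly like this (a mongosh sketch; the collection name `productStats` and the `newLine` value are placeholders, not part of the schema above):

```js
// Placeholder order line for a shipment of 2 units.
const newLine = { orderLine: "1001_1", quantity: 2, total: 234.34, marginAmount: 143.00 };

// Step 1: push the order line into the matching month's "basedOn" array.
db.productStats.updateOne(
  { _id: "ProductId" },
  { $push: { "months.$[m].averageProfitMargin.basedOn": newLine } },
  { arrayFilters: [{ "m.key.year": 2022, "m.key.month": 1 }] }
);

// Step 2: recompute margin = sum(marginAmount) / sum(total), stored as a ratio.
// (This pipeline update recomputes every month's margin; scoping it to just the
// affected month would need extra $cond logic.)
db.productStats.updateOne(
  { _id: "ProductId" },
  [{
    $set: {
      months: {
        $map: {
          input: "$months",
          as: "m",
          in: {
            $mergeObjects: ["$$m", {
              averageProfitMargin: {
                $mergeObjects: ["$$m.averageProfitMargin", {
                  margin: {
                    $divide: [
                      { $sum: { $map: { input: "$$m.averageProfitMargin.basedOn", as: "b", in: "$$b.marginAmount" } } },
                      { $sum: { $map: { input: "$$m.averageProfitMargin.basedOn", as: "b", in: "$$b.total" } } }
                    ]
                  }
                }]
              }
            }]
          }
        }
      }
    }
  }]
);
```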
I’d then implement the capping of the `months` and `days` subarrays using `$push` with the `$slice` modifier.
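For example, something like this (again a sketch against the hypothetical `productStats` collection):

```js
// Record a sale and cap the "days" array at the newest 365 entries in one update.
// (Incrementing an existing day's counters instead would need arrayFilters.)
db.productStats.updateOne(
  { _id: "ProductId" },
  {
    $inc: { salesTotal: 234.34, totalSoldUnits: 2 },
    $push: {
      days: {
        $each: [{ date: "2022-01-02", soldUnits: 2, returnedUnits: 0 }],
        $slice: -365  // a negative $slice keeps the last 365 elements
      }
    }
  }
);
```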
This design would lead to fairly large documents. Is there a chance we might hit the max document size or the max collection size? Would it be better if I moved the `months` and `days` subarrays into their own collections? Then I wouldn’t risk hitting the max document size, but I’m not sure how to enforce the cap of 48 (or 365) docs per product. Is there a way to achieve this elegantly, without a bunch of db roundtrips?
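For reference, the naive version I’d like to avoid looks roughly like this (assuming a hypothetical `productDays` collection with one document per product per day):

```js
// Two roundtrips per write: insert today's document, then trim anything
// older than 365 days for that product.
const cutoff = new Date(Date.now() - 365 * 24 * 60 * 60 * 1000);
db.productDays.insertOne({ productId: "ProductId", date: new Date(), soldUnits: 23, returnedUnits: 0 });
db.productDays.deleteMany({ productId: "ProductId", date: { $lt: cutoff } });
```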
A third option would be to use the new time-series collection feature. Then I wouldn’t have to store these pre-computed numbers at all: I could just store the individual orders/returns and calculate the numbers on the fly every time using window functions. But would that put a big load on the cluster? Would read times scale poorly with the number of products?
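For concreteness, I picture that option roughly like this (a mongosh sketch; the `orderEvents` collection and its field names are made up, and a `$setWindowFields` stage could replace the `$group` if running totals were needed):

```js
// One document per order/return event in a time-series collection (MongoDB 5.0+).
db.createCollection("orderEvents", {
  timeseries: { timeField: "ts", metaField: "productId", granularity: "hours" }
});

// Daily sold units and sales total for one product, computed at read time.
db.orderEvents.aggregate([
  { $match: {
      productId: "ProductId",
      ts: { $gte: new Date(Date.now() - 365 * 24 * 60 * 60 * 1000) }
  } },
  { $group: {
      _id: { $dateTrunc: { date: "$ts", unit: "day" } },
      soldUnits: { $sum: "$quantity" },
      salesTotal: { $sum: "$total" }
  } },
  { $sort: { _id: 1 } }
]);
```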
I could do some experimenting by generating a ton of dummy data and trying out all three options, but before I do that, I thought it would be nice to hear from people who have already walked this road.