I have this project where we store keywords by the hour. Then at the end of the month we sum up the data and move it over to another table for research.
In the past we just stored the keyword and the datetime in a table named for the month it belonged to, then rolled those keywords and their summed counts over to another table for monthly research.
The problem is that the keywords themselves were being used as the keys for cross-referencing them against the other data we collected, and this, as we all know, is not a good thing.
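Roughly, the old layout looked like this (table and column names here are made up just for illustration):

    -- Old layout, more or less: one table per month, with the keyword
    -- string itself acting as the key.
    CREATE TABLE keywords_2011_09 (
        keyword   VARCHAR(255) NOT NULL,
        cnt       INT UNSIGNED NOT NULL,
        stat_time DATETIME     NOT NULL,
        PRIMARY KEY (keyword, stat_time)
    );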
While the first build works really well for what we originally had in mind, it's just not going to cut it for where we're trying to head.
At this point we are looking at around 4 billion unique keywords per month. So I thought about using InnoDB for the daily data, plus a central table holding each keyword: that table would cover a year's worth (effectively all time) of unique keywords and assign a key to each one.
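Something along these lines (again, names are placeholders, not the real schema):

    -- Central dictionary: one row per unique keyword, all time.
    CREATE TABLE keyword_dict (
        keyword_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        keyword    VARCHAR(255)    NOT NULL,
        PRIMARY KEY (keyword_id),
        UNIQUE KEY uq_keyword (keyword)
    ) ENGINE=InnoDB;

    -- Monthly data only carries the surrogate key, count, and date.
    CREATE TABLE keyword_counts_2011_10 (
        keyword_id BIGINT UNSIGNED NOT NULL,
        cnt        INT UNSIGNED    NOT NULL,
        stat_time  DATETIME        NOT NULL,
        KEY idx_kw_time (keyword_id, stat_time)
    ) ENGINE=InnoDB;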
The problem with this is that the system would be running a lot of inserts into the monthly table with the key, count, and date. But since the central table holds the keyword-to-key mapping, the system and the users would all be touching that one table, creating a bottleneck as millions of selects and inserts, including fulltext searches, were run every minute against a single table with over 400 billion keywords.
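Every insert stream would have to run something like this standard MySQL get-or-create idiom against that one table, which is exactly the traffic I'm worried about:

    -- Get-or-create the key for a keyword on the central table.
    INSERT INTO keyword_dict (keyword)
    VALUES ('some keyword')
    ON DUPLICATE KEY UPDATE keyword_id = LAST_INSERT_ID(keyword_id);

    -- Returns the id whether the row was just inserted or already existed.
    SELECT LAST_INSERT_ID();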
I have looked at InnoDB, Sphinx, and partitioning, and I'm just left stumped as to the best way to provide very fast search for users on the monthly totals while at the same time reducing the load caused by the keywords being used as their own keys.
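For example, one of the partitioning ideas was keeping all the monthly totals in a single table, range-partitioned by month, so a search over one month only scans one partition (a sketch only; names and dates are placeholders):

    -- Monthly totals, range-partitioned by month. Note the partitioning
    -- column has to be part of the primary key in MySQL.
    CREATE TABLE keyword_monthly_totals (
        keyword_id BIGINT UNSIGNED NOT NULL,
        stat_month DATE            NOT NULL,  -- first day of the month
        total_cnt  BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (keyword_id, stat_month)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(stat_month)) (
        PARTITION p201109 VALUES LESS THAN (TO_DAYS('2011-10-01')),
        PARTITION p201110 VALUES LESS THAN (TO_DAYS('2011-11-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );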
In a nutshell, I have a data warehouse project, and I'm trying to find a way to keep everyone in line on their keys without the central table that keeps all the keys aligned for joins becoming a bottleneck under all the queries, updates, and inserts.
So, any ideas on the best way to store 400 billion keywords for search, with a key for reference, and still return very fast results?