Creating The right database for hundreds of thousands of Blog Posts?

Let me say straight off that I do not have to much control over the
MySQL server itself, but I can say that it has fairly good capacity
and is running MySQL 5.0.19 and is on a Linux box. You can assume a
default setup.

I have a good amount of experience with MySQL, however MySQL
optimization is not one of my better skill sets. I currently have a
site (really a series of them) which employs a small amount of MySQL,
but not for the main storage engine. I need to shift to a full MySQL
solution. It’s really not a very complicated site, however there is
lots of data.

The site works essentially like a blog. In fact its a small blog
aggregator. There are posts with dates, the source the post came from
as well as a link to the original article. So what has to be stored is
relatively simple. That said there are currently about 200,000 posts
and counting (uncompressed over 1 gig of data). This could easily grow to over 1 million.

The site is split into sub domains. Each sub
domain relates to a specific topic. For example:
There is a celebrity section and the sub domain is celebrity.MySite.net.
there is a video game section and the sub domain is game.MySite.net
etc…

Currently there are 19 of these sub domains in all, but it’s growing.

MySite uses two very small tables at present. The rest of the data is
stored in gziped text files (that’s the part that has to change).
These tables are as fallows.

  1. sub domain ( stores info about a particular sub domain )

CREATE TABLE subdomain (
sd_id int(10) unsigned NOT NULL auto_increment,
sd_name varchar(255) NOT NULL default ‘’,
sd_name_disp varchar(255) NOT NULL default ‘’,
sd_title varchar(255) NOT NULL default ‘’,
sd_desc varchar(255) NOT NULL default ‘’,
sd_keywords varchar(255) NOT NULL default ‘’,
sd_feed_alt_url varchar(255) NOT NULL default ‘NONE’,
sd_today_post_num int(10) unsigned NOT NULL default ‘0’,
sd_other varchar(255) default NULL,
PRIMARY KEY (sd_id)
) TYPE=MyISAM

  1. feeds ( stores RSS feed URL’s of each of the data sources. Each row
    is related to the sub domain it belongs to via the “sd_id” field.

CREATE TABLE feeds (
feed_id int(10) unsigned NOT NULL auto_increment,
sd_id int(10) unsigned NOT NULL default ‘0’,
feed_name varchar(255) NOT NULL default ‘’,
feed_site_url varchar(255) NOT NULL default ‘’,
feed_rss_url varchar(255) NOT NULL default ‘’,
feed_post_length varchar(255) NOT NULL default ‘’,
feed_type enum(‘headline’,‘main’,‘both’) NOT NULL default ‘main’,
feed_other varchar(255) default NULL,
PRIMARY KEY (feed_id)
) TYPE=MyISAM AUTO_INCREMENT=536 ;

So what I do currently is on an hourly bases (via a cron job) read
through each of the rows in the “feeds” table and store any new posts
in the appropriate folder for the corresponding sub domain. But what I
need to do, as I have said is store this data in a MySQL DB. It will
need to be searchable, quickly accessible via sub domain, and quickly
sorted by date time. I would also like to be able to have a global
search across all sub domains if possible.

Currently there is no way to search MySite and if speed is going to
be a problem as far as search I can live with that. The key is to have
quick date sorting so that I can call the latest posts for sub domain
“celebrity” in chronological order with out to much delay.

In your case you may start with one more table for posts with sub_domain_id column so you can easily lookup data for each subdomains.

If your domains overlap you would need extra table to do many to many mapping.

A lot depends on traffic you are expecting. Ie if it is huge I would think about different tables for subdomains so it is easier to split them to different servers in the future.

You do not get much answers because question is very generic - to plat schema you need to get good feeling of what queries you’re going to execute etc.

Thanks for the advice.

I think I will split the sub domains into separate tables as you mentioned. So that leaves me with only a few possible queries that I will execute.

  1. SELECT * FROM posts_SubDomainName ORDER BY posts_date DESC LIMIT X

  2. SELECT * FROM posts_SubDomainName WHERE post_id = x

  3. SELECT * FROM posts_SubDomainName WHERE post_name = ‘x’

  4. Search the text of the posts for user entered keywords.

The only other thing I need to know now is how to set up the indexes. Based on these four basic query types, what is the best way to index this table(s).

So you’ll need indexes on (post_date), (post_id), (post_name)
plus full text search index on columns you’re going to search.

Full text search may get pretty slow if your table growth large but it is other story.