To normalise or not to normalise?

I am designing a quite simple web app, but I am trying to design it so that it will scale well.

Basically I have an item table and a user table. A user may have many items and each item has exactly one user.

Every time an item is retrieved, the corresponding user must also be found. Currently, when listing items, the user's username is retrieved using an inner join on the user_id field that is in both tables.
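For concreteness, a sketch of the current normalised listing query; the exact table and column names (item, user, user_id, username, score) are assumptions based on the description above and the benchmark queries later in the thread.

```sql
-- Normalised layout: one join per listing query.
SELECT item.*, user.username
FROM item
INNER JOIN user ON user.user_id = item.user_id
ORDER BY item.score DESC
LIMIT 20;
```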

Reading some material on high-performance MySQL and website design, it is suggested that de-normalisation is something you should consider.

So I could de-normalise this schema to insert a ‘cached copy’ of the user’s username in the item table.

This will save me a join in the most common query in the application. This is good.
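A sketch of what that de-normalised layout might look like; the column type is an assumption, and the cached column simply mirrors user.username at write time.

```sql
-- Add a cached copy of the username to the item table.
ALTER TABLE item ADD COLUMN username VARCHAR(255) NOT NULL DEFAULT '';

-- The listing query then needs no join at all.
SELECT * FROM item ORDER BY score DESC LIMIT 20;
```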

However, I will face the issue of keeping the item.username field up to date. So, if a user changes their username, all their items will need to be updated. Given MySQL's table-locking rules (and the fact that I want the app to scale well in a replicated setup), is this a good idea?
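The write that de-normalisation forces on every username change would look something like the sketch below (the literal values are hypothetical). Under MyISAM-style table locking, this UPDATE locks the whole item table while it touches all of that user's rows, and the same multi-row write must replay on every replica.

```sql
-- Keep the cached column in sync after a username change.
UPDATE item
SET username = 'new_name'
WHERE user_id = 42;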

I do not think you need denormalization in this case.

Usually you can do a separate lookup for the user, which can later be cached with a good hit ratio.
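The separate lookup being suggested is a primary-key point read, something like the sketch below (the id is hypothetical); because it is keyed on a single user_id, its result caches well in an application-side cache.

```sql
-- Per-user point lookup; the result can be cached keyed on user_id.
SELECT username FROM user WHERE user_id = 42;
```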

Denormalization helps in other cases.

Interesting idea. So instead of an inner join, might I be better off pulling out the user_ids from the selected items and doing a second query using the IN operator on the list of user_ids?

Although, I guess a query on a list of user_ids with the IN operator might not generate cache hits very often?

Well, this sounds like a bit more complicated query than I would expect.

Basically, I'd suggest trying both layouts with a simple test and seeing. This is a case where trying is better than theorizing :)

Good point…

I have quickly written a benchmark script in PHP and thought I would post the results.

3 methods:

  1. De-normalised: SELECT SQL_NO_CACHE * FROM item ORDER BY score DESC LIMIT N;

  2. Normalised: SELECT SQL_NO_CACHE * FROM item INNER JOIN user ON item.user_id = user.user_id ORDER BY score DESC LIMIT N;

  3. Split: SELECT SQL_NO_CACHE * FROM item ORDER BY score DESC LIMIT N; followed by SELECT username FROM user WHERE user_id IN (…).

Results are for varying LIMIT values:

| Limit | 20 | 40 | 60 | 80 | 100 | 120 | 140 |
|---|---|---|---|---|---|---|---|
| 1. De-normalised | 0.08 | 0.13 | 0.19 | 0.25 | 0.29 | 0.34 | 0.40 |
| 2. Normalised | 0.12 | 0.23 | 0.33 | 0.42 | 0.52 | 0.62 | 0.73 |
| 3. Split | 0.15 | 0.27 | 0.37 | 0.48 | 0.60 | 0.71 | 0.81 |

So, de-normalised is a bit faster, but it's not really going to make THAT much difference. Method 3 is slower than method 2, suggesting, once again, that it's best to throw it all at MySQL and let it worry about optimisation…

Database setup: There are 100,000 items, 10,000 users. All records randomly generated with evenly distributed random values. Times are in seconds to run the query 100 times. Query cache disabled (I hope).

Thanks for sharing results.

Your benchmarks, however, oversimplify things. Normally in a web application the login step is separate, so you read and check the user's access before you start doing other things; this means you should already have the user_id available before you run the query.

But anyway in your application it may be different.