is an URL a good primary key?

Hello,

I’m designing a schema for a full featured blog aggregator. I’ll have table for articles, feed, sites, hubbubs, users’ profile and other data.

Even if many of those tables will have an URL as the best natural key, I’m not sure if it could be a good primary key as well.
URL could be very long (indeed there’s any limit, I’m going to put an arbitrary one at 255 chars) and I’m not sure what will happen with joins.

So why not choose an auto-incremented ID as the key? I would prefer to not have a key completely unrelated to the real data.

Anyway since I will have million of records in not so much time from the start of the service, I’d like to avoid to modify the database schema to change the primary key on several tables, which implies changing relations, constraints, tables for n:m relations and moreover the application itself.

Some technical details: MySQL 5.1 with InnoDB storage engine. I’ll have also a master-master replication.
The very major part of write will be done via batch script during day. I’ll have a sphinx search on the database too.

So I need some opinion: will I have performance problem using an URL VARCHAR(255) as primary key at some point? It is always better to choose an integer ID? If i choose to use an ID I have to properly configure the replication options on it, which I would not with URLs (they cannot conflict).

Thank you for any help/opinion.

Lorenzo Lazzeri

[B]spider75 wrote on Fri, 22 October 2010 15:43[/B]


So I need some opinion: will I have performance problem using an URL VARCHAR(255) as primary key at some point? It is always better to choose an integer ID? If i choose to use an ID I have to properly configure the replication options on it, which I would not with URLs (they cannot conflict).

If you will or will not have performance problems depends on the expected load of your server.

But the general recommendation is to use int’s since InnoDB uses the primary key internally to find the data records, this means that the primary key is part of every index created on that table.

These due to two reasons:
1.
There are optimizations that makes inserting of rows much faster when the primary key is a continuous incrementing sequence since there is no risk of having to re-balance the tree etc.

And since the total data size used by the primary key is calculated by:
primary key column size * nr of rows * nr of indexes
It can be a substantial sum if the primary key column size is large.

But if you on the other hand seldom insert rows in the table and you always refer to them based on the URL then using an URL as primary key would be a good thing. That is usually more of a margin case.

So the rule is: use int as primary key unless you know what you are doing. )