120 million records - scaling up/out advice?


I am involved in an effort to make the world's biodiversity information (plants, animals, etc.) freely and publicly available. Part of this effort involves a central index of the information available on the web, collated into a single searchable database running on MySQL (http://data.gbif.org). Being an index, the data does not partition easily due to the combinations of filters we offer, and we are beginning to hit performance issues. The filters can be seen in the “occurrences” section of the portal.

Our DB currently runs MyISAM on a single server (2 processors and 5 GB RAM) with a DB size of 90 GB. The core tables are:

  • occurrence_record: 120 million rows, 32 GB index
  • taxon_concept: 20 million rows, 5 GB index

Most queries involve a join between these two tables.

The problem is that we anticipate the data doubling in the short term and growing roughly 10x within 3-5 years, so we are looking to scale.

Are there any good papers on scaling strategies? Is our only option really to move to clustering? We can’t partition easily, as any strategy for splitting the data means some queries will touch all data shards.
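To illustrate why the filter combinations defeat partitioning: whatever column you shard on, any filter on a *different* column has to scatter to every shard and gather the results. A toy sketch (the shard key, record fields, and function names here are all invented for illustration, not from the actual GBIF schema):

```python
# Hypothetical: occurrence records sharded by taxon_concept_id modulo shard count.
SHARDS = [
    [{"id": 1, "taxon": 10, "country": "DK"}],  # shard 0
    [{"id": 2, "taxon": 11, "country": "SE"}],  # shard 1
]

def shard_for_taxon(taxon_id: int, n: int = len(SHARDS)) -> int:
    return taxon_id % n

def query_by_taxon(taxon_id: int):
    # Filter on the shard key: touches exactly one shard.
    shard = SHARDS[shard_for_taxon(taxon_id)]
    return [r for r in shard if r["taxon"] == taxon_id]

def query_by_country(country: str):
    # Filter on any other column: must scatter to EVERY shard and gather.
    return [r for shard in SHARDS for r in shard if r["country"] == country]
```

With many independently combinable filters (taxon, country, coordinates, date, …), most queries end up in the scatter-gather path, so sharding buys little.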

Does anyone claim to be a large-DB specialist and want to get involved as a mentor on an open source project to help mobilize biodiversity data? We could do with help on scaling and tuning…

Many thanks for any thoughts


I see it is going down a lot at the moment due to a memory issue in the web app - [URL]http://aoraia.gbif.org/portaldev/welcome.htm[/URL] is a dev version of the same application.

What kind of disk setup do you have?
What type of disks?
How many disks?

Do all queries use proper indexes?
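The way to check is to prefix each query with EXPLAIN and look at the key column. The same idea can be sketched with SQLite from Python (SQLite stands in for MySQL here only so the example is self-contained and runnable; the table and index names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE occurrence_record ("
    "id INTEGER PRIMARY KEY, taxon_concept_id INTEGER)"
)
conn.execute("CREATE INDEX idx_taxon ON occurrence_record (taxon_concept_id)")

# Ask the planner how it would execute the query. An index-backed
# lookup names the index; a full scan shows up as "SCAN" instead.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM occurrence_record WHERE taxon_concept_id = 42"
).fetchall()
detail = plan[0][-1]
print(detail)  # should mention idx_taxon rather than a full-table scan
```

In MySQL the equivalent is `EXPLAIN SELECT …` and checking that the `key` column is not NULL and that `rows` is small relative to the table.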

As for helping out, I could be interested in being a bit of a mentor if you want.

I think you could try the Sphinx search engine for your search tasks. It supports clustering out of the box (it will be there when you want it), and in many cases it shows better performance on large datasets compared to MySQL.

Many thanks for the replies!

Firstly, let me say that I am not a hardware guy at all… I was involved in the schema and tuning but not the hardware, so I asked and got this response:

"We’re using two RAIDs actually. The actual MyISAM data is stored on the
“locally attached” drives: a ServeRAID 6i from IBM with 4 x 73 GB/15K RPM in
RAID5. The stripe size of the RAID was chosen as the largest possible (32 KB if I
remember well) and the mkfs stride parameters were set accordingly; the
ext3 is of type “largefile4”.

The indexes sit on a DS400 box with 14 x 73GB 10K drives. The box is hooked
via two optical interfaces directly to the DS400 mainly because we were
“cheap” enough not to have an optical switch. So we only have two servers
connected to the box. The DS400 is divided into two partitions so that gives
you 6 disks and 7 disks (one is a hot spare). Even with the 10K disks the
box is clearly faster than the local 6i controller and that’s why we’ve put
the indexes on it? Don’t really remember. Configuration wise the DS400 is
pretty much the same: largest possible stripe (64KB) and mkfs accordingly.
Both devices run “write back” type of caches. "

We are getting a MySQL consultant next month for a week to offer advice on immediate hardware purchasing and to help us plan to scale to 1,000 million records.

We are also testing the same DB with both MyISAM and InnoDB - we are able to run a read-only version by taking snapshots and importing, but ideally we’d love to go fully real-time with harvesting and processing data - so I presume for locking reasons we need InnoDB, right?

I will look at Sphinx too.

Many thanks - and Sterin, if you are serious, I’d love to show you the schema, detail more of what we do, and get your input. That goes for anyone who wants to offer input on the DB side. We are based in Copenhagen, by the way, so not so far from you, Sterin.

OK, I guess it looked like a good start until you hit the ceiling.

So basically, for one server you have:

  • 4 x 15K disks in RAID5 for data, 128 MB cache (ServeRAID 6i)
  • 6 x 10K disks in RAID5 for indexes, 256 MB cache (default DS400)

(since it shares the DS400 with another server, as I understand it)

Some more questions:
Just to make sure: is the bottleneck CPU or I/O?

Are the performance problems read-only (I got that feeling from your post), or are there writes too?

If there are writes too, what is the read/write ratio?

How many concurrent queries are we talking about at peak, or is each query too slow even when nobody else is using the server?

Some assumptions made by me, please correct me if I’m wrong:
You are running a 64 bit OS.

The bottleneck is I/O, since that is usually the case on large DBs.

Some remarks (although some of these recommendations might change depending on your answers to the questions above):
Get MUCH more RAM than 5 GB.
RAM is used as a cache to avoid disk reads.
Since disk reads are the most expensive operation for a DB server, you want to avoid them at all costs.
With your amount of data I would suggest at least 32 GB of RAM, preferably even more.
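To put a number on that for MyISAM specifically: the key buffer caches only index blocks (data pages are left to the OS page cache), so the index sizes quoted earlier give a rough floor for how much cache the workload wants. Back-of-the-envelope only, using the figures from the original post:

```python
# Index sizes from the original post, in GB.
occurrence_index_gb = 32
taxon_index_gb = 5

# Total index data competing for cache; the current server has 5 GB
# of RAM in total, so most index reads must be hitting disk.
total_index_gb = occurrence_index_gb + taxon_index_gb
print(total_index_gb)  # 37
```

That is why 32 GB or more of RAM is a reasonable starting point here: it gets most of the hot index blocks into memory.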

Consider RAID 10 instead of RAID5 for faster random reads and writes (though it depends on the read/write pattern your DB produces), and add even more disks to distribute the load.
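The "faster random writes" part comes from the per-write I/O cost: a small random write on RAID5 needs read-old-data, read-old-parity, write-new-data, write-new-parity (4 disk operations), while RAID10 just writes the two mirror copies (2 operations). An idealised comparison (real controllers with write-back cache blur this in practice):

```python
def raid5_write_ios(writes: int) -> int:
    # Read old data + read old parity + write data + write parity.
    return writes * 4

def raid10_write_ios(writes: int) -> int:
    # Write to both mirrors.
    return writes * 2

# For the same random-write workload, RAID5 issues twice
# as many disk operations as RAID10.
print(raid5_write_ios(1000), raid10_write_ios(1000))  # 4000 2000
```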

As for scaling up/out, I usually think it is best to keep things simple and go pretty far with one server (or two, where one is for backup) to avoid a lot of extra work.

To be able to scale out, the DB needs to be fairly static, without many writes, because each server must perform all the writes, and there is also a delay in distributing the writes to the other servers.
So if you have a very write-heavy or time-critical DB, scaling out is basically not an option.

In your case it sounds like a pretty static DB, so scaling out is an option, but in my opinion you save yourself a lot of work by keeping the number of servers down.

Hi Sterin

I will reply by email with full details of what we are doing.

Many thanks!