I am involved in an effort to make the world's biodiversity information (plants, animals, etc.) freely and publicly available. Part of this effort involves a central index of the information available on the web, collated into a single searchable database running on MySQL (http://data.gbif.org). Because it is an index, the data does not partition easily due to the combinations of filters we offer, and we are beginning to hit performance issues. The filters can be seen on the “occurrences” section of the portal.
Our DB is currently running on MyISAM on a single server (2 processors and 5 GB of RAM) with a database of 90 GB. The core tables are:
- occurrence_record: 120 million rows, 32 GB of indexes
- taxon_concept: 20 million rows, 5 GB of indexes
Most queries involve a join on these two tables.
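To illustrate the query shape, here is a rough sketch of the kind of filtered join I mean — the two table names are real, but the column names and filter values are hypothetical stand-ins for the portal's occurrence filters:

```sql
-- Hypothetical sketch only: column names (taxon_concept_id, country_code,
-- year, rank) are illustrative, not our actual schema.
SELECT o.*
FROM occurrence_record o
JOIN taxon_concept t ON o.taxon_concept_id = t.id
WHERE t.rank = 'SPECIES'
  AND o.country_code = 'AU'
  AND o.year BETWEEN 1990 AND 2000;
```

Because users can combine filters freely (taxonomy, geography, time, dataset, etc.), there is no single column that makes a natural partition key: shard by taxon and a country-filtered query hits every shard, shard by country and a taxon-filtered query does the same.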
The problem is that we anticipate the data doubling in the short term, and growing 10x within 3–5 years, so we are looking to scale.
Are there any good papers on scaling strategies? Is our only option really to move to clustering? We can't partition easily, because whatever strategy we choose for splitting the data, some queries will still touch all shards.
Does anyone here consider themselves a large-database specialist and want to get involved as a mentor on an open source project to help mobilize biodiversity data? We could do with help on scaling and tuning…
Many thanks for any thoughts.