split huge tables or just query differently?

This forum looks like a great resource - thanks for creating it.

I have a challenge that no doubt has been faced by others before. I have developed a rather complex visitor/event/click tracking script that logs to multiple tables. This is not just simple stat tracking – so I must normalize the database and have each table related by primary key IDs, in many-to-one and many-to-many relationships … hence lots of JOINS.

The problem is that the tables have grown rather large over time: up to 1.4 GB (22 million rows) and increasing relatively quickly right now. When I try to view some stats, my connection to the server just gives up or takes forever and chews up a lot of CPU. I have INDEXED everything I can. Buffers have been increased. I’ve used EXPLAIN, and even queries that are fully indexed take too long, or don’t run at all if they involve a huge table. This is on a dedicated server with 2 GB of RAM and a 1.7 GHz Intel Celeron (not fast, I know).

Summary tables are an option for some stat reports but not for others.

So it appears that this is a “big table” problem. Although from what I’ve seen here, a 1GB/12 mil row table shouldn’t be any problem for mysql? However, I can’t just let the tables grow forever…

I see two solutions, aside from getting better hardware:

  1. Split the tables up by date, i.e., start logging to a new table when the current one reaches a certain size.

or

  2. Optimize my queries better somehow.

The problem with #1 (splitting tables) is that I’ve got primary key IDs starting in row #1 (the date of the first log) that may need to be referenced today by one or more other tables. (That’s one reason why archiving isn’t a solution.) Assuming I can figure that out, there’s the problem of a SELECT having to span multiple tables if they contain data for the requested date range. UNION may work here, but again, we’re talking about multiple split tables being UNIONed, joined and selected from.
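Roughly what I imagine a date-range report turning into after a split; the table and column names here are just placeholders:

    -- Two hypothetical split tables (log_2006_q3, log_2006_q4) UNIONed into a
    -- derived table, then joined and aggregated. All names are made up.
    SELECT u.visitor_id, v.country, COUNT(*) AS events
    FROM (
            SELECT visitor_id, logged_at FROM log_2006_q3
            WHERE  logged_at BETWEEN '2006-09-15' AND '2006-10-15'
            UNION ALL
            SELECT visitor_id, logged_at FROM log_2006_q4
            WHERE  logged_at BETWEEN '2006-09-15' AND '2006-10-15'
         ) AS u
    JOIN visitors v ON v.visitor_id = u.visitor_id
    GROUP BY u.visitor_id, v.country;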

Regarding #2 (better queries), would it be better to get a subset of each huge table first, say by date range, create some temp tables, and then JOIN those? Or is that effectively what MySQL does anyway if the SELECT includes a WHERE clause on the date?
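Something like this is what I’m picturing for the subset-first approach (all names are placeholders):

    -- Pull just the requested date range into a temporary table first,
    -- then run the joins against that much smaller set. Names are made up.
    CREATE TEMPORARY TABLE tmp_events AS
        SELECT event_id, visitor_id, logged_at
        FROM   events
        WHERE  logged_at BETWEEN '2006-11-01' AND '2006-11-07';

    SELECT v.country, COUNT(*) AS events
    FROM   tmp_events t
    JOIN   visitors v ON v.visitor_id = t.visitor_id
    GROUP  BY v.country;

    DROP TEMPORARY TABLE tmp_events;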

Hope this makes some sense.
Any ideas?

Hi,

[B]bluem wrote on Wed, 08 November 2006 09:53[/B]
I have INDEXED everything I can.

Maybe this is the problem? If the queries are recurrent and there is a small number of different conditions in the WHERE clauses and JOINs, you should be OK with 2-3 indexes. Have you checked the state of the queries that are running? My guess is that inserts or updates are locking the tables for a relatively long time (due to the huge number of indexes that need to be updated).
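To see what the running queries are doing, run this from the mysql command-line client:

    -- Lists every connection, its current query, its state (e.g. "Locked")
    -- and how long it has been there:
    SHOW FULL PROCESSLIST;

    -- A rough indicator of how often queries had to wait for a table lock:
    SHOW STATUS LIKE 'Table_locks_waited';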

Moreover, I think you should give InnoDB a try, as it uses row-level locking and multiple inserts/updates/selects can run simultaneously. However, you should really reconsider the schema of the database. If I were you, I would do archiving anyhow and use a small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects. Of course, the decisions depend on the actual needs of the application.
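Switching a table’s engine is a single statement, although on a table this size the rebuild can take a long time (the table name below is only an example):

    -- Rebuilds the table as InnoDB; the whole table is copied, so run it
    -- off-peak and test on a copy first. Table name is hypothetical.
    ALTER TABLE page_views ENGINE=InnoDB;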

Thank you for the reply.

[B]inner wrote on Wed, 08 November 2006 01:28[/B]
If the queries are recurrent and there is a small number of different conditions in the WHERE clauses and JOINs, you should be OK with 2-3 indexes. Have you checked the state of the queries that are running? My guess is that inserts or updates are locking the tables for a relatively long time (due to the huge number of indexes that need to be updated).

When I say I’ve indexed everything, I mean on the fields that are being used in WHERE clauses … not really ‘everything’.

You could very well be right about locking – mysql is very busy inserting rows constantly, every pageview, etc. Maybe 50,000 new entries a day, I don’t know for sure.

I’m not sure how to check the state of queries that are running (forgive my ignorance). How is this done?

On that note, how DOES mysql do inserts/updates to a table that is at the same time being SELECTed from? Is it one or the other, or can they be done at the same time?

I like your idea of “use small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects”. I might just have to go that route somehow. (Did you mean separate tables, or did you actually mean separate databases?)

Thanks again.

Awesome, you’ve clarified everything. Thanks for the links. I will definitely study Peter’s presentation.

I don’t have slow query logging enabled (I don’t think so, anyway). If it were logging, is there a standard way of pulling up the log?
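From a quick look at the manual, I think checking and enabling it looks something like this (the file path is just an example):

    -- Check whether the slow query log is on and what the threshold is:
    SHOW VARIABLES LIKE 'log_slow_queries';
    SHOW VARIABLES LIKE 'long_query_time';

    -- In my.cnf, under [mysqld] (needs a server restart):
    --   log-slow-queries = /var/log/mysql/mysql-slow.log
    --   long_query_time  = 2
    -- The log itself is plain text; the bundled mysqldumpslow script
    -- summarizes it.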

FYI, status says:
Slow_launch_threads is 0
Slow_queries is 67

Hi,

Your problem is typical: over time, tables become large, which first makes them consume more CPU due to large index ranges etc., and then they become disk-bound, which causes a dramatic slowdown.

In your case, think about what the queries you’re executing actually need to do. If a query has to fetch 10,000 rows from disk that are all in random locations, and nothing is likely to be cached, the query will be slow and you will need to find a way to do the same thing some other way.
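EXPLAIN gives a quick estimate of that: the rows column shows roughly how many rows MySQL expects to examine for each table (the query below is only an illustration):

    -- A large "rows" value against a big table usually means many random
    -- disk reads. Table and column names here are invented.
    EXPLAIN
    SELECT v.visitor_id, COUNT(*) AS clicks
    FROM   visits v
    JOIN   clicks c ON c.visit_id = v.visit_id
    WHERE  v.visit_date BETWEEN '2006-10-01' AND '2006-10-31'
    GROUP  BY v.visitor_id;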

A more detailed answer would need more information.

I’ve gotten some good ideas as to what to try now; I appreciate the response. I read over your MySQL presentation (the link from “inner”) and it’s very valuable. My only wish is that it had even more detail to fill in between the main bullet points, for newbies like me. :) Nevertheless, I printed it out and have gleaned some great tips from it. Thanks!

Hi bluem,
Sorry to butt in here; it sounds like you’ve pretty much got it sorted.

Have you considered splitting the main data tables by time period, say into 3-month chunks, as identically structured tables with names based on the period, e.g. q1_data, q2_data, etc.?

That way your analysis application could be clever about which sections of data it analyses, and could build UNION queries or use a MERGE-type table to combine those periods into more usable chunks of time (years, etc.).
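A MERGE table is just an umbrella over identically structured MyISAM tables, so queries can treat the quarters as one table (everything below uses made-up names):

    -- The column definitions must match the underlying q*_data tables.
    CREATE TABLE data_2006 (
        id         INT UNSIGNED NOT NULL,
        visit_date DATETIME     NOT NULL,
        page       VARCHAR(255) NOT NULL,
        KEY (visit_date)
    ) ENGINE=MERGE UNION=(q1_data, q2_data, q3_data, q4_data) INSERT_METHOD=LAST;

    -- A query against data_2006 spans all four quarters transparently:
    SELECT COUNT(*) FROM data_2006 WHERE visit_date >= '2006-07-01';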

This way you can also split the analysis data onto multiple disks if available. I have found that most of the bottlenecks we experience are disk-based rather than anything else: our ‘database’ comprises 400+ databases, each with 100+ tables, over 800 GB in total. It’s mainly SELECT queries with multiple joins, which in most cases cross databases (I still find it amazing how MySQL handles all this without breaking a sweat!).
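For what it’s worth, a cross-database join just qualifies each table with its database name, as long as the connected user has rights on both (the names below are invented):

    -- database.table qualification; everything here is a made-up example.
    SELECT o.order_id, o.total, pv.landing_page
    FROM   shopdb.orders o
    JOIN   statsdb.page_views pv ON pv.visitor_id = o.visitor_id
    WHERE  o.order_date >= '2006-11-01';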

I also agree with inner’s first post: a small table used to collect the new data, flushed into the analysis tables say once a day, would certainly stop the locking problems you seem to be experiencing. I know nothing about InnoDB, so I can’t comment on how that would handle things better or differently.
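The daily flush could be as simple as this, assuming a small collect table taking the live inserts and a big analysis table for reporting (the table names and the logged_at column are invented):

    -- Move everything logged before today into the analysis table, then
    -- remove it from the small collect table. Run from cron, off-peak.
    INSERT INTO analysis
        SELECT * FROM collect WHERE logged_at < CURDATE();

    DELETE FROM collect WHERE logged_at < CURDATE();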

Hope that helps in some way

Has.

Thanks for the input, Hasgaroth.

What you suggested is pretty much what I had decided upon, so it’s good to see an affirmation of the idea. Somehow the data has to be split by date. I was thinking that since traffic levels can vary dramatically over time, it may be better to split the tables up by size (number of rows) and have a “master” table that records the start and end dates for each split table, so queries would know which tables to join.
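Roughly what I have in mind for that master table (all names invented):

    -- One row per split table, recording which date range it covers.
    CREATE TABLE log_segments (
        table_name VARCHAR(64)  NOT NULL PRIMARY KEY,  -- e.g. 'log_data_0001'
        start_date DATETIME     NOT NULL,
        end_date   DATETIME     NOT NULL,
        row_count  INT UNSIGNED NOT NULL DEFAULT 0
    );

    -- The application asks which split tables cover a requested range:
    SELECT table_name
    FROM   log_segments
    WHERE  start_date <= '2006-11-30' AND end_date >= '2006-11-01';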

Anyway, I’m having to rewrite my entire application. But that’s OK. It’ll be a lot better for it.

A question: how do you do joins across multiple databases? I thought a query in PHP, for instance, always referenced a single database. I’d rather have a separate database for my traffic/stats, but on occasion it needs to be able to join with other tables, such as customer orders.