Large fact table (with a large composite primary key?)

I have a large fact table like this:

(day date not null,
advertiser_id int not null,
publisher_id int not null,
ad_template_id int not null,
status_id tinyint not null,
partner_id int not null,
event_source_id tinyint not null,
product_id int not null,
impressions int,
partialimpressions float,
primary key (day, advertiser_id, publisher_id, ad_template_id, status_id, partner_id, event_source_id, product_id))

All the queries I run against this table filter on day, usually combined with some of the other key columns (I think advertiser_id appears most often, which is why it's second in the key).
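For illustration, a typical query might look like this (table name and values are made up); with an equality on day, it can use a leftmost prefix of the primary key:

```sql
-- hypothetical example; the table name 'facts' is assumed
select sum(impressions)
from facts
where day = '2011-06-01'     -- equality on the leading key column
  and advertiser_id = 42;    -- uses the (day, advertiser_id) prefix
```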

My concern is: this is a large table (you can see how big the combinations get), so presumably traversing the index takes a good chunk of memory. I have some other tables with a similar structure.

I think querying against these tables is currently flushing the InnoDB buffer pool repeatedly (I can only give the buffer pool 4GB).

My questions are:

Are there any alternatives to this 'large primary key' approach? One reason I want it is to keep the uniqueness guarantee across that combination of columns.
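One common alternative (a sketch only; the table name is made up) is a surrogate auto-increment primary key, with the composite key demoted to a UNIQUE index so the uniqueness guarantee is preserved:

```sql
create table facts (          -- table name assumed
  id bigint unsigned not null auto_increment,
  day date not null,
  advertiser_id int not null,
  publisher_id int not null,
  ad_template_id int not null,
  status_id tinyint not null,
  partner_id int not null,
  event_source_id tinyint not null,
  product_id int not null,
  impressions int,
  partialimpressions float,
  primary key (id),
  -- same uniqueness semantics as the old composite primary key
  unique key uniq_dims (day, advertiser_id, publisher_id, ad_template_id,
                        status_id, partner_id, event_source_id, product_id)
) engine=innodb;
```

Note the trade-off: the unique index is still roughly as large as the old primary key, and day-range queries now go through a secondary index instead of reading the clustered data in day order, so this mainly helps if you add other secondary indexes (which store the now-small primary key) or if insert order matters.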

I estimated the size of the key and data by picking a reasonable lookback window and multiplying the row size by the number of rows. I believe I should aim for enough memory to keep that working set in the buffer pool if I want to query against it efficiently. Is that the right approach?
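One way to sanity-check that estimate is to ask MySQL directly via information_schema (schema and table names here are assumptions; substitute your own):

```sql
select table_name,
       round(data_length  / 1024 / 1024) as data_mb,   -- clustered index (data)
       round(index_length / 1024 / 1024) as index_mb   -- secondary indexes
from information_schema.tables
where table_schema = 'mydb'    -- assumed schema name
  and table_name   = 'facts';  -- assumed table name
```

data_mb plus index_mb gives a rough figure for how much buffer pool the whole table would want; for a composite-primary-key InnoDB table most of it will show up under data_length, since the data is the clustered index.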

(I set innodb_buffer_pool_size to the maximum I could without making the machine swap; unfortunately there are other things running on it right now.)

Also, am I right in thinking I can monitor this with vmstat (bi/bo), which unfortunately doesn't work so well here because of the other things running on the machine, and also by watching the ratio Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests to estimate the percentage of cache misses?
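Those are real MySQL status counters, and they can be pulled like this:

```sql
show global status like 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_reads         = logical reads that had to go to disk
-- Innodb_buffer_pool_read_requests = all logical page read requests
-- miss rate ~ Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests
```

Since these are cumulative counters since server start, sample them twice and compute the ratio over the deltas to see the current miss rate rather than the lifetime average.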

Sorry for the large braindump; I appreciate any comments.


It looks like you’re off to a pretty good start. Depending on what queries that table needs to answer, you may benefit from the introduction of some summary tables. For example, if you’re doing something like

SELECT SUM(impressions) FROM mytable WHERE day BETWEEN x AND y AND advertiser_id = z;

You’re actually reading a ton of rows when you could be reading one row per day, and even fewer than one row per day if you had weekly, monthly, or yearly summaries. If you can identify the most common queries that require reading the most rows, and create summary tables for those, you can drastically decrease your footprint in the buffer pool.
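A sketch of such a summary table for that example query (the table name is hypothetical), keyed so a day range plus advertiser reads at most one row per day:

```sql
CREATE TABLE daily_advertiser_summary (
  day date NOT NULL,
  advertiser_id int NOT NULL,
  impressions bigint NOT NULL,
  partialimpressions double NOT NULL,
  PRIMARY KEY (day, advertiser_id)
) ENGINE=InnoDB;

-- initial population from the granular fact table
INSERT INTO daily_advertiser_summary
SELECT day, advertiser_id,
       SUM(impressions), SUM(partialimpressions)
FROM mytable
GROUP BY day, advertiser_id;
```

The example query then becomes `SELECT SUM(impressions) FROM daily_advertiser_summary WHERE day BETWEEN x AND y AND advertiser_id = z`, which touches one short row per day instead of every matching fact row.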

Of course, you’re trading disk space and maintenance cost for this, but it’s definitely worth it. I also recommend doing your best to load data into MySQL that is already summarized or semi-summarized, rather than loading very granular data into a table and using it to populate your summary tables. Obviously you need to do the work somewhere, but it’s better if it’s done offline on a non-database machine.
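If you do end up maintaining summaries from the granular table, you can at least make the refresh incremental. A sketch, assuming a hypothetical summary table `daily_advertiser_summary (day, advertiser_id, impressions, partialimpressions)` with primary key `(day, advertiser_id)`:

```sql
-- re-aggregate only the most recent day, idempotently
INSERT INTO daily_advertiser_summary
  (day, advertiser_id, impressions, partialimpressions)
SELECT day, advertiser_id,
       SUM(impressions), SUM(partialimpressions)
FROM mytable
WHERE day = CURDATE() - INTERVAL 1 DAY
GROUP BY day, advertiser_id
ON DUPLICATE KEY UPDATE
  impressions        = VALUES(impressions),
  partialimpressions = VALUES(partialimpressions);
```

Because the statement is idempotent for a given day, it's safe to re-run if a load is repeated or late rows arrive for that day.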

Hope that helps.