I am trying to optimise aggregation of a large transactional data set and am looking for any hints or tricks that would help me bring the query time down.
The data is stored in a fairly standard star schema – one or more fact tables with associated dimension tables. There are a few very minor tweaks designed to minimise the need for joins or use of functions. For instance, we’ve broken the timestamp up into separate integer columns for the date, hour, minute, and second.
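For reference, the fact table layout is roughly along these lines (names and types simplified for illustration, not our exact DDL; MyISAM assumed, which is what the key-buffer tuning below implies):

CREATE TABLE raw_transaction_fct (
    user_id          INT UNSIGNED       NOT NULL,
    date_id          INT UNSIGNED       NOT NULL,  -- integer date key, e.g. YYYYMMDD
    hour_id          TINYINT UNSIGNED   NOT NULL,
    minute_id        TINYINT UNSIGNED   NOT NULL,
    second_id        TINYINT UNSIGNED   NOT NULL,
    transaction_type SMALLINT UNSIGNED  NOT NULL,  -- dimension key; no join needed to aggregate
    value_amt        DECIMAL(12,2)      NOT NULL
) ENGINE=MyISAM;  -- deliberately unindexed: aggregation is a full scan anyway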
The challenge is that the primary fact table contains 7.2 billion records covering one month’s worth of data, and we’re trying to roll it up at a variety of temporal scales (hourly, weekly, and monthly). Since aggregation involves a full table scan, we’ve avoided building indexes on the raw transactional table. Fortunately, we also don’t need to join to any of the dimensions during aggregation, so at this point it’s purely a case of optimising the GROUP BY query and the I/O of the database/system itself.
We are running the database on an 8-core Mac Pro with 4GB of RAM and a 1TB RAID array, under OS X 10.5 (Leopard). I have made two changes to the server’s default variables for the purpose of running the queries (annotated briefly below the list):
[LIST=1]
[*] SET GLOBAL key_buffer_size=1500000000;
[*] SET sort_buffer_size=1500000000;
[/LIST]
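For context on what those buy us: key_buffer_size is the server-wide cache for MyISAM index blocks, and sort_buffer_size is allocated per session for each sort pass, which the GROUP BY relies on given the lack of indexes. Annotated, with a check that the values took effect:

-- server-wide cache for MyISAM index blocks
SET GLOBAL key_buffer_size = 1500000000;
-- per-session buffer used by filesort (what GROUP BY falls back to without an index)
SET sort_buffer_size = 1500000000;
-- confirm the running values
SHOW VARIABLES LIKE '%buffer_size';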
For the duration of the aggregation process this system will not be doing anything else that is memory or disk-intensive.
One aggregation query currently looks something like:
INSERT INTO daily_transaction_fct
    (user_id, date_id, transaction_type, value_amt, transaction_amt)
SELECT SQL_BIG_RESULT
    user_id, date_id, transaction_type, SUM(value_amt), COUNT(user_id)
FROM raw_transaction_fct
GROUP BY user_id, date_id, transaction_type;
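The coarser rollups follow the same pattern. For example, assuming a YYYYMMDD-style date key (table and column names here are illustrative), a monthly version could look like:

INSERT INTO monthly_transaction_fct
    (user_id, month_id, transaction_type, value_amt, transaction_amt)
SELECT SQL_BIG_RESULT
    user_id,
    date_id DIV 100 AS month_id,  -- YYYYMMDD -> YYYYMM, assuming that key format
    transaction_type,
    SUM(value_amt),
    COUNT(user_id)
FROM raw_transaction_fct
GROUP BY user_id, month_id, transaction_type;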
Running
mysqladmin extended -i 50 -r | grep Handler_read_rnd_next
gives me output along the lines of:
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 10574324 |
| Handler_read_rnd_next | 15793446 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 5111679 |
| Handler_read_rnd_next | 21256091 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 7350900 |
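The same counter can also be polled from within a client session using plain SHOW STATUS:

-- cumulative count of "read next row" requests, i.e. rows touched by sequential/full scans
SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';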
I’m not yet sure what level of compression (i.e., how much the row counts shrink) we’ll get at the varying time scales, so I’m not yet thinking about what tweaks we’ll have to apply when running joins against the database.
Any thoughts/suggestions would be very much appreciated.
Best,
jon