Optimising config for aggregating very large (7.2 billion records) tables

I am trying to optimise aggregation of a large transactional-type data set and am looking for any hints or tricks that would help me bring the query time down.

The data is stored in a fairly standard star schema – one or more fact tables with associated dimensional tables. There are a few very minor tweaks designed to minimise the need for joins or use of functions. For instance, we’ve broken up the timestamp into integers for the date, hour, minutes, and seconds.
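
To give a rough idea, a stripped-down sketch of that kind of layout looks something like this (the column names and types here are illustrative rather than our exact schema):

CREATE TABLE raw_transaction_fct (
  user_id          INT UNSIGNED      NOT NULL,
  date_id          INT UNSIGNED      NOT NULL,  -- e.g. 20090406
  hour_id          TINYINT UNSIGNED  NOT NULL,
  minute_id        TINYINT UNSIGNED  NOT NULL,
  second_id        TINYINT UNSIGNED  NOT NULL,
  transaction_type SMALLINT UNSIGNED NOT NULL,
  value_amt        DECIMAL(12,2)     NOT NULL
) ENGINE=MyISAM;
-- the timestamp is decomposed into integer columns so that grouping
-- by day or hour never needs a date function or a join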

The challenge is that the primary fact table contains 7.2 billion records covering one month’s worth of data, and we’re trying to roll it up at a variety of temporal scales (hourly, weekly, and monthly). Since aggregation involves a full table scan, we’ve avoided building indexes on the raw transactional table. Fortunately, we also don’t need to join to any of the dimensions during aggregation, so at this point it’s purely a case of optimising the GROUP BY query and the I/O of the database/system itself.

We are running the database on an 8-core Mac Pro with 4GB of RAM and a 1TB RAID array running OS X 10.5 (Leopard). I have made two changes to the server’s default variables for the purpose of running the queries:

[LIST=1]
[*] set global key_buffer_size=1500000000
[*] set sort_buffer_size=1500000000
[/LIST]
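
For reference, the equivalent my.cnf entries, with the same values, would be:

[mysqld]
key_buffer_size  = 1500000000
sort_buffer_size = 1500000000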

For the duration of the aggregation process this system will not be doing anything else that is memory or disk-intensive.

One aggregation query currently looks something like:

INSERT INTO daily_transaction_fct (user_id, date_id transaction_type, value_amt, transaction_amt)
SELECT SQL_BIG_RESULT user_id, date_id, transaction_type, sum(value_amt), count(user_id)
FROM raw_transaction_fct
GROUP BY user_id, date_id, transaction_type;

Running
mysqladmin extended -i 50 -r | grep Handler_read_rnd_next
gives me output along the lines of:

| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 10574324 |
| Handler_read_rnd_next | 15793446 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 5111679 |
| Handler_read_rnd_next | 21256091 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 26367478 |
| Handler_read_rnd_next | 7350900 |

I’m not yet sure what level of compression we’ll get with the varying time scales, so I’m not yet thinking about what tweaks we’ll have to apply when running joins against the database.

Any thoughts/suggestions would be very much appreciated.

Best,

jon

If you are only using your fact table to perform aggregation, and you never write regular reporting queries against that fact table, then the best optimization I can suggest is to get rid of your fact table.

Consider what you’ve got. An unindexed table isn’t really any different than a log file. There’s nothing that MySQL is doing for you in terms of organizing disk based storage if you’re not using indexes. As a matter of fact, unless your data started out in some bloated XML file, you are actually increasing the physical size of the data by loading it into MySQL. That goes double (or three times) if you’re using InnoDB.

The only thing MySQL is doing for you is performing your aggregate operations. If you could do this yourself on the original data, you will at the very least get back the time spent loading data into MySQL.

Pre-processing has a number of advantages. First, the work doesn’t need to be done on your database machine. This will free up disk space, CPU cycles and eliminate the cache pollution caused by summarization. Second, you can make intelligent decisions about how to process the data that MySQL can’t. If you recognize data orthogonality and can introduce parallelization you will be able to out-perform MySQL by a large margin.

There are also a lot of tools out there to help you do this kind of work, most notably Hadoop and Amazon’s EC2. Simple UNIX tools like grep, cut, sort, and uniq can do quite a bit of work for you too.
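
As a trivial sketch of the UNIX route, assuming a tab-delimited dump of the fact table with user_id, date_id and transaction_type in the first three fields and the value amount in the fourth (the file name and column layout are just my assumption), the daily rollup is basically a sort piped into awk:

# GNU sort does an external merge sort, so this works on files far bigger than RAM
sort -t "$(printf '\t')" -k1,1n -k2,2n -k3,3n raw_transactions.tsv \
  | awk -F '\t' -v OFS='\t' '
      {
        key = $1 OFS $2 OFS $3
        if (key != prev) {
          if (NR > 1) print prev, sum, cnt    # emit the finished group
          prev = key; sum = 0; cnt = 0
        }
        sum += $4; cnt++
      }
      END { if (NR > 0) print prev, sum, cnt }
    ' > daily_transaction_fct.tsv

And because the input can be split by date beforehand, each chunk can be farmed out to a separate core or machine.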

Clearly every data set is different, but I feel like I’ve gotten a lot of mileage out of this approach. All you need to do is put in the work.

I hope that helps.

Well, if the query is going to spill to disk because it’s going to be so large, I would suggest mounting a temp directory directly in memory so MySQL can do its temporary work there instead of on disk.

Otherwise, you might be better off breaking the import into smaller pieces, say by day, and then iterating over those pieces. If you are on 5.1, I would suggest partitioning by date_id, then adding an index on transaction_type if a single day is still too large.
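
Something along these lines, assuming 5.1 and a January data set (the ranges and partition names are just placeholders):

-- one partition per week-ish chunk of date_id; on a table this size the
-- ALTER means a full rebuild, so ideally do it before loading the data
ALTER TABLE raw_transaction_fct
  PARTITION BY RANGE (date_id) (
    PARTITION p01 VALUES LESS THAN (20090108),
    PARTITION p02 VALUES LESS THAN (20090115),
    PARTITION p03 VALUES LESS THAN (20090122),
    PARTITION p04 VALUES LESS THAN (20090129),
    PARTITION pmax VALUES LESS THAN MAXVALUE
  );

-- and only if a single day is still too big to scan comfortably:
ALTER TABLE raw_transaction_fct ADD INDEX idx_type (transaction_type);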

Thanks, vgatto and brcavnagh for your thoughts!

I certainly take vgatto’s point about using Unix utilities and that’s actually the approach I would normally employ when the data needs cleaning. However, in this case I was supplied with the db files (sans indexes) so it seemed to make the most sense to just keep working in MySQL rather than dumping, processing, and reloading.

I have set up both the source and aggregate tables with partitions (the source fact table has 5, the destination tables have 3 on the basis that they’ll be aggregated and so the partitions will remain at roughly a constant size). This not only seems to improve performance (even without indexing), but it also gets around an interesting bug in select count(*) where MySQL apparently assumes a 32-bit int.

brcavnagh – what did you mean about ‘mounting a temp directory to memory’? Are you suggesting creating an additional mount point to which I attach some extra storage? If so, can you outline the code that would perform this operation on OSX (since I’m fairly sure it doesn’t use /etc/fstab)?

Aggregating to the monthly level (so without the additional grouping on date) completed in 12 hours which is, to be frank, faster than I expect I could code something up, test it, and run it. It was actually a fairly pleasant surprise, and the machine isn’t in constant use so I can give MySQL free rein of the system resources.

I moved on to aggregating by date as well, and this query has been running for nearly 24 hours (I may have made a mistake by also trying to simultaneously start building an index). I may go back and have a look at Hadoop if I can’t get this query to finish in the next 24 hours.

Yeah the command will look like this

mkdir -p /tmpmysql
mount -t tmpfs -o size=2048M,mode=0777 tmpfs /tmpmysql

Then change the my.cnf, reboot, and test it out. The size of the temp fs will probably have to be larger than the size specified above, since we are talking about gigs of rows. But it should work and speed things up quite a bit.
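
The my.cnf side of it is just pointing tmpdir at the new mount, along the lines of:

[mysqld]
tmpdir = /tmpmysql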

I am not sure how much space you will require or how to automount this on OS X, but I found an article on sial.org that I will be reading up on later.

I really do suggest, though, breaking the query up into smaller pieces with a script. Large queries like this are problematic in that, if there is a problem, you can lose a day of work to an asinine issue that may be out of your control.
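
By smaller pieces I mean something like a plain shell loop that does one date_id at a time, so a failure only costs that day rather than the whole month (the database name and dates are obviously placeholders; with the table partitioned by date_id each pass only touches one partition):

#!/bin/sh
d=20090101
while [ "$d" -le 20090131 ]; do
  mysql mydb -e "
    INSERT INTO daily_transaction_fct
      (user_id, date_id, transaction_type, value_amt, transaction_amt)
    SELECT user_id, date_id, transaction_type, SUM(value_amt), COUNT(user_id)
    FROM raw_transaction_fct
    WHERE date_id = $d
    GROUP BY user_id, date_id, transaction_type;" || { echo "failed on $d" >&2; exit 1; }
  d=$((d + 1))
done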

Btw you do realize you are missing a ‘,’ after date_id?
INSERT INTO daily_transaction_fct (
user_id,
date_id
transaction_type,
value_amt,
transaction_amt
)

Sorry about that, I just read that you only have 4GB on that server. Yup, break that query up into smaller pieces or get more RAM. It definitely sounds like your project hardware is under-spec’d if this is what you need to do.

Yeah, we’re waiting for our new server to be upgraded to 32GB of RAM and 8TB of diskspace. My thinking was that this developer machine was underutilised and I’d still need time to build the tables and indexes so why not get a jump on it now?

Thank you for the /tmp code – that looks like something worth implementing on the production server as well.

jon

P.S. Sorry for the sloppy SQL re-write. The real query didn’t have this issue.

Well, I think what you have is fine; a 1TB Mac is a great machine even with just 4GB. The issue you are running into is that what you are doing is beyond the resources you have. I am positive that with some generous MySQL settings, the temp directory in memory, and iterating in smaller steps that fit in memory, you should be able to turn this into a 2-hour process instead of a 12-hour problem.

Even with the new server getting 32 gigs … a 32 GB select … I’m not sure what will happen even if the math works out.

Best

Have you tried using InnoDB and making your PRIMARY KEY on (user_id, date_id, transaction_type)?

You will still have to read the full dataset for your SELECT query, but InnoDB stores everything by PRIMARY KEY order, so your query should run about as fast as your disk(s) can go.

I might add an additional column to the end of your primary key if (user_id, date_id, transaction_type) is ever not unique.
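
In other words, something like this (the seq column is just there as a tie-breaker to keep the key unique, and the types are guesses rather than your actual ones):

CREATE TABLE raw_transaction_fct_innodb (
  user_id          INT UNSIGNED      NOT NULL,
  date_id          INT UNSIGNED      NOT NULL,
  transaction_type SMALLINT UNSIGNED NOT NULL,
  seq              BIGINT UNSIGNED   NOT NULL,   -- tie-breaker if the first three columns repeat
  value_amt        DECIMAL(12,2)     NOT NULL,
  PRIMARY KEY (user_id, date_id, transaction_type, seq)
) ENGINE=InnoDB;

Because InnoDB clusters the data on that key, the rows come off disk already in GROUP BY order.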

If you follow Mark’s advice, just be aware that changing the primary key on an InnoDB table of this size will take a significant amount of time and disk space. Any secondary indexes you have will also have to be altered to reflect the new primary key, and will increase in size correspondingly. If the table started out as MyISAM, you will also be increasing the size on disk by a factor of 2-3x when you switch to InnoDB. If you’re always needing to scan the entire table, this difference in size may offset any performance gain you get from the clustered index.

So, having done a little digging, it seems that OS X doesn’t support the tmpfs format. The closest you can get is a ramdisk via hdid, but that seems to have some severe performance penalties.
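
For the record, the ramdisk route (using hdiutil, which has superseded hdid) looks roughly like this; sizes are in 512-byte blocks, so 4194304 is about 2GB, and I haven’t benchmarked it, hence the hesitation:

# create and format a ~2GB ram disk, which ends up mounted at /Volumes/mysqltmp
DEV=$(hdiutil attach -nomount ram://4194304)
diskutil erasevolume HFS+ mysqltmp $DEV
# ...then point MySQL's tmpdir at /Volumes/mysqltmp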

I haven’t been able to determine whether fink or some other mechanism might allow me to extend the Darwin code base or kernel to support this method. Any ideas?

I considered InnoDB, but I felt that a) I didn’t need to worry about primary key integrity; and b) I wouldn’t usually care if the entire table was locked, as this is a write-once/read-multiple scenario. So MyISAM seems like the better approach for the time being.

Once the ‘real’ server comes online I have been considering whether to migrate to PostgreSQL since this transactional data has a geographic component that we want to explore in more detail (and PostGIS, with its ArcGIS support, would be a good way to do this). However, for the time being the performance of MySQL has been more than good enough with minimal tweaking.

Thank you, everyone, for your suggestions and thoughts.

[B]jreades wrote on Mon, 06 April 2009 12:26[/B]
I considered InnoDB, but I felt that a) I didn't need to worry about primary key integrity; and b) I wouldn't usually care if the entire table was locked, as this is a write-once/read-multiple scenario. So MyISAM seems like the better approach for the time being.

The major advantage InnoDB will give you is the ordering on disk by primary key. If you select an appropriate primary key, it can make some select queries very fast if they access the data by the same groupings. Of course, there are tradeoffs. InnoDB requires more tuning than MyISAM, but can actually perform much faster depending on the data usage patterns, even with inserts.

But in this particular case I don’t see any benefit in changing to InnoDB, since recreating it will, as vgatto mentioned, take a long time.

Mounting /tmp on a RAM disk is not something I think will benefit you either.
If your sort_buffer_size (which you have already configured) is large enough, MySQL will never create any temporary files in the /tmp directory, since they are kept internally in RAM in the sort buffer.

Some questions though:

  1. Are you going to perform more than one query like this on the same data?
    If you answer no, then the only possible way forward to speed this up is raw hardware power (with the possible exception of some read_buffer settings in MySQL and the OS), since any manipulation of the table structure will take about the same amount of time as the query itself.

If, on the other hand, you are going to perform this query more than once on the same data, you can instead create an index on the columns (user_id, date_id, transaction_type, value_amt). Since value_amt is part of the index, all of the columns used by the query are in the index, so MySQL doesn’t even have to read any data from the table; everything it needs exists in the index. Hence the entire query can be solved by a scan of the index.
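
That is, something like:

ALTER TABLE raw_transaction_fct
  ADD INDEX idx_grp_covering (user_id, date_id, transaction_type, value_amt);

-- EXPLAIN on the GROUP BY query should then show "Using index" in the Extra
-- column, i.e. the query is answered from the index alone

Building that index will of course itself take roughly one full pass over the table, which is why it only pays off if the query is run more than once.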

Alternatively, if you have a lot more columns in this data, you can use “myisamchk -R=yourIndex yourTable” to sort the entire table into the GROUP BY index order; the result is about the same as the InnoDB suggestion.
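
For completeness, that route would look roughly like this (paths and names are placeholders; note that --sort-records takes the index number, 1-based in SHOW INDEX order, and the server should not be touching the table while myisamchk runs):

mysql mydb -e "ALTER TABLE raw_transaction_fct
                 ADD INDEX idx_grp (user_id, date_id, transaction_type);"

# with idx_grp as the first (and only) index on the table, it is index number 1
myisamchk --sort-records=1 /path/to/datadir/mydb/raw_transaction_fct.MYI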