MySQL configuration for Data Warehouse, and OLAP


I found good articles on MySQL Performance blog, on how to configure MySQL for OLTP applications,
but I couldn’t find advices about OLAP configuration.

What are the key server parameters when configuring MySQL for a Data Warehouse?
For example, how much RAM should be allocated to key_buffer_size? Is it like OLTP, e.g. 50% ?
or How to configure Thread buffers?

Any ideas?

Running a data warehouse with MySQL can be tricky, but it is possible when you plan ahead of time and make some concessions where needed.

Best thing you can do is design your application or tooling to do as much processing outside of the DB as possible.

With data warehousing, you often end up with a lot more temp tables with the queries that people and analysis tools tend to write. So for that, you either need a lot of memory to hold larger and more numerous temp tables so they do not hit the disk, and/or you need a very fast disk (i.e. SSD or some other flash solution) in order to handle the temp tables on disk.

Keep your datatypes as small as possible to reduce the memory requirements. Meaning do not use a big int when a tiny int will do, avoid blobs, use enums where appropriate, etc.

Start archiving early if possible to prevent data build up unless it is required to keep everything. Querying large tables will quickly compound any issues that exist, such as poor hardware performance, bad schema design, large data quantities, etc. Build smaller data marts to handle specific use cases so you do not have one giant “data warehouse” that does everything, as that is a losing battle in most cases.

Use replicated slave DBs to handle reads if possible; that will help out a ton when done properly. This will help spread out the memory / disk requirements for the large temp tables mentioned earlier.

Where you put your memory is mostly the same for OLTP vs OLAP. Exceptions being with larger temps tables as mentioned, and you probably will end up allocating more to sorting but that is a tricky one to optimize as it can make performance worse. To start out, I would use the configuration wizard at to create a starting config. It has an OLAP option that should help you get started.

We’re creating a PHP-driven ETL script to populate fact/dimension tables in our data warehouse, once a day.
The OLTP soruce-system is in MySQL, and target OLAP is also in MySQL.

Do you suggest a “PHP-only, no-SQL” approach to handle data retrieval, processing, and aggregation?
(provided that PHP will run on the same server that MySQL is running, sharing system resources)?
What about Stored Procedures? or a mix approach? PHP+SQL?

In a similar PHP-based reporting tool, I saw a code once, where the developer was trying to aggregate a large data set,
without using GROUP BY in MySQL. His PHP solution was a 5-level nested foreach{} loop, iterating over a multi-dimensional array.
If we use pure PHP, we’ll end up with millions of rows in Arrays, which I’d think requires custom indexing algorithms, too.

Perhaps I did not get your point correctly, but I’m smelling that pure-PHP will make the ETL-script too complicated,
because of re-inventing some parts of the wheel (SQL).
Are there tools to achieve the same results, without making the code more difficult to maintain?

We’ve taken this into account. Sometimes, before loading data to memory, we map ENUM (2 byte) values to TINYINT (1 byte) to save memory.

We’ve recently started to take Archives, but sometimes, we need the entire Fact table, since the very beginning. Tables are huge, and currently, we’re using MERGE engine to split large tables into more tables. We didn’t use partitions, because it required the partition-key to be part of a unique index.

Yeah, we’re aiming to use Replication heavily.

I tried, it suggested me a number of 500 max connections, which is too high for OLAP, and while I prefer MyISAM, the wizard asks me for InnoDB settings. On the other hand, it suggested small values for tmp-table-size/max-heap-table-size (32 MB, for a 18GB server). For OLAP, I think 30 number of connections, with 128GB for tmp-table-size make more sense. I’m not sure if the wizard is OLAP-aware.

In case of data marts, which type do you suggest? Independent data marts? Dependent data marts?

With MySQL with Data Warehouse, what schema design do you recommend? denormalized? star schema? 3NF?

When it comes to processing data in the database or in your code, you just have to strike a balance that is practical. Filtering and grouping in the database should be fine in normal use cases; it’s when you start doing sub queries and more advanced SQL that you start to get bogged down. Just make sure to optimize your SQL statements up front and that will help in the long run. The basic idea is to get the data you need to work with from the database with the least reasonable amount of processing, and then doing the rest in your code.

As for the configuration tool, did you select the “OLAP” option? Either way, it’s just a starting point. Configuration plays an important role in performance, however configuration will only get you so far anyway. Things like your data model, queries, data volume, and system hardware will be what makes up the bulk of things you can tune for bigger performance gains.

As for what type of model, it generally works out to be a combination of those is the best. Going too normalized will hurt performance, so you want to again find the balance that works out best for your workload. A star / snowflake schema is what you would commonly see in a data warehouse, but again it would likely end up being a hybrid approach.

Generally I would use dependent data marts to reduce strain on the source systems. But in either case (dependent or independent), you would likely have the ETL read from a slave, so that would be minimal. If your subject areas for the data marts have a lot of shared dimensions, then the dependent model may make sense, as the creation of the dependent data mart would be pretty straight forward. If your subject areas have mostly non-shared dimensions, then it might make sense to make it an independent data mart so that the specific ETL logic for that subject area does not need to be included in the main data warehouses’ ETL for it to then be pulled later by the dependent data mart.

It’s hard to give any more specific answers without it being a consulting gig, as this kind of project is pretty in depth and would require a lot of first hand knowledge of your data and goals. The best advice I can give is to not try to build a one-size-fits-all solution, as that almost never works out. Meaning focus on specific areas that answer specific questions (i.e. the data marts), as it will be much easier to manage performance at that point.

Hope that helps some at least. =)