Home Brew MySQL Query Analyzer - Advice Needed

Hey Guys,

I'm in the middle of designing a custom MySQL query analyzer to use on our own servers, and if I get it stable enough I'll also release it as open source.

I have done a simple Workbench mockup of the schema design, and I would like some advice on what I should watch out for and on changes that should/could be made to make things more efficient.

Explanation:
The "queries" table is the primary table in the whole diagram; it acts mainly as a "mapper" between itself and the databases, query_binds and query_templates tables, which are explained a little further down.

Tables Explained:

  • query_templates - This table is designed to store a query in a format with placeholders that match up to binds, for example:
    SELECT * FROM table WHERE column = '?', where the ? values are stored separately from the query so queries can be normalised (and aggregated).

  • query_binds - This table is designed to store all the data for the query templates, so a query_template plus a query_bind make up a query that can be run.

  • queries - This table stores each query's execution details and also acts as the primary mapper table that ties together the query_templates, query_binds and databases tables.

The rest of the tables should be fairly self-explanatory; a rough sketch of the core tables follows below.
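To give a rough idea, the core tables would look something like this (only a sketch; the column names are illustrative and the real definitions live in the Workbench diagram):

CREATE TABLE query_templates (
  id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  query_type_id  INT UNSIGNED NOT NULL,
  query          TEXT NOT NULL   -- normalised text, e.g. SELECT * FROM table WHERE column = '?'
);

CREATE TABLE query_binds (
  id                 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  query_template_id  INT UNSIGNED NOT NULL,
  bind_values        TEXT NOT NULL   -- sample values that slot into the '?' placeholders
);

CREATE TABLE queries (
  id                 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  query_template_id  INT UNSIGNED NOT NULL,
  query_bind_id      INT UNSIGNED NOT NULL,
  database_id        INT UNSIGNED NOT NULL,
  exec_time          DECIMAL(10,6) NOT NULL,
  num_rows           INT UNSIGNED NOT NULL,
  `timestamp`        DATETIME NOT NULL
);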

The Problem:
As you probably already know, the general query log (our initial source of query data) fills up amazingly quickly, and we will quickly have hundreds of millions of records in our queries table. My concern is that MySQL will be unable to handle this with the way I have designed the tables, but I do not know how to design it any better than the normalised state you see it in.

An example query (off the top of my head) may look like:

SELECT
    SUM(q.exec_time) AS total_exec_time,
    SUM(q.num_rows) AS total_num_rows,
    SUM(q.exec_time) / COUNT(q.exec_time) AS avg_exec_time,
    SUM(q.num_rows) / COUNT(q.num_rows) AS avg_num_rows,
    qtemp.query,
    qtype.name AS query_type,
    dbs.name AS database_name,
    MAX(q.timestamp) AS latest_run,
    MIN(q.timestamp) AS first_run
FROM queries AS q
LEFT JOIN query_templates AS qtemp ON q.query_template_id = qtemp.id
LEFT JOIN query_type AS qtype ON qtemp.query_type_id = qtype.id
LEFT JOIN databases AS dbs ON dbs.id = q.database_id
WHERE dbs.name LIKE '%example%'
  AND qtype.name LIKE '%example2%'
GROUP BY q.query_template_id
ORDER BY latest_run DESC

As you can see, it is important that all of these tables can be queried within the same statement, as WHERE and ORDER BY clauses will be applied, at the user's request, to several different pieces of information.

Any constructive feedback is welcome. This is only on the drawing board at the moment and I won't be starting development until another project has been completed, so I have plenty of time to alter things wherever necessary. I'm guessing partitions or separate tables would be a good solution, but I wouldn't know where to start.

Also, if anyone from Percona reads this: I bought your book a while back but haven't had a chance to read it yet. If this sort of thing is covered in it, would you be able to point me to the relevant section?

So you're monitoring multiple servers, which means you're in an environment that needs to distribute the workload across multiple servers. Even under-utilized low-end boxes could be handling 100 queries per second, no sweat. So from a single machine we're already talking 8,640,000 queries a day. Easily more than that on a more powerful and more active box, possibly even 5x-10x more. And based on your data model, you're assuming there are several servers. So it's not hard to get into a situation where your queries table is growing by 10-100 million rows a day.

That’s clearly a problem. Personally, I’m not going to use a monitoring tool that requires more storage than the production system it monitors.

Also, consider that you now have to do an insert on one server for every select that runs on several others. Even if you're batching them up, you're probably going to struggle to keep up with the incoming query logs. You'll get behind and probably stay behind, unless you have lulls in the query load which allow you to catch up.

Lastly, if you manage to load and store the data somehow, your reporting queries will take forever to execute. What was the slowest query last week? You need to pull practically a billion rows to answer that question, and that will take time, even if those rows are already in memory.
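To make that concrete: a "slowest query last week" report against the raw queries table (column names borrowed from your example query, so treat this as a sketch) has to crunch every execution row from the whole week, something like:

SELECT qtemp.query, MAX(q.exec_time) AS worst_exec_time
FROM queries AS q
JOIN query_templates AS qtemp ON qtemp.id = q.query_template_id
WHERE q.`timestamp` >= NOW() - INTERVAL 7 DAY
GROUP BY q.query_template_id
ORDER BY worst_exec_time DESC
LIMIT 10;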

So basically, if you're storing individual queries, you have a scalability nightmare. But you probably don't need to store the queries individually. It all depends on what questions you're trying to make it easier to answer. Some of those questions will be better answered using the log file directly. However, if you can anticipate some of the common questions, you can make this problem a lot easier. For example, if you're really interested in performance trends, maybe it's enough to aggregate each query's performance by day or week. If you only store those aggregates in the database, you've saved yourself quite a bit of storage and quite a bit of load time, and answering the questions you're interested in just got significantly faster.
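As a sketch (made-up table and column names, shaped around the columns in your example query), a daily roll-up could be as simple as:

CREATE TABLE query_daily_stats (
  query_template_id  INT UNSIGNED NOT NULL,
  database_id        INT UNSIGNED NOT NULL,
  stat_date          DATE NOT NULL,
  exec_count         BIGINT UNSIGNED NOT NULL,
  total_exec_time    DECIMAL(16,6) NOT NULL,
  max_exec_time      DECIMAL(10,6) NOT NULL,
  total_rows         BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (query_template_id, database_id, stat_date)
);

One row per template per database per day instead of one per execution; averages fall out as total_exec_time / exec_count, and the table stays at thousands of rows rather than hundreds of millions.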

This does require that you write a program to read the logs and aggregate them, but that is much easier to solve than all the problems I've mentioned. The downside, of course, is that you give up ad-hoc querying, but if you really understand the use case, the trade-off is well worth it.

Just for completeness' sake, you should take a look at a few other technologies which might be applicable to the problem you're trying to solve. The first is rrdtool, which is used by many open source monitoring tools to store and aggregate data (if you've seen NagiosGraph or Cacti, it's what they use). It's probably not what you're looking for, but you might want to check it out anyway.

Another thing to consider, dare I say it, is column-oriented databases like MonetDB or Vertica. I don’t have enough experience with either to say much more, but the concept behind the column store fits most warehousing problems fairly well.

That was a little long-winded, but I hope it helps.

Vgatto, this is primarily being made to mimic the functionality of the MySQL Query Analyzer ( http://www.paragon-cs.com/wordpress/wp-content/uploads/2009/02/query_analyzer_1.png ) (mysql.com is currently down from my location).

They use MySQL Proxy; I was thinking of using the general query log instead and parsing the queries from that.

Also, in order to get around problems such as:

[B]Quote:[/B]
What was the slowest query last week? You need to pull practically a billion rows to answer that question, and that will take time, even if those rows are already in memory.

Surely an index and a partition on the date (daily partitioning) would suffice for this task? MySQL.com must do something similar?
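Something like this is what I had in mind (only a sketch with illustrative dates; a scheduled job would add a new partition each day, and MySQL requires the partitioning column in every unique key, so the primary key on queries would have to include the timestamp column):

ALTER TABLE queries
  PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
    PARTITION p20090601 VALUES LESS THAN (TO_DAYS('2009-06-02')),
    PARTITION p20090602 VALUES LESS THAN (TO_DAYS('2009-06-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
  );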

The reason I have not been aggregating the data is, as you say, to keep the flexibility of the raw data.

The database will fill up quickly, as you say. We currently only have one server (the servers table is there to make it easy to expand in the future), but it is doing an average of ~1k queries per second.

Note: This would only ever be used temporarily in a production environment for debugging, never left permanently on; probably a couple of days at a time at most.

Also note: The query_binds table is only meant to store a limited amount of sample data for each query (and even that is capped, i.e. no values over a certain length would be stored, to deter the storage of BLOBs and the like). This data would be used so you could re-execute a query as a test at a later date.

If anyone knows of similar software to the MySQL Query Analyzer that doesn't require signing up for a $3k yearly fee, please let me know. (Note: I want a GUI, not the Percona tools, which are fantastic but not as easy to use.) If not, I'll keep on with this project after my current one.