Back to basics with a big table (65+ million records)

I have a large table (MyISAM, 65+ million records) built from the logs of a security product. Each log record contains:
Date, Time, IP address, Userid, Process name, File name, Event code, Sub code

I then run detailed queries against this data for research. The machine is not networked, so there is only ever one user: me. However, some of the queries I am using now have become very slow. I have added a column with the MD5 hash of the process name so that I can do faster queries on process name.

I want to start again, but I cannot see a logical way to split the table or produce other tables. Any advice? This must have been done before with things like web log tables? For instance, the URL field is very similar to my process field, hence the MD5 hash approach. (Process field, e.g. C:\Windows\System32\dhdfgg.dll )

I have several fast machines available, and I wondered how I could split the table across several servers. Depending on the query I am bound on CPU, reads, or both, so I was hoping I could split the work across several machines.
Thanks very much

It’s probably a little early to jump to multiple servers.

I’ve got some questions:

  • How long is too long? A few seconds? 10+ minutes?
  • What queries are you doing and what does the EXPLAIN say?
  • What does SHOW TABLE STATUS say for this table?
  • What does SHOW CREATE TABLE say for this table?

Regardless of those answers, I have some suggestions:

First, I’d shrink any columns you can shrink. Example: an IPv4 address fits in a 32-bit unsigned integer; if you’re storing it as CHAR(15), you’re wasting 11 bytes per row (700+ MB, in your case).
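As a sketch (table and column names here are illustrative), MySQL’s built-in INET_ATON()/INET_NTOA() functions do the conversion for you:

```sql
-- illustrative only: store the address as INT UNSIGNED (4 bytes)
-- and convert at the edges with MySQL's built-ins
SELECT INET_ATON('192.168.1.10');   -- 3232235786
SELECT INET_NTOA(3232235786);       -- '192.168.1.10'

-- e.g. when loading:
-- INSERT INTO logs (ip, ...) VALUES (INET_ATON('192.168.1.10'), ...);
```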

Instead of an MD5 hash on things like process name or file name, create a table for process names and another for file names, each with an auto-incrementing primary key. Then your table would become something like:

Date, Time, IP address, Userid, Process_name_id, File_name_id, Event code, Sub code

This makes aggregate operations much more efficient, but it will probably increase the cost of sorting by process name or file name.
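A minimal sketch of those lookup tables might look like this (names, sizes, and the UNIQUE key are assumptions on my part):

```sql
-- one lookup table per repeated text column; sizes are guesses
CREATE TABLE process_names (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    UNIQUE KEY uk_name (name)
) ENGINE=MyISAM;

CREATE TABLE file_names (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    UNIQUE KEY uk_name (name)
) ENGINE=MyISAM;
```

The UNIQUE key on name is what keeps each distinct value stored exactly once.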

Your goal should be a fixed-width table: no NULLs, no TEXT columns, no VARCHAR columns, etc. That lets MyISAM use its static (fixed-length) row format, which is the fastest possible layout.
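Putting those two suggestions together, the main table might look something like this. All column types here are guesses from your field list, not a definitive schema:

```sql
-- every column fixed-width and NOT NULL, so MyISAM can use its
-- static row format; all types below are guesses
CREATE TABLE logs (
    log_date        DATE NOT NULL,
    log_time        TIME NOT NULL,
    ip              INT UNSIGNED NOT NULL,      -- INET_ATON() form
    userid          INT UNSIGNED NOT NULL,      -- or a lookup id too
    process_name_id INT UNSIGNED NOT NULL,
    file_name_id    INT UNSIGNED NOT NULL,
    event_code      SMALLINT UNSIGNED NOT NULL, -- guess at the range
    sub_code        SMALLINT UNSIGNED NOT NULL  -- guess at the range
) ENGINE=MyISAM ROW_FORMAT=FIXED;
```

If userid is actually a string in your logs, it would get its own lookup table as well.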

I can give you more suggestions if you answer some of the questions above.

Thanks for the answers.

The queries vary but a starting one might be:

SELECT process, COUNT(userid) FROM table GROUP BY process

Without an index it takes 24+ hours, but adding an index takes a long time as well.
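One thing worth trying (column and table names taken from the query above): a composite index on both columns lets MyISAM resolve the whole aggregate from the index without touching the data file. Building it on 65+ million rows is a one-time cost; after that, the GROUP BY reads the index in order instead of doing a huge filesort. On the normalized schema it would be on the small integer process_name_id instead, which is far cheaper still.

```sql
-- composite index covering both columns the query references
ALTER TABLE `table` ADD INDEX idx_process_userid (process, userid);

-- with the index in place, the GROUP BY can walk the index directly
SELECT process, COUNT(userid)
FROM `table`
GROUP BY process;
```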

I like the idea of creating other tables for things like process name, but I am not sure how to do that, given the values are not necessarily unique. Would I need to maintain the ‘process’ table before I could populate the main log table? Or don’t I care, because of the auto-incrementing primary key in those tables?

So, add process names to the process table and then load the main table with a query that includes a select from the process table?
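Something like this, I assume (the ‘staging’ table holding the raw rows, and all the names here, are made up for illustration):

```sql
-- 1. add any process names not already present; a UNIQUE key on
--    process_names.name makes INSERT IGNORE skip duplicates
INSERT IGNORE INTO process_names (name)
SELECT DISTINCT process FROM staging;

-- 2. load the main table, translating each name to its id via a join
--    (file names would be handled the same way)
INSERT INTO logs (log_date, log_time, ip, userid,
                  process_name_id, event_code, sub_code)
SELECT s.log_date, s.log_time, INET_ATON(s.ip), s.userid,
       p.id, s.event_code, s.sub_code
FROM staging s
JOIN process_names p ON p.name = s.process;
```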