Tuning server for very large (250GB) data sets

Hey guys. I’ve got some large data sets at work and I’m constantly struggling with performance issues. I’m hoping you can help me analyze my MySQL configuration and offer some suggestions.

Here’s my setup: We store firewall log data for each of our clients. We split the data across four tables in a one-to-one mapping – one table (named ‘event’) stores the information that is common to every event, and the other three tables (‘event_detail’, ‘event_ip_detail’, and ‘event_icmp_detail’) hold the data that applies only to some of the events. The decision to split the data up was made before I got here, to save hard disk space for the records that don’t need those extra columns, and now we have to deal with joining the data back up whenever we want to use it. Eventually I want to redesign the schema and put it all in one table, but that’s not an option for now.

We’ve got seven database servers total, each with two Xeon processors (ranging from 2.0 to 3.2 GHz across the servers, all with HyperThreading enabled), 2.5GB RAM, and two 146GB 10k RPM SCSI disks joined together with LVM. All servers are running CentOS 4 (Linux), and the MySQL version on every server is 4.1.20-standard-log.

We store data for multiple clients, and our data is partitioned on a client level. Each client’s data is stored in a separate database. Each database has the same four tables inside, using InnoDB file_per_table. The larger clients (200-300 GB each) get a server to themselves, and the smaller ones (10-70 GB each) share one server.
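
(In case the on-disk layout matters: with innodb_file_per_table, my understanding is that each client database ends up as its own directory with one .ibd file per table, so for a hypothetical client database named client_foo it looks roughly like this:)

/var/lib/mysql/client_foo/event.ibd
/var/lib/mysql/client_foo/event_detail.ibd
/var/lib/mysql/client_foo/event_ip_detail.ibd
/var/lib/mysql/client_foo/event_icmp_detail.ibd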

Each of the tables in the larger databases has on the order of 300 million rows.

As far as read vs. write usage goes (SELECTs vs. INSERTs), ours isn’t clearly biased one way or the other like it is in some cases (webservers doing mostly SELECTs, for example). I suppose we do a little more inserting than selecting – we’re pretty much always inserting new events. We then have a series of scripts that run once an hour and once a day to calculate various statistics from the data. Most of the rest of our usage is based on those calculated statistics, so that’s about the last time the raw data itself is accessed until we delete it 45 days later.

Here are the relevant portions of my MySQL config from one of my servers. This one has 2GB of RAM and dual 3.2 GHz Xeons with 2MB cache.

skip-name-resolve
skip-locking
set-variable = max_relay_log_size=256M
set-variable = key_buffer=384M
set-variable = max_allowed_packet=25M
set-variable = table_cache=128
set-variable = sort_buffer=4M
set-variable = read_buffer_size=5M
set-variable = myisam_sort_buffer_size=16M
set-variable = tmp_table_size=768M
innodb_file_per_table
innodb_data_file_path = ibdata1:100M:autoextend
set-variable = innodb_log_files_in_group=3
set-variable = innodb_log_file_size=250M
set-variable = innodb_log_buffer_size=16M
set-variable = innodb_buffer_pool_size=1536M
set-variable = innodb_additional_mem_pool_size=20M
set-variable = innodb_lock_wait_timeout=1000
set-variable = innodb_flush_log_at_trx_commit=1

Any ideas? If there are any other stats that would be useful, let me know, and I can post them. Thanks a lot.

Dan

In what area are you struggling with performance - inserts, selects, updates? I assume you are saying that selects are a problem.

Can you give a sample query pseudo code that is a problem?

Usually, our SELECTs are relatively fast (considering the size of the tables we’re dealing with), but INSERTs and especially DELETEs are slow. Lately I’ve been struggling with two servers that have become an order of magnitude slower in both SELECTs and DELETEs and I have no idea why, but that’s probably a separate issue.

Basically, it’s DELETEs that cause me the most problems. We have an archive process that runs daily for each client – it SELECTs the oldest 24 hours’ worth of data from the database (in smaller blocks, of course, not all at once) and writes it to an XML file, then it DELETEs that same 24 hours’ worth of data from the database (in blocks of 50,000 rows). For the larger databases, where we have about 4-5 GB of data per 24 hours, the process can take 18-20 hours total (usually 4-6 hours for the reading portion, and then 10-15 hours for deleting). So it’s usually only just barely completing before the next day’s archive process starts running.

We’ve got a relatively large number of indexes on each table, since that was the only way to achieve decent performance on our SELECT queries, and I’m sure it’s those indexes that make the DELETEs take forever, but getting rid of any of them isn’t really an option, since that would just shift the slowness back onto the SELECT queries.

Here are the table definitions for my four tables:

CREATE TABLE event (
  event_id bigint(20) unsigned NOT NULL auto_increment,
  event_time datetime NOT NULL default '0000-00-00 00:00:00',
  host_address varchar(15) NOT NULL default '',
  facility varchar(20) NOT NULL default '',
  severity tinyint(3) unsigned NOT NULL default '0',
  message_id varchar(20) default NULL,
  vendor_id tinyint(3) unsigned NOT NULL default '0',
  message varchar(255) NOT NULL default '',
  insert_ts timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY (event_id),
  KEY idx__event__event_time__host_address__severity__event_id (event_time,host_address,severity,event_id),
  KEY idx__event__facility (facility)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE event_detail (
  event_id bigint(20) unsigned NOT NULL default '0',
  interface varchar(15) default NULL,
  protocol_id smallint(5) unsigned default NULL,
  traffic_type_id tinyint(3) unsigned default NULL,
  traffic_action tinyint(3) unsigned default NULL,
  traffic_dir tinyint(3) unsigned default NULL,
  policy_id varchar(15) default NULL,
  zone enum('UNKNOWN','SELF','TRUST','UNTRUST') default NULL,
  event_type enum('UNKNOWN','TRAFFIC','DEVICE','CTRL','ALARM','WEB') default NULL,
  PRIMARY KEY (event_id),
  KEY idx__event_detail__event_type__zone__traffic_action (event_type,zone,traffic_action)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE event_ip_detail (
  event_id bigint(20) unsigned NOT NULL default '0',
  src_ip varchar(15) default NULL,
  dst_ip varchar(15) default NULL,
  src_ip_desc varchar(25) default NULL,
  dst_ip_desc varchar(25) default NULL,
  src_port smallint(5) unsigned default NULL,
  dst_port smallint(5) unsigned default NULL,
  src_port_desc varchar(25) default NULL,
  dst_port_desc varchar(25) default NULL,
  PRIMARY KEY (event_id),
  KEY idx__event_ip_detail__src_ip__src_ip_desc (src_ip,src_ip_desc),
  KEY idx__event_ip_detail__dst_ip__dst_ip_desc (dst_ip,dst_ip_desc),
  KEY idx__event_ip_detail__dst_ip__src_ip (dst_ip,src_ip),
  KEY idx__event_ip_detail__src_ip__dst_port (src_ip,dst_port)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE event_icmp_detail (
  event_id bigint(20) unsigned NOT NULL default '0',
  icmp_type tinyint(3) unsigned default NULL,
  icmp_code tinyint(3) unsigned default NULL,
  PRIMARY KEY (event_id),
  KEY idx__event_icmp_detail__icmp_type__icmp_code (icmp_type,icmp_code)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

(It also wasn’t my idea to store IP addresses as VARCHAR(15)'s instead of INTEGER UNSIGNED. That’s something else I’d like to change, but there’s a lot of code that will have to be modified when I do that.)
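
For what it’s worth, the change I have in mind is roughly the following (the *_num column names are just placeholders, and on tables this size the backfill UPDATE would obviously have to run in batches rather than in one shot):

-- Add integer columns alongside the existing VARCHAR(15) ones (placeholder names)
ALTER TABLE event_ip_detail
  ADD COLUMN src_ip_num INT UNSIGNED DEFAULT NULL,
  ADD COLUMN dst_ip_num INT UNSIGNED DEFAULT NULL;

-- Backfill using MySQL's built-in dotted-quad/integer conversion functions
UPDATE event_ip_detail
SET src_ip_num = INET_ATON(src_ip),
    dst_ip_num = INET_ATON(dst_ip);

-- Reads convert back the other way
SELECT INET_NTOA(dst_ip_num) AS dst_ip
FROM event_ip_detail
LIMIT 10;

The nice part is that INET_ATON()/INET_NTOA() do the conversion on the MySQL side, so most of the application changes would be in the code that builds WHERE clauses against src_ip and dst_ip.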

As for a sample query, here’s an example of how we might obtain the list of most common destination hosts:

SELECT IP.dst_ip, COUNT(IP.event_id) AS dst_count
FROM event E USE INDEX (idx__event__event_time__host_address__severity__event_id)
  INNER JOIN event_ip_detail IP ON E.event_id = IP.event_id
  INNER JOIN event_detail D ON E.event_id = D.event_id
WHERE E.event_time BETWEEN 20070601000000 AND 20070602000000
  AND D.traffic_action = 0
  AND D.event_type = 'TRAFFIC'
  AND D.zone = 'TRUST'
  AND IP.dst_ip_desc IS NOT NULL
GROUP BY IP.dst_ip
ORDER BY dst_count DESC
LIMIT 10;

As I said, though, it’s really the DELETEs that are slow, and those queries are pretty uninteresting. Originally I was doing something like this:

DELETE event_ip_detail
FROM event E
  INNER JOIN event_ip_detail IP ON E.event_id = IP.event_id
WHERE E.event_time BETWEEN 20050101000000 AND 20070502000000
LIMIT 50000;

(repeat for event_icmp_detail and event_detail, then do a straightforward DELETE from event)
(repeat until affected_rows == 0)

But after some testing I determined that it was actually faster to avoid those three joins and just do this:

SELECT event_id
FROM event
WHERE event_time BETWEEN 20070501000000 AND 20070502000000
LIMIT 50000;

DELETE FROM event_icmp_detail WHERE event_id IN ($event_ids);
DELETE FROM event_ip_detail WHERE event_id IN ($event_ids);
DELETE FROM event_detail WHERE event_id IN ($event_ids);
DELETE FROM event WHERE event_id IN ($event_ids);

(repeat until we get no more events back from the SELECT)
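
(To make the $event_ids placeholder concrete, a single iteration looks roughly like this, except that the made-up ID lists below actually carry up to 50,000 values each:)

-- event_ids here are made up for illustration
DELETE FROM event_icmp_detail WHERE event_id IN (10000001, 10000002, 10000003);
DELETE FROM event_ip_detail WHERE event_id IN (10000001, 10000002, 10000003);
DELETE FROM event_detail WHERE event_id IN (10000001, 10000002, 10000003);
DELETE FROM event WHERE event_id IN (10000001, 10000002, 10000003);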

I had to raise my max query length to let me send a query with 50,000 event_id’s in it, of course, but it seems to work pretty well.

Does that help?