Performance optimisation for very large datasets


I’m developing a database backend for a network traffic analysis tool. It has to store information about every external network connection made on a company network (20000 pc’s, 300Mbits/sec of traffic). The data itself must be stored up to 1 month.

Inserts to the database are done in bursts of approx 50000 inserts every 15 seconds. There is only 1 client that performs inserts. There are only a few selects per second using only indexed colomns. There are never any modifications to the data… just many inserts and a few selects

In the end, the entire database is several 100’s of gigabytes and contains billions of rows.

Currently I have the following table:

CREATE TABLE netflow (
srcaddr int(10) unsigned NOT NULL default ‘0’,
dstaddr int(10) unsigned NOT NULL default ‘0’,
input smallint(5) unsigned NOT NULL default ‘0’,
output smallint(5) unsigned NOT NULL default ‘0’,
packets int(10) unsigned NOT NULL default ‘0’,
octets int(10) unsigned NOT NULL default ‘0’,
first int(10) unsigned NOT NULL default ‘0’,
last int(10) unsigned NOT NULL default ‘0’,
srcport smallint(5) unsigned NOT NULL default ‘0’,
dstport smallint(5) unsigned NOT NULL default ‘0’,
flags tinyint(3) unsigned NOT NULL default ‘0’,
proto tinyint(3) unsigned NOT NULL default ‘0’,
tos tinyint(3) unsigned NOT NULL default ‘0’,
KEY srcaddr (srcaddr),
KEY dstaddr (dstaddr),
KEY first (first),
KEY last (last)

Some statistics of a small table:
Keyname Type Cardinality Action Field
srcaddr INDEX 452858 srcaddr
dstaddr INDEX 769858 dstaddr
first INDEX 3849294 first
last INDEX 3849294 last

Space usage:
Type Usage
Data 270,654 KiB
Index 117,642 KiB
Total 388,296 KiB

Row Statistics:
Statements Value
Format fixed
Rows 7,698,589
Row length 36
Row size 52

The current hardware is a dual AMD opteron 240, 15K RPM SCSI disk of 300GB, 2GB RAM. Without using any optimisation (insert 1 row at a time, table stored on disk) the system gets disk-IO bound after a few hours.

What is best way to handle this large amount of data while minimizing disk access?

My ideas:

  • Get 16GB of RAM and create a table ‘netflowtmp’ on the ramdisk to do the inserts. This would not require any disk accesses.

  • What is the best/fastest way to insert into the database? (only 1 client does inserts)

  • Copy ‘netflowtmp’ to harddisk every hour for safety reasons (server crash…). What is the best way to do this? normal filecopy of the table files (what about table integrity?) or copy/move the table from the memory database to a disk database within mysql?

  • Use multiple tables (e.g. 1 table for each day of data)
    In order to minimize the disk usage, I would compress these tables with myisampack (and myisamchk -rq to rebuild the indexes). I have noticed that the table itself is compressed, but the index files are not? Are there better ways to minimize disk usage?

  • For the select, I would merge the daily tables to have it look like 1 big table. Any drawbacks to this?

All ideas/suggestions are very welcome )

With such insertion rate you surely need to avoid being IO bound, at least avoid random IO for BTREE updates.

If merge tables would work well for you depends on the queries - certain queries become much slower if merge tables are involved others not.

I would perhaps use combined approach. Something like merge table with one table per day plus in memory optimization for load process.

Something like this could be used - your load process which is activated each 15 seconds inserts data into the HEAP table using multi value insert. Two HEAP tables are used in circular fashion to act as insert buffer.

In the background script is running which does INSERT INTO last_merge_table SELECT * FROM inactive_heap_table.

This way you can minimize delay of data getting into on disk tables and yet have system adapt to spikes.

Make sure to use largre bulk_insert_buffer and key_buffer for optimal performance. You should try to have your daily merge table indexes to fit in memory.

Thank you for your reply.

I think I’m going to try this:

  • Make a ramdrive of 16 Gigabytes.

  • The main database of my application is bcrouter and is stored on disk, and a memory database bcrouter-tmpfs which resides totally on a tmpfs partition in memory.

  • The table in bcrouter-tmpfs is a normal myisam table with indexes (like in my first post).

  • Inserting of the data is done in bcrouter-tmpfs.netflow. At the moment, this is not a problem and the server can handle the 50000inserts/15sec fine.

  • Once a day, the insert-application renames bcrouter-tmpfs.netflow to bcrouter-tmpfs.netflow_todisk, and creates a new netflow table where it continues to insert.

  • After the rename, I do a copy of the .frm, .myd, .myi to the bcrouter database on disk (cp netflow_todisk.* /var/lib/mysql/bcrouter).

My guess is that this approach uses the least disk-IO possible.

It would also allow me to do much intenser data-mining on the data of the last few hours since both indexes and data are in memory.

Would there be a significant gain in performance if I use 64 bit OS?

Kind regards,

P.S. Somewhere in the process, I would also like to compress the tables. To compress it totally in memory, 16GB of ram is not enough. So I will test the compression first on disk to see if it is a bottleneck.