Hello,
I'm developing a database backend for a network traffic analysis tool. It has to store information about every external network connection made on a company network (20,000 PCs, 300 Mbit/s of traffic). The data must be kept for up to one month.
Inserts into the database are done in bursts of approximately 50,000 rows every 15 seconds, and only one client performs inserts. There are only a few selects per second, using only indexed columns. The data is never modified: just many inserts and a few selects.
In the end, the entire database is several hundred gigabytes and contains billions of rows (50,000 rows every 15 seconds is roughly 288 million rows per day, so around 8-9 billion rows per month; at 52 bytes per row that is on the order of 400-450 GB).
Currently I have the following table:
CREATE TABLE netflow (
  srcaddr int(10) unsigned NOT NULL default '0',
  dstaddr int(10) unsigned NOT NULL default '0',
  input smallint(5) unsigned NOT NULL default '0',
  output smallint(5) unsigned NOT NULL default '0',
  packets int(10) unsigned NOT NULL default '0',
  octets int(10) unsigned NOT NULL default '0',
  first int(10) unsigned NOT NULL default '0',
  last int(10) unsigned NOT NULL default '0',
  srcport smallint(5) unsigned NOT NULL default '0',
  dstport smallint(5) unsigned NOT NULL default '0',
  flags tinyint(3) unsigned NOT NULL default '0',
  proto tinyint(3) unsigned NOT NULL default '0',
  tos tinyint(3) unsigned NOT NULL default '0',
  KEY srcaddr (srcaddr),
  KEY dstaddr (dstaddr),
  KEY first (first),
  KEY last (last)
) TYPE=MyISAM PACK_KEYS=1 DELAY_KEY_WRITE=1;
Some statistics of a small table:

Indexes:
Keyname   Type    Cardinality   Field
srcaddr   INDEX       452,858   srcaddr
dstaddr   INDEX       769,858   dstaddr
first     INDEX     3,849,294   first
last      INDEX     3,849,294   last

Space usage:
Type    Usage
Data    270,654 KiB
Index   117,642 KiB
Total   388,296 KiB

Row statistics:
Statistic     Value
Format        fixed
Rows          7,698,589
Row length    36 bytes
Row size      52 bytes
The current hardware is a dual AMD Opteron 240 with a 300 GB 15K RPM SCSI disk and 2 GB of RAM. Without any optimisation (inserting one row at a time, table stored on disk), the system becomes disk-I/O bound after a few hours.
What is the best way to handle this amount of data while minimizing disk access?
My ideas:

- Get 16 GB of RAM and create a table 'netflowtmp' on a ramdisk to do the inserts. This would not require any disk access (see the first sketch after this list).

- What is the best/fastest way to insert into the database, given that only one client does inserts? (See the second sketch after this list.)

- Copy 'netflowtmp' to the hard disk every hour for safety reasons (server crash…). What is the best way to do this: a normal file copy of the table files (what about table integrity?), or copying/moving the table from the memory database to a disk database within MySQL? (See the third sketch after this list.)

- Use multiple tables (e.g. one table per day of data). To minimize disk usage, I would compress these tables with myisampack (and rebuild the indexes with myisamchk -rq). I have noticed that the data file gets compressed but the index files do not. Are there better ways to minimize disk usage?

- For the selects, I would merge the daily tables to make them look like one big table (see the last sketch after this list). Any drawbacks to this?
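
For idea 1, this is roughly what I have in mind; a sketch that assumes MySQL's MEMORY/HEAP engine rather than MyISAM files on an OS ramdisk, and the size value is only a placeholder:

-- Sketch for idea 1: a staging table held entirely in RAM.
-- max_heap_table_size must be large enough to hold an hour of bursts
-- (the value below is only an example).
SET max_heap_table_size = 1024 * 1024 * 1024;

CREATE TABLE netflowtmp LIKE netflow;   -- same columns and keys (MySQL 4.1+)
ALTER TABLE netflowtmp TYPE=HEAP;       -- keep it in memory instead of on disk
-- The indexes could probably be dropped from netflowtmp to speed up the inserts,
-- since the staging table is only read when it is flushed to disk.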
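
For idea 2, these are the two approaches I am considering for getting each 15-second burst in quickly; the row values and the file path are made up for illustration:

-- Sketch for idea 2 (a): multi-row INSERT, batching a few thousand rows
-- per statement instead of 50,000 single-row INSERTs.
INSERT INTO netflowtmp
  (srcaddr, dstaddr, input, output, packets, octets, first, last,
   srcport, dstport, flags, proto, tos)
VALUES
  (3232235777, 167772161, 1, 2, 10, 840, 1120000000, 1120000005, 32768, 80, 27, 6, 0),
  (3232235778, 167772161, 1, 2,  4, 240, 1120000001, 1120000006, 32769, 53,  0, 17, 0);
  -- ...remaining rows of the burst...

-- Sketch for idea 2 (b): write the burst to a tab-separated file and bulk load it.
LOAD DATA INFILE '/tmp/netflow_burst.txt'
  INTO TABLE netflowtmp
  FIELDS TERMINATED BY '\t';

My understanding is that both avoid most of the per-statement overhead, but I don't know which one wins on this hardware.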
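
For idea 3, this is the "within MySQL" variant of the hourly copy; 'netflow_20050712' is only an example name for the current day's on-disk table:

-- Sketch for idea 3: flush the RAM staging table to the day's MyISAM table
-- once an hour, instead of copying the .MYD/.MYI files at the filesystem level.
LOCK TABLES netflowtmp WRITE, netflow_20050712 WRITE;
INSERT INTO netflow_20050712 SELECT * FROM netflowtmp;
DELETE FROM netflowtmp;       -- empty the staging table for the next hour
UNLOCK TABLES;

The LOCK TABLES is there so the inserting client cannot add rows between the copy and the delete; whether that is the right way to guarantee integrity is part of my question.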
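
For ideas 4 and 5, one MyISAM table per day (packed with myisampack once the day is complete) plus a MERGE table over the daily tables for the selects; the daily table names below are examples:

-- Sketch for ideas 4 and 5: a MERGE table that makes the daily tables look
-- like one big table for selects. The column and key definitions must match
-- the underlying daily tables exactly; packed (read-only) tables are allowed.
CREATE TABLE netflow_all (
  srcaddr int(10) unsigned NOT NULL default '0',
  dstaddr int(10) unsigned NOT NULL default '0',
  input smallint(5) unsigned NOT NULL default '0',
  output smallint(5) unsigned NOT NULL default '0',
  packets int(10) unsigned NOT NULL default '0',
  octets int(10) unsigned NOT NULL default '0',
  first int(10) unsigned NOT NULL default '0',
  last int(10) unsigned NOT NULL default '0',
  srcport smallint(5) unsigned NOT NULL default '0',
  dstport smallint(5) unsigned NOT NULL default '0',
  flags tinyint(3) unsigned NOT NULL default '0',
  proto tinyint(3) unsigned NOT NULL default '0',
  tos tinyint(3) unsigned NOT NULL default '0',
  KEY srcaddr (srcaddr),
  KEY dstaddr (dstaddr),
  KEY first (first),
  KEY last (last)
) TYPE=MERGE UNION=(netflow_20050711, netflow_20050712) INSERT_METHOD=NO;

-- When a new day is added, or an old one expires after a month, the list
-- would be updated in place:
ALTER TABLE netflow_all UNION=(netflow_20050712, netflow_20050713);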
 
All ideas/suggestions are very welcome :)