Hello,
I’m developing a database backend for a network traffic analysis tool. It has to store information about every external network connection made on a company network (20,000 PCs, 300 Mbit/s of traffic). The data must be kept for up to 1 month.
Inserts into the database are done in bursts of approximately 50,000 rows every 15 seconds, and there is only 1 client that performs inserts. There are only a few selects per second, and they use only indexed columns. The data is never modified: just many inserts and a few selects.
In the end, the entire database is several hundred gigabytes and contains billions of rows.
Currently I have the following table:
CREATE TABLE netflow
(
  srcaddr int(10) unsigned NOT NULL default '0',
  dstaddr int(10) unsigned NOT NULL default '0',
  input   smallint(5) unsigned NOT NULL default '0',
  output  smallint(5) unsigned NOT NULL default '0',
  packets int(10) unsigned NOT NULL default '0',
  octets  int(10) unsigned NOT NULL default '0',
  first   int(10) unsigned NOT NULL default '0',
  last    int(10) unsigned NOT NULL default '0',
  srcport smallint(5) unsigned NOT NULL default '0',
  dstport smallint(5) unsigned NOT NULL default '0',
  flags   tinyint(3) unsigned NOT NULL default '0',
  proto   tinyint(3) unsigned NOT NULL default '0',
  tos     tinyint(3) unsigned NOT NULL default '0',
  KEY srcaddr (srcaddr),
  KEY dstaddr (dstaddr),
  KEY first (first),
  KEY last (last)
) TYPE=MyISAM PACK_KEYS=1 DELAY_KEY_WRITE=1;
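For context, a typical select would look something like this (a made-up but representative example; the address and date are invented, and the WHERE clause only touches the indexed columns):

-- Hypothetical lookup: all flows to one host during one day.
SELECT srcaddr, dstaddr, srcport, dstport, proto, packets, octets
FROM netflow
WHERE dstaddr = INET_ATON('192.0.2.10')
  AND first >= UNIX_TIMESTAMP('2004-11-16 00:00:00')
  AND first <  UNIX_TIMESTAMP('2004-11-17 00:00:00');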
Some statistics of a small table:

Indexes:
Keyname   Type    Cardinality   Field
srcaddr   INDEX   452858        srcaddr
dstaddr   INDEX   769858        dstaddr
first     INDEX   3849294       first
last      INDEX   3849294       last

Space usage:
Type    Usage
Data    270,654 KiB
Index   117,642 KiB
Total   388,296 KiB

Row statistics:
Statement    Value
Format       fixed
Rows         7,698,589
Row length   36 bytes
Row size     52 bytes
The current hardware is a dual AMD Opteron 240 with 2 GB of RAM and a 300 GB 15K RPM SCSI disk. Without any optimisation (inserting 1 row at a time, table stored on disk), the system becomes disk-I/O bound after a few hours.
What is the best way to handle this large amount of data while minimizing disk access?
My ideas:
- Get 16 GB of RAM and create a table 'netflowtmp' on a ramdisk to receive the inserts. This would not require any disk accesses (see the first sketch after this list).
- What is the best/fastest way to insert into the database? (Only 1 client does inserts; a multi-row INSERT sketch is below.)
- Copy 'netflowtmp' to the hard disk every hour for safety reasons (server crash…). What is the best way to do this: a normal file copy of the table files (what about table integrity?), or copying/moving the table from the memory database to a disk database within MySQL? (See the flush sketch below.)
- Use multiple tables (e.g. 1 table per day of data). To minimize disk usage, I would compress these tables with myisampack (and rebuild the indexes with myisamchk -rq). I have noticed that the table itself is compressed, but the index files are not? Are there better ways to minimize disk usage?
- For the selects, I would merge the daily tables so they look like 1 big table (see the MERGE sketch below). Any drawbacks to this?
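Sketch of what I have in mind for the ramdisk staging table (the /mnt/ramdisk path is just an example; a HEAP/MEMORY table would be the in-server alternative):

-- MyISAM staging table kept on a ramdisk via DATA/INDEX DIRECTORY
-- (requires symlink support in the server; path is hypothetical).
-- No indexes here, on the assumption that selects only hit the on-disk
-- tables; add the KEYs back if the staging data must be queried too.
CREATE TABLE netflowtmp
(
  srcaddr int(10) unsigned NOT NULL default '0',
  dstaddr int(10) unsigned NOT NULL default '0',
  input   smallint(5) unsigned NOT NULL default '0',
  output  smallint(5) unsigned NOT NULL default '0',
  packets int(10) unsigned NOT NULL default '0',
  octets  int(10) unsigned NOT NULL default '0',
  first   int(10) unsigned NOT NULL default '0',
  last    int(10) unsigned NOT NULL default '0',
  srcport smallint(5) unsigned NOT NULL default '0',
  dstport smallint(5) unsigned NOT NULL default '0',
  flags   tinyint(3) unsigned NOT NULL default '0',
  proto   tinyint(3) unsigned NOT NULL default '0',
  tos     tinyint(3) unsigned NOT NULL default '0'
) TYPE=MyISAM
  DATA DIRECTORY='/mnt/ramdisk'
  INDEX DIRECTORY='/mnt/ramdisk';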
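For the inserts, I am currently thinking of multi-row (extended) INSERT statements, or LOAD DATA INFILE if the collector can write a temporary file first. Something like this (all values are made up; addresses are stored as unsigned ints, i.e. INET_ATON form):

-- One statement per batch instead of 50,000 single-row statements.
INSERT INTO netflowtmp
  (srcaddr, dstaddr, input, output, packets, octets,
   first, last, srcport, dstport, flags, proto, tos)
VALUES
  (3232235777, 3221225994, 2, 3, 12, 9000, 1100563200, 1100563215, 34567, 80, 27,  6, 0),
  (3232235778, 3221225995, 2, 3,  4,  320, 1100563201, 1100563214, 34568, 53, 16, 17, 0);
-- ... a few hundred to a few thousand rows per statement,
-- bounded by max_allowed_packet.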
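For the hourly flush, one option is to do it inside MySQL instead of copying the .MYD/.MYI files of a table that is still being written to. A sketch (the daily table name is just an example):

-- Flush the RAM staging table into the on-disk table for the current day.
LOCK TABLES netflowtmp WRITE, netflow_20041116 WRITE;  -- keep readers out while copying
INSERT INTO netflow_20041116 SELECT * FROM netflowtmp;
DELETE FROM netflowtmp;   -- empty the staging table
UNLOCK TABLES;
-- Optionally: FLUSH TABLES netflow_20041116; to force the delayed keys to disk.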
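For the daily tables and the merged view, I was thinking of a MyISAM MERGE table along these lines (hypothetical table names; the column and key definitions must match the underlying daily tables exactly, and packed tables can be part of the union):

CREATE TABLE netflow_all
(
  srcaddr int(10) unsigned NOT NULL default '0',
  dstaddr int(10) unsigned NOT NULL default '0',
  input   smallint(5) unsigned NOT NULL default '0',
  output  smallint(5) unsigned NOT NULL default '0',
  packets int(10) unsigned NOT NULL default '0',
  octets  int(10) unsigned NOT NULL default '0',
  first   int(10) unsigned NOT NULL default '0',
  last    int(10) unsigned NOT NULL default '0',
  srcport smallint(5) unsigned NOT NULL default '0',
  dstport smallint(5) unsigned NOT NULL default '0',
  flags   tinyint(3) unsigned NOT NULL default '0',
  proto   tinyint(3) unsigned NOT NULL default '0',
  tos     tinyint(3) unsigned NOT NULL default '0',
  KEY srcaddr (srcaddr),
  KEY dstaddr (dstaddr),
  KEY first (first),
  KEY last (last)
) TYPE=MERGE UNION=(netflow_20041115, netflow_20041116) INSERT_METHOD=NO;
-- The MERGE table itself holds no data; adding the next day's table means
-- ALTER TABLE netflow_all UNION=(...) with the new list.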
All ideas/suggestions are very welcome. :)