I have a IP location lookup table with 20 million rows. I can perform about 1 sequential lookup per second and I need to get it up to about 50 per second.
CREATE TABLE `ip_lookup` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`start_ip_int` bigint(20) NOT NULL,
`end_ip_int` bigint(20) NOT NULL,
`country_name` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
`city_name` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
`custom_field1` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
`custom_field2` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
`custom_field3` varchar(30) CHARACTER SET latin1 DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `start_ip_int` (`start_ip_int`),
KEY `end_ip_int` (`end_ip_int`),
) ENGINE=InnoDB AUTO_INCREMENT=18245826 DEFAULT CHARSET=utf8;
I am processing a large number of historical web server logs. I can do queries in parallel (e.g. 10 processes all getting 1 query per second) but parallelising the log processing code is a much bigger task than I can handle right now.
Maxmind supply a compiled lookup tool and their database in binary format. Lookups using that are lightning fast.
How can I create something similar for an arbitrary dataset? I'd like to load the 20M rows into some magic binary indexer and get the code to make API calls to some daemon instead of MySQL lookups. I can't use Maxmind - the data I have is customised heavily.
A lookup is simply:
select country_name, custom_field1, custom_field2 from ip_lookup where start_ip_int >= inet_aton('74.125.28.99') and inet_aton('74.125.28.99') <= end_ip_int limit 1
Theoretically, I could split up the 20M rows by netblock and store them as text files on disk and have a lookup table in the code to tell it which folder to look in. But there must be a better way! I have Percona Server 5.5.28-29.2
I'd be very grateful for any advice.
Thanks!
Due to the nature of B-tree indices, all rows satisfying start_ip_int >= inet_aton(‘74.125.28.99’) are retrieved, and checked for inet_aton(‘74.125.28.99’) <= end_ip_int (or vice versa).
Is the inequality for start_ip_int correct? (>= instead of <=)
If ranges do not overlap, use this:
select country_name, custom_field1, custom_field2 from ip_lookup where start_ip_int <= inet_aton(‘74.125.28.99’) order by start_ip_int desc limit 1
Thanks for your response. You’re correct, it was supposed to be start_ip_int >= inet_aton(). Good suggestion to leave the end_ip_int out of the query but I just tried it and there was no performance improvement, and it introduced some incorrect results (presumably I need to sort the results which slow things down).
If I replace the MySQL lookup with an API call to the Maxmind binary database I get a massive performance increase (like 300/second). I didn’t think MySQL could compete with that for sequential lookups? As i’m trying to process a huge amount of web logs (probably over a million unique IPs), I was hoping for some way to create my own version of whatever Maxmind do in their binary… I can’t use maxmind because I have my own custom fields but i’d really like something that I can export the MySQL table into that magically speeds it up.
I’ve been messing around with Sphinx which is good for getting the hits off my busy MySQL server, but it didn’t give me the performance boost I need either.
The query should be as fast as maxmind and you can even include the restriction on end_ip if you get wrong results. The extra ORDER BY is the most important part (use ASC for >= instead of <=).
Check EXPLAIN whether the index on start_ip_int is used, and try FORCE INDEX if necessary.
You’re correct it needed the indexed forced… It’s given me about an 8x increase so thanks very much!
SELECT country_name, custom_field1, custom_field2 FROM ip_lookup FORCE INDEX (ix_natural) WHERE inet_aton("$ip") >= startip AND inet_aton("$ip") <= endip AND " LIMIT 1
Unfortunately it’s still way slower than hitting up something like the API to maxmind binary DB.