Hey,
Basically, what I have is several large tables (14 million records or more each); people send in smaller tables (usually around 20,000 to 50,000 records) and we flag their records that match ours.
Here is an example of the largest table (16 million records):
Table Layout:
DROP TABLE IF EXISTS dmd.gas;

CREATE TABLE dmd.gas (
  FORENAME varchar(20) NOT NULL,
  SURNAME  varchar(20) NOT NULL,
  ADDR1    varchar(30) NOT NULL,
  ADDR2    varchar(30) NOT NULL,
  ADDR3    varchar(30) NOT NULL,
  TOWN     varchar(30) NOT NULL,
  COUNTY   varchar(30) NOT NULL,
  POSTCODE varchar(8)  NOT NULL,
  URN      varchar(10) NOT NULL,
  PRIMARY KEY (POSTCODE),
  KEY SURNAME  (SURNAME),
  KEY ADDR1    (ADDR1),
  KEY FORENAME (FORENAME)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1;

Index stats:

Keyname   Type   Cardinality  Field
FORENAME  INDEX  42992        FORENAME
SURNAME   INDEX  142554       SURNAME
ADDR1     INDEX  2708531      ADDRESS1
POSTCODE  INDEX  902843       POSTCODE
A quick note on this: the URNs are used by the company that supplies the data. They're effectively useless to us, as our client data doesn't usually include a URN, so we're required to search on:
- POSTCODE
- First letter of FORENAME
- Complete SURNAME
- First 5 letters of address
This is due to the quality of the data we're sent.
Anyway, right now the query I'm using is:
CREATE TEMPORARY TABLE duplicates
  SELECT companytable.POSTCODE
  FROM companytable
  INNER JOIN GAS
    ON GAS.POSTCODE = companytable.POSTCODE
   AND LEFT(GAS.FORENAME, 1) = LEFT(companytable.FORENAME, 1)
   AND GAS.SURNAME = companytable.SURNAME
   AND LEFT(GAS.ADDR1, 5) = LEFT(companytable.ADDRESS1, 5);

UPDATE jewelry SET DUPE=1 WHERE POSTCODE IN (SELECT POSTCODE FROM duplicates);

DROP TABLE duplicates;

EXPLAIN output:

id  select_type  table         type  possible_keys      key       key_len  ref                          rows   Extra
1   SIMPLE       companytable  ALL   NULL               NULL      NULL     NULL                         80000
1   SIMPLE       GAS           ref   POSTCODE,SURNAME   POSTCODE  10       table.companytable.POSTCODE  11     Using where
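For what it's worth, MySQL's multi-table UPDATE would let you flag the matches in one statement, with no temporary table and no IN subquery (which older versions optimise poorly). A sketch only; I've written it against companytable with a DUPE column, whereas your UPDATE targets jewelry, so swap in whichever client table you're actually flagging:

-- Flag matching client rows in a single pass: no temp table, no IN subquery.
UPDATE companytable
INNER JOIN GAS
    ON GAS.POSTCODE = companytable.POSTCODE
   AND GAS.SURNAME  = companytable.SURNAME
   AND LEFT(GAS.FORENAME, 1) = LEFT(companytable.FORENAME, 1)
   AND LEFT(GAS.ADDR1, 5)    = LEFT(companytable.ADDRESS1, 5)
SET companytable.DUPE = 1;

One behavioural difference to be aware of: your current WHERE POSTCODE IN (...) flags every client row sharing a matched postcode, while the direct join only flags the rows that actually matched on all four criteria, so pick whichever behaviour you intend.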
Any help optimizing this query/table would be greatly appreciated; I've been going through the seminars and a million different sites on optimization.
The query does run fairly fast considering the record counts, but it can hang in the "Sending data" state while selecting the data for the temp table, and I don't feel it's quite up to the standard it could be.
On another note about the data: we're sent an amended version of the "suppression" tables monthly. They have the same layout but are usually slightly bigger (around 20,000 or so records bigger).
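As an aside on that monthly refresh: with MyISAM it's usually quickest to reload with the non-unique keys disabled and rebuild them in one pass afterwards, rather than maintaining every index row by row during the load. Just a sketch, assuming the amended table arrives as a flat file and fully replaces the old one; the path and delimiters below are placeholders, not your real feed:

-- Rebuild the suppression table from the monthly file.
-- '/data/gas_monthly.csv' and the delimiters are placeholders.
TRUNCATE TABLE dmd.gas;
ALTER TABLE dmd.gas DISABLE KEYS;   -- skip maintenance of non-unique indexes during the load
LOAD DATA INFILE '/data/gas_monthly.csv'
  INTO TABLE dmd.gas
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  (FORENAME, SURNAME, ADDR1, ADDR2, ADDR3, TOWN, COUNTY, POSTCODE, URN);
ALTER TABLE dmd.gas ENABLE KEYS;    -- rebuild the non-unique indexes in one pass

Note that DISABLE KEYS only affects the non-unique indexes, so the PRIMARY KEY on POSTCODE is still checked during the load.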
Anyway, recent ideas I’ve had are:
- Replace the LEFT() functions with prefix indexes, i.e. INDEX(column(n)) (see the sketch after this list)
- Partitioning the table (moving fields not checked against to another table for use when absolutely required)
- Possibly splitting the tables down into smaller tables
- Obviously tweaking server variables to fit such large queries
- Changing the hard drives to RAID 5 (for the read speed; when updating the smaller table it's usually no more than 1,000 records changed)
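On the first idea: MySQL won't use an index to satisfy an expression like LEFT(GAS.ADDR1, 5) = ..., and a plain prefix index such as KEY (ADDR1(5)) doesn't change that on its own. One workaround is to materialise the bits you actually search on into their own columns and put one composite index over the whole match. A sketch only; the helper column and index names (FORENAME_INITIAL, ADDR1_PREFIX, match_idx) are mine, and it assumes adding columns to the suppression table is acceptable:

-- Helper columns holding the searched-on prefixes, plus one composite
-- index covering everything the match joins on.
ALTER TABLE dmd.gas
  ADD COLUMN FORENAME_INITIAL char(1)    NOT NULL DEFAULT '',
  ADD COLUMN ADDR1_PREFIX     varchar(5) NOT NULL DEFAULT '',
  ADD KEY match_idx (POSTCODE, SURNAME, ADDR1_PREFIX, FORENAME_INITIAL);

-- Populate the helper columns (re-run after each monthly reload).
UPDATE dmd.gas
SET FORENAME_INITIAL = LEFT(FORENAME, 1),
    ADDR1_PREFIX     = LEFT(ADDR1, 5);

The join can then compare plain columns on the GAS side (GAS.ADDR1_PREFIX = LEFT(companytable.ADDRESS1, 5), GAS.FORENAME_INITIAL = LEFT(companytable.FORENAME, 1)), so the optimiser can use all four parts of match_idx per incoming row instead of just the POSTCODE index; the LEFT() calls on the small client table cost little since it's only scanned once. On the server-variable side, the main MyISAM knob for this kind of read-heavy join is key_buffer_size: the more of those indexes that fits in the key cache, the less the "Sending data" phase has to hit the disks.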
Again, thanks for any suggestions/help in advance!