Query Optimization (Large Results)

Hey,

Basically what I have is several large tables (14 million records or more each); people send in smaller tables (usually around 20,000-50,000 records) and we flag their records that match ours.

Here is an example of the largest table (16 million records):

Table Layout:

DROP TABLE IF EXISTS dmd.gas;

CREATE TABLE dmd.gas (
  FORENAME varchar(20) NOT NULL,
  SURNAME  varchar(20) NOT NULL,
  ADDR1    varchar(30) NOT NULL,
  ADDR2    varchar(30) NOT NULL,
  ADDR3    varchar(30) NOT NULL,
  TOWN     varchar(30) NOT NULL,
  COUNTY   varchar(30) NOT NULL,
  POSTCODE varchar(8)  NOT NULL,
  URN      varchar(10) NOT NULL,
  PRIMARY KEY POSTCODE (POSTCODE),
  KEY SURNAME  (SURNAME),
  KEY ADDR1    (ADDR1),
  KEY FORENAME (FORENAME)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1;

Keyname    Type   Cardinality  Field
FORENAME   INDEX  42992        FORENAME
SURNAME    INDEX  142554       SURNAME
ADDR1      INDEX  2708531      ADDR1
POSTCODE   INDEX  902843       POSTCODE

A quick note on this: the URNs are used by the company that supplies the data. They're effectively useless, as our client data doesn't usually have a URN, and we're required to search on:

  • POSTCODE
  • First letter of FORENAME
  • Complete SURNAME
  • First 5 letters of address
(due to the quality of the data we're sent)

Anyway, right now the query I'm using is:

CREATE TEMPORARY TABLE duplicates
SELECT companytable.POSTCODE
FROM companytable
INNER JOIN GAS
    ON GAS.POSTCODE = companytable.POSTCODE
   AND LEFT(GAS.FORENAME, 1) = LEFT(companytable.FORENAME, 1)
   AND GAS.SURNAME = companytable.SURNAME
   AND LEFT(GAS.ADDR1, 5) = LEFT(companytable.ADDRESS1, 5);

UPDATE jewelry SET DUPE=1 WHERE POSTCODE IN (SELECT POSTCODE FROM duplicates);

DROP TABLE duplicates;

id  select_type  table         type  possible_keys     key       key_len  ref                          rows   Extra
1   SIMPLE       companytable  ALL   NULL              NULL      NULL     NULL                         80000
1   SIMPLE       GAS           ref   POSTCODE,SURNAME  POSTCODE  10       table.companytable.POSTCODE  11     Using where

Any help optimizing this query/table would be greatly appreciated; I've been going through the seminars and a million different sites on optimization.

The query does run fairly fast considering the record counts, but it can hang on "sending…" while selecting the data into the temp table, and obviously I don't feel it's quite up to the standard it could be.
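One thing I've been wondering is whether the temporary table is needed at all. Assuming jewelry and companytable are the same client table (as in the example above), the flagging could be done in a single multi-table UPDATE; a rough sketch, untested against our data:

UPDATE jewelry
INNER JOIN GAS
    ON GAS.POSTCODE = jewelry.POSTCODE
   AND GAS.SURNAME = jewelry.SURNAME
   AND LEFT(GAS.FORENAME, 1) = LEFT(jewelry.FORENAME, 1)
   AND LEFT(GAS.ADDR1, 5) = LEFT(jewelry.ADDRESS1, 5)
SET jewelry.DUPE = 1;

That would save materializing the POSTCODE list and the extra IN (...) pass over it.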

On another note about the data: we're sent an amended version of the "suppression" tables monthly. They're the same layout but usually slightly bigger (around 20,000 or so records more).
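Since the layout never changes, one option for the refresh (just a sketch; dmd.gas_new is a hypothetical staging table) is to load the new file into a staging copy and swap it in atomically, so nothing ever queries a half-loaded table:

RENAME TABLE dmd.gas TO dmd.gas_old, dmd.gas_new TO dmd.gas;
DROP TABLE dmd.gas_old;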

Anyway, recent ideas I’ve had are:

  • Replace the LEFT() comparisons with prefix indexes, i.e. INDEX(column(n)) (see the sketch after this list)
  • Partitioning the table (moving fields not checked against to another table for use when absolutely required)
  • Possibly splitting the tables down into smaller tables
  • Obviously tweaking server variables to fit such large queries
  • Changing the hard drives to RAID 5 (for the read speed; when updating the smaller table it's usually no more than 1,000 records changed)
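On the first idea, a minimal sketch of what I mean (match_key is just a placeholder name): one composite index covering the columns we search on, with prefix lengths matching the LEFT() calls:

ALTER TABLE dmd.gas
  ADD KEY match_key (POSTCODE, SURNAME, ADDR1(5), FORENAME(1));

The POSTCODE and SURNAME equalities should resolve from that one index; what I'm not sure of is whether the optimizer will use the ADDR1(5) and FORENAME(1) parts while the query compares LEFT() results rather than the bare columns.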

Again, thanks for any suggestions/help in advance!

Although I’m not a pro, I think it would be useful to have the specs of your hardware, operating system and your current configuration so we can see how your buffers/variables are configured…
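With MyISAM the key buffer is usually the first thing to look at. Something like this shows what you're running with:

SHOW VARIABLES LIKE 'key_buffer_size';

and it's set in my.cnf, e.g. (illustrative only; the right value depends on your RAM and index sizes):

[mysqld]
key_buffer_size = 512M

On a 16-million-row MyISAM table you generally want the indexes you join on to fit in that buffer.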

At the moment I've got it sat on a test server, so I'm not expecting optimal performance from the hardware, but eventually, when we upgrade to enterprise, we'll be moving it into our main server room onto an IBM System x3850 and running the SQL with Perl remotely.

Same applies to the variables; they're default right now.

I'm really looking more toward optimization of the tables and queries. I'm sure there's a lot I can do; I'd just like advice from someone with perhaps more knowledge than me, and a little direction. As I said, it runs fairly efficiently as is, but I'm always looking for ways to improve my work.

Well, your company table has a join type of ALL, and I don't think you want that; you'd want it as specific as possible. Even so, it's only scanning 80,000 records, so that's not too bad, but from what I remember you don't want type=ALL, since that means a full table scan.
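Generally the fix is an index matching the columns the join filters on. As a guess until we see the table, something like:

ALTER TABLE companytable ADD INDEX (POSTCODE);

would at least give the optimizer the choice of driving the join from either side, though whether it helps depends on which side it decides to drive from.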

What does your companytable look like?

It's just the table I'm checking against; the format can and will change per client, but the table is currently:

CREATE TABLE company (
  BigBook  varchar(10) NOT NULL,
  BBOOK    varchar(6)  NOT NULL,
  TAB      varchar(10) NOT NULL,
  URN      int(30)     NOT NULL auto_increment,
  Title    varchar(5)  NOT NULL,
  forename varchar(25) NOT NULL,
  surname  varchar(25) NOT NULL,
  ADDRESS1 varchar(40) NOT NULL,
  ADDRESS2 varchar(40) NOT NULL,
  ADDRESS3 varchar(40) NOT NULL,
  ADDRESS4 varchar(40) NOT NULL,
  ADDRESS5 varchar(40) NOT NULL,
  ADDRESS6 varchar(40) NOT NULL,
  POSTCODE varchar(8)  NOT NULL,
  DOB      varchar(20) NOT NULL,
  PRIMARY KEY (URN)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=80002;

Bad practice to put the primary key on a URN, I know; I just quickly wrote the statement before I stumbled on this forum.

One obvious problem I can see is that using LEFT() stops MySQL from using the indexes on ADDR1 and FORENAME. I've thought about restricting the size of the index on each column to the equivalent lengths I'm using in LEFT() at the moment, but I wonder if it will still compare the whole value rather than use the prefix index.
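The other route I've been toying with is precomputing the compared fragments at load time, so the join becomes plain indexed equalities with no functions on the GAS side. A sketch (FORENAME_INITIAL, ADDR1_PREFIX and frag_key are hypothetical names):

ALTER TABLE dmd.gas
  ADD COLUMN FORENAME_INITIAL char(1) NOT NULL,
  ADD COLUMN ADDR1_PREFIX varchar(5) NOT NULL,
  ADD KEY frag_key (POSTCODE, SURNAME, ADDR1_PREFIX, FORENAME_INITIAL);

UPDATE dmd.gas
   SET FORENAME_INITIAL = LEFT(FORENAME, 1),
       ADDR1_PREFIX     = LEFT(ADDR1, 5);

The join would then compare those columns bare on the GAS side (keeping LEFT() on the client table's values is harmless; it's the looked-up side's columns that have to stay bare for the index to be used), and the whole match could run off frag_key. The cost is keeping the derived columns in sync when the monthly refresh comes in.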