3 way LEFT JOIN with GROUP CONCAT, eliminate using filesort

Hi everyone

I’m relatively new to MySQL and gradually learning new stuff, but I’m having a problem with a particular query. I’ve asked around in the PHP DevNet forums and it seems I should be getting much better performance on this query, as my tables are pretty small and the query’s not particularly advanced.
The structure is nicely normalised, something I’ve wanted to have for years. I’m gradually switching our company intranet subscribers database from filemaker to MySQL.
My server is Ubuntu Server 8.04 LTS, MySQL5, PHP5 and Apache2. On a pretty recent Intel Core 2 server with 4GB memory.

I’d really appreciate it if someone could give me some advice on this, as it’s been driving me crazy!

contacts has about 36,000 records
companies: 38,000
tags_data: 42,000
tags: 980
My total DB size is about 26MiB, so it’s tiny in comparison to many MySQL databases. I think this query should not be taking 32 seconds to complete!

Basically the tags table has various tags/classifications which can be applied to companies. So you can tag a company as MANUFACTURER, DISTRIBUTOR, SHOW EVENT ORGANISER… etc and there’s 980 different tags/classifications to choose from. Each company can have multiple tags applied, and the tags_data table stores which tags are applied to which company.

Here’s a quick diagram of part of the database which should help to visualise

green bold are the primary key columns. Red are the columns which link the data together.

Here’s the query I’m trying to run:

SELECT contacts.contact_name, companies.company_name, GROUP_CONCAT(tags.name ORDER BY tags.name ASC SEPARATOR ‘,’) AS classificationsFROM contactsLEFT JOIN companies USING(company_id)LEFT JOIN tags_data ON companies.company_id=tags_data.company_idLEFT JOIN tags ON tags_data.tag_id=tags.tag_idWHERE contacts.subscribed='VIP’GROUP BY contacts.contact_id

The syntax on that might be improved I guess, but I don’t know how to do that. That’s the limit of my MySQL knowledge really.

I’m trying to get contact_name, company_name and a column called classifications which is a comma separated list of the tags/classifications names that are selected for that particular company.
contacts.subscribed can have 3 different values: NULL, FREE or VIP

This query only returns 777 records, but I’ll be looking to return up to 20,000 with a variation of this query if I can get it running faster!
Initially this query took about 80 seconds, but I added a composite index (only just learnt about them) on the tags_data table (index called tags_companies) which reduced it to about 30 seconds.
I removed the ORDER BY or LIMIT from the query earlier as well, but it made very little difference.

Here’s the output of EXPLAIN on this query:

id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE contacts ref IX_subscribed IX_subscribed 15 const 1103 Using where; Using temporary; Using filesort1 SIMPLE companies eq_ref PRIMARY PRIMARY 4 dwcintranet.contacts.company_id 1 1 SIMPLE tags_data index company_id tags_companies 26 NULL 41930 Using index1 SIMPLE tags eq_ref PRIMARY PRIMARY 4 dwcintranet.tags_data.tag_id 1

So I think the problem is the Using filesort. I’ve tried taking the query apart to eliminate the filesort, and it seems it’s the 2nd join that’s causing the filesort: tags_data ON companies
But I don’t know what else I could index to sort this out

Any ideas/suggestions?
I would really appreciate some help on this, I’m convinced that this query shouldn’t be taking 30 seconds!

Thanks, Ben

Here’s the SHOW CREATE TABLE of the 4 tables

contacts

CREATE TABLE contacts ( contact_id int(11) NOT NULL auto_increment, company_id int(11) default NULL, address1_contact varchar(255) default NULL, address2_contact varchar(255) default NULL, address3_contact varchar(255) default NULL, business_card varchar(2) default NULL, contact_id_old int(20) default NULL, contact_name varchar(255) default NULL, county_state_contact varchar(100) default NULL, create_name varchar(60) default NULL, create_stamp int(10) default NULL, email_contact varchar(255) default NULL, email_contact_optout tinyint(1) default NULL, fax_contact varchar(255) default NULL, geo_contact varchar(255) default NULL, job_title varchar(255) default NULL, subscribed varchar(4) default NULL, mail_contact tinyint(1) default NULL, modify_count int(7) default ‘0’, modify_name varchar(60) default NULL, modify_stamp int(10) default NULL, note text, postcode_zip_contact varchar(50) default NULL, reg_stamp int(10) default NULL, tel_cell varchar(255) default NULL, tel_contact varchar(255) default NULL, tel_ext varchar(255) default NULL, tel_home varchar(255) default NULL, temp varchar(50) default NULL, town_city_contact varchar(100) default NULL, view_count int(7) default ‘0’, view_name varchar(60) default NULL, view_stamp int(10) default NULL, PRIMARY KEY (contact_id), KEY company_id (company_id), KEY contact_name (contact_name), KEY IX_subscribed (subscribed), KEY companies_contacts (company_id,contact_id), KEY contacts_companies (contact_id,company_id)) ENGINE=MyISAM AUTO_INCREMENT=37645 DEFAULT CHARSET=utf8

companies

CREATE TABLE companies ( company_id int(11) NOT NULL auto_increment, ad_lead tinyint(1) default NULL, address_priv tinyint(1) default NULL, address1 varchar(255) default NULL, address2 varchar(255) default NULL, address3 varchar(255) default NULL, advisory varchar(255) default NULL, advisory_name varchar(60) default NULL, advisory_stamp int(10) default NULL, agency_address varchar(255) default NULL, agency_company varchar(255) default NULL, agency_contact varchar(255) default NULL, agency_email varchar(255) default NULL, agency_fax varchar(255) default NULL, agency_id int(11) default NULL, agency_tel varchar(255) default NULL, budget_year1 varchar(3) default NULL, budget_year2 varchar(3) default NULL, company_name varchar(255) default NULL, county_state varchar(100) default NULL, create_name varchar(60) default NULL, create_stamp int(10) default NULL, currency varchar(3) default NULL, dir_contact varchar(255) default NULL, email text, email_optout tinyint(1) default NULL, email_priv tinyint(1) default NULL, fax varchar(255) default NULL, fax_priv tinyint(1) default NULL, geo varchar(2) default NULL, hanging_file tinyint(1) default NULL, iso_code varchar(20) default NULL, locked_name varchar(60) default NULL, locked_stamp int(10) default NULL, master_id int(11) default NULL, modify_count int(7) default ‘0’, modify_name varchar(60) default NULL, modify_stamp int(10) default NULL, postcode_zip varchar(50) default NULL, rep varchar(10) default NULL, tel varchar(255) default NULL, tel_priv tinyint(1) default NULL, town_city varchar(255) default NULL, vat_number varchar(20) default NULL, view_count int(7) default ‘0’, view_name varchar(60) default NULL, view_stamp int(10) default NULL, website text, PRIMARY KEY (company_id), KEY geo (geo), KEY company_name (company_name)) ENGINE=MyISAM AUTO_INCREMENT=39200 DEFAULT CHARSET=utf8

tags_data

CREATE TABLE tags_data ( tagdata_id int(11) NOT NULL auto_increment, tag_id int(11) default NULL, company_id varchar(6) default NULL, create_name varchar(60) default NULL, create_stamp int(10) default NULL, PRIMARY KEY (tagdata_id), KEY tag_id (tag_id), KEY company_id (company_id), KEY tags_companies (tag_id,company_id)) ENGINE=MyISAM AUTO_INCREMENT=42205 DEFAULT CHARSET=utf8

tags

CREATE TABLE tags ( tag_id int(7) NOT NULL default ‘0’, category varchar(255) default NULL, description varchar(255) default NULL, name varchar(255) default NULL, PRIMARY KEY (tag_id), KEY category (category), KEY name (name)) ENGINE=MyISAM DEFAULT CHARSET=utf8

I’d really appreciate it if someone could take a look through to see how I can speed this up!

Thanks so much, Ben

I think your problem is here:

1 SIMPLE tags_data index company_id tags_companies 26 NULL 41930 Using index

The previous line should produce a set of company ids used to join into tags_data. Unfortunately, MySQL is doing an index scan for every row produced from the previous step. Basically, MySQL is evaluating 41930 * 1103 rows to produce your result set. That’s over 40M, which explains why things are taking so long.

I think you can clear this up by eliminating the left join from tags_data into tags. Since this is a join table, you really shouldn’t have orphaned rows. If you do, they should be easy to clean up.

You may also want another covering index on tags_data, on (company_id, tag_id). This will be used in joins from companies to tags. Your existing index only helps when joining from tags to companies. Since you’re using MyISAM, and only indexes are cached, having (company_id, tag_id) instead of just company_id should help things move faster as well.

Thanks for the reply.
I figured that’s what was happening behind the scenes.

I think I need the LEFT JOIN from tags_data into tags, because I want to access the tags.name field in my GROUP CONCAT()

Do I still need that particular LEFT JOIN to accomplish that?

Cheers, B

Thanks for the reply.

I just changed that tags on tags_data.tag_id=tags.tag_id into a regular JOIN, rather than LEFT JOIN
And it ran much quicker, but it returned fewer results.

I do want to return companies even if they have no tags selected in tags_data. So I guess that has to stay a LEFT JOIN.

I added that extra covering index between tags_data and companies.
So now I have 2 on there: company_id, tag_id and tag_id, company_id

But the query takes about the same time.

Here’s an updated EXPLAIN output after adding that 2nd index, but all still as LEFT JOINS as before:

id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE contacts ref IX_subscribed IX_subscribed 15 const 1103 Using where; Using temporary; Using filesort1 SIMPLE companies ref PRIMARY,company_id_geo company_id_geo 4 dwcintranet.contacts.company_id 1 1 SIMPLE tags_data index company_id,companies_tags tags_companies 26 NULL 41930 Using index1 SIMPLE tags eq_ref PRIMARY,tag_id_name PRIMARY 4 dwcintranet.tags_data.tag_id 1

Any idea on what else I can try indexing or changing?
Would it be quicker to use subquery/subselect instead of 3 joins? I’ve never tried using those before so not sure if that’s possible.

Thanks for all your help so far!
Cheers, B

I’d be curious to see the results of this:

EXPLAIN SELECT companies.company_name, tags_data.*FROM companies LEFT JOIN tags_data USING(company_id)

If the plan for that query chooses the (tag_id, company_id) index, then I’d try running ANALYZE TABLE to make sure the index statistics are up to date.

I ran that EXPLAIN and here’s the output:

id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE companies ALL NULL NULL NULL NULL 36557 1 SIMPLE tags_data ALL company_id,companies_tags NULL NULL NULL 41930

So it looks like it’s using the company_id,tag_id index (companies_tags).

No index is being used according to that plan, which means that given how much data you have, MySQL thinks a table scan is faster than using the index. It could have used one of those indexes, but it decided not to.

See what the explain looks like for this:

SELECT tags_data.company_id, GROUP_CONCAT(tags.name ORDER BY tags.name ASC SEPARATOR ‘,’) AS classifications FROM tags_data INNER JOIN tags USING(tag_id) GROUP BY tags_data.company_id

If that’s not terrible (the product of the rows columns is not huge), you can try using it to eliminate the last left join.

SELECT contacts.contact_name, companies.company_name, company_tags.classificationsFROM contactsLEFT JOIN companies USING(company_id)LEFT JOIN ( SELECT tags_data.company_id, GROUP_CONCAT(tags.name ORDER BY tags.name ASC SEPARATOR ‘,’) AS classifications FROM tags_data INNER JOIN tags USING(tag_id) GROUP BY tags_data.company_id ) AS company_tagsON companies.company_id = company_tags.company_idWHERE contacts.subscribed=‘VIP’

However, I’m not sure this is a good idea, since that inner select doesn’t get any benefit from the restriction applied to contacts and therefor companies. It will always load all your tags, which might be fine if that’s the common case.

I also didn’t try either of these myself, so you might have to clean up any syntactic problems I have.

The other thing you should consider since you’re implementing tags and you’re already using MyISAM, is using fulltext search. You’d end up just adding a text column to companies that contains a comma separated list of tags and creating a fulltext index on that column.

The query you’re trying to write will then stop after the left join into companies, and your classifications will already be available to you without any grouping. As an added bonus, searching by tags, especially a combination of tags, can be done through the fulltext index instead of some awful looking query which joins the tags table over and over.

You’ll need to change your code to maintain this column, which is probably best done outside of SQL in whatever language you’re already using for your application.