Slow / Crashing query problem

Hello all,

I’m working on an e-mail marketing application and I’ve run into a problem, so I’m looking for some advice here.

The application has a CRM table that contains all the subscriber information:

id int(10) primary key
crmid char(20) index (A unique hash)
firstname char(80)
lastname char(80)
emailadres char(80)
optin tinyint(1)

Plus more fields, depending on what is added for the specific implementation, of course.

Now whenever an email is sent to a CRM record, I store this in a second table, ‘Send Emails’, which is structured as follows:

id int(10), primary key
emails_id int(10) index
crmid char(20) index
opened tinyint(1) index
clicked tinyint(1) index

So I have 100,000 records in my CRM table, and I have sent 10 emails to each of these CRM records, so I have 1,000,000 records in my ‘Send Emails’ table.

Besides that, we also store page statistics and keep track of which user clicked which link in which email on which date. We need to be able to filter on this too.

The problem is that I need to be able to make selections in the application. For example:
1: I want all CRM records that have received email 1
2: I want all CRM records that have received an email (regardless of which one)
3: I want all CRM records that have opened email 1 but have not received email 2
4: I want all CRM records that have opened email 1 but have not clicked in email 2

These are small examples, but there is no limit on the size of the selections. The results of these selections must be available instantly because we want to:
1: Be able to send e-mail to a specific selection
2: Generate user specific content on a webpage based on these selections.

For now I tried several options:
1: Using sub queries to filter the results
2: Using endless left joins
3: Chunking the filters and comparing the data in PHP

Even for simple selections, all of these options either drive the CPU to a load of 100% for minutes without a result or take a very long time to complete.

Does anyone have advice on how to tackle this, or is this basically insane? Or is the solution to just throw massive CPU power at the database server (right now we are testing on a development machine that can’t even handle the simple queries)?

In the “Send Emails” table, does the column “emails_id” reference an “email” table or is this just a number for the email?

Also, is there a reason you are not referencing the “id” column from the “CRM” table instead of the “crmid”? JOINing on a number (especially the Primary Key) will be much more efficient than a text-based column. Try changing this first!
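A minimal sketch of that change, assuming you add a numeric column to the send_emails table (the column and index names here are illustrative):

```sql
-- Add a numeric reference to crm.id (names are illustrative)
ALTER TABLE send_emails ADD COLUMN crm_id INT(10) NOT NULL DEFAULT 0;

-- Backfill it from the existing string reference
UPDATE send_emails AS se
JOIN crm ON crm.crmid = se.crmid
SET se.crm_id = crm.id;

ALTER TABLE send_emails ADD INDEX idx_crm_id (crm_id);

-- Joins now compare integers instead of 20-character strings
SELECT crm.*
FROM crm
INNER JOIN send_emails ON crm.id = send_emails.crm_id;
```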

What is the structure of your statistics table?

As for your queries…

  1. What does “received” mean? You sent it or the recipient opened it?
    SELECT crm.*
    FROM crm
    INNER JOIN send_emails
    ON crm.crmid = send_emails.crmid
    AND send_emails.emails_id = 1;
    (Once you switch to joining on the numeric id, change this to use crm.id = the matching column in send_emails.)
    If you want to know if the email was opened add “AND send_emails.opened = 1”.

  2. Same deal - what does “received” mean?
    SELECT crm.*
    FROM crm
    INNER JOIN send_emails
    ON crm.crmid = send_emails.crmid;

  3. For records that have opened email 1 but have not received email 2:
    SELECT crm.*
    FROM crm
    INNER JOIN send_emails AS SE1
    ON crm.crmid = SE1.crmid
    AND SE1.emails_id = 1
    AND SE1.opened = 1
    LEFT JOIN send_emails AS SE2
    ON crm.crmid = SE2.crmid
    AND SE2.emails_id = 2
    WHERE SE2.crmid IS NULL;

  4. Same as above, but change the second JOIN to an INNER JOIN and change the WHERE clause to “SE2.clicked = 0”.
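Spelled out, that fourth query would look something like this (same assumptions about table and column names as above):

```sql
-- Opened email 1, received email 2, but did not click in it
SELECT crm.*
FROM crm
INNER JOIN send_emails AS SE1
  ON crm.crmid = SE1.crmid
  AND SE1.emails_id = 1
  AND SE1.opened = 1
INNER JOIN send_emails AS SE2
  ON crm.crmid = SE2.crmid
  AND SE2.emails_id = 2
WHERE SE2.clicked = 0;
```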

If your test box isn’t very powerful that’s another issue. It sounds like you’re going to be storing lots of records - although your data is relatively compact. I think your biggest bottleneck right now is comparing string columns, so fix that table to use the numeric CRM id. You could also add a unique index to crmid in the CRM table, but the join should be on the Primary Key.

Lastly, do you know anyone with an 80 character first name or last name? Chopping those down to 20 characters would use a quarter of the space and make your temp tables (created behind the scenes) smaller as well. Smaller = better when it comes to performance, as more records can fit on each data page.


I don’t usually plug our consulting services, but this is a hard problem. I have worked on several variations of this kind of thing before. You would benefit a lot more from paid consulting… it is way too hard to address in a forum post. If you’re interested, submit the contact-us form on our site.

@xaprb: Totally understand that, thank you! I just wanted to see if the community could give me some directions and different views to see if I can figure it out first.

emails_id is indeed a reference to an email table.

All tables are in MyISAM.

My co-worker chose the crmid reference before I got involved in the project, but it has to do with the crmid being sent in the emails, since a regular id would be too easy to manipulate. Thank you for confirming that it’s more efficient to reference by numeric id.

The structure of the tracking-link statistics is as follows:

id int(10) primary key
send_emails_id int(10) index (reference to the send_emails table)
tracking_link_hash char(20) index (again a hash reference to the link hash in the tracking_links table)
date datetime

Received basically means that the e-mail has been sent successfully, so as a result a record is inserted in the ‘Send Emails’ table.

The reason for the char(80) field for a first/lastname:
Clients can build their own crm table. So besides some system specific fields the client is able to add fields through the interface and to keep it simple it’s restricted to:
normal text - results in a char(80) column
large text - results in a char(255) column
huge text - results in a text column
number - results in a float column

I see in your queries some things I haven’t tried so I’m definitely going to try that.
(I had only used crm.crmid = table.crmid as a join condition.)

Also, the selections I make are defined by the users, who use a simple form to click a selection together, so when it is executed I have to build the query dynamically with PHP based on the user’s input.

You can still use the crmid in the email without joining on it in the query. Think of it like a person’s social security number - it’s part of the record and a useful way to find someone but that’s not a good “key”.

If you only need to test for the existence of a record in the send_emails table then an inner join between it and the crm table will filter out anyone who hasn’t received an email. Adding the email id in the where clause will filter out anyone who hasn’t received that specific email. So, there’s your list. If you want people who received more than one email you’ll need to GROUP or use DISTINCT so the recipients aren’t duplicated (I like GROUP because I can use an aggregate like COUNT to see how many emails they received).
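As a sketch, the GROUP BY variant might look like this (aliases are illustrative):

```sql
-- Recipients who received more than one email, with a count per recipient
SELECT crm.crmid, COUNT(*) AS emails_received
FROM crm
INNER JOIN send_emails ON crm.crmid = send_emails.crmid
GROUP BY crm.crmid
HAVING COUNT(*) > 1;
```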

I don’t think “clicked” is totally necessary in the send_emails table if you have another table that tracks clicks, because you’re going to have to join that table to figure out what was clicked. However, if you typically just need to see whether anything was clicked, it’s OK to “roll up” the statistic to a true/false value (what you have).

Once again, your “tracking_link_hash” column should be using an id (preferably a number and not a string).

On your 80 character string columns, you can add an index on only a portion of the data (say, the first 20 characters) which can speed things up. Your data will still be too big for any real gains - unless (possibly) the index can supply the data without going back to the table.
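In MySQL, a prefix index on the first 20 characters looks like this (the index name is illustrative):

```sql
ALTER TABLE crm ADD INDEX idx_firstname (firstname(20));
```

The index stores only the first 20 characters of each value, so it stays small, but it can’t serve as a covering index for the full column.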

Dynamically building the SQL statement is OK because there aren’t many tables so the statements should look a lot alike. However, it might be just as easy to build the statements and then choose the correct one based on the user’s input. Debugging will be simpler and you can “fine tune” the queries.

I test all my complex queries on a relatively slow desktop machine first. Once the speed is acceptable I know it will be blazing fast once it’s on the server. If your test box crashes or is so slow that you can’t get a decent benchmark then I’d be looking at problems with the schema first, then the queries. Use a smaller data set first but, eventually, you need to make sure the thing holds up under realistic conditions.


Hey Troy,

Thank you so much for all this valuable information!
Even though I have not implemented all your suggestions yet, I already see a big increase in performance and no more hangups from the database server.

Glad I could help. This is certainly one of those situations where a few changes can make a large impact, but - to do it right - it might take a lot of changes for what appear to be minor improvements. A lot of times it will take a while to see the real impact (of doing it the right way).