I have a table of 30 million+ records, that I need to search using wildcards. Each record is a 7-character reference, and I need to be able to, for example, return all records that start with A and the last three characters are BCD.
I have tried using like:
SELECT ref WHERE ref LIKE ‘A___BCD’
and also splitting the reference into seven separate fields and using equality:
SELECT ref WHERE first = ‘A’ AND fifth = ‘B’ AND sixth = ‘C’ AND seventh = ‘D’
I’ve tried both these, with and without indexes on the relevant fields, but the queries are taking upwards of five minutes. (
What is the answer? Would FULLTEXT work on single word fields? Or should I be using flat text files and grep? Or something else?
If its taking you 5 minutes to process 30M rows, its probably because you are table scanning and not actually using an index. If you could provide the query and the results of EXPLAIN, I could try to make some suggestions.
Purely speculating, it could be that your restriction is not selective enough. Using your example, if 50% of the rows in the table have a ref than begins with ‘A’, then you shouldn’t expect an index to help you, because your query isn’t selective enough and a table scan will perform better than using the index. Basically, the index stops being helpful when it encounters the first wildcard in a left-to-right order, so anything you can do to increase the number of characters before the wildcard will help narrow down the result set. For example, if you know you are always querying for something in the form of X___XXX, that is to say you will always know the last three characters, then you could store the ref backwards and index it to have a more selective index than the original case.
Having separate columns shouldn’t be better than a single column in this case, and full text search won’t help you at all here.
id 1select_type SIMPLEtable exp_platestype rangepossible_keys registrationkey registrationkey_len 9ref NULLrows 1342824Extra Using where; Using index
and here’s one for the second version:
id 1select_type SIMPLEtable exp_platestype ALLpossible_keys NULLkey NULLkey_len NULLref NULLrows 1342824Extra Using where
I found that removing the INDEXes from the single-character fields speeded up the query significantly, to just over a minute, but that’s still too slow.
One idea I’ve come up with is to split the table into 36 different tables and store the references depending on the character (A-Z and 0-9) that they start with; and then also another 36 tables containing the references backwards, and use those if the first character is unknown.
I don’t think introducing more tables will help very much. Your query, which is essentially a LIKE ‘A%’, matches 1.3 million rows. Putting those 1.3 million rows into another table and changing that query to LIKE ‘%’ (because the ‘A’ is implied) will still end up matching 1.3 million rows. You need a way to reduce the amount of matches. Having two indexes, one on the original column and one on its reverse will probably help, as long as you either start or end the pattern with a few characters.
If you can have any possible combination of wildcards and know values, you’d need to create quite a few indexes to be able to cover all access paths. That will require quite a bit of disk space and probably effect key cache/buffer pool performance.
Thanks. I’ve used part of your suggestion - I now have 26 tables (one for each letter), containing the original string and its reverse, and then I match against whichever is the most appropriate depending on the first wildcard character in the search string.
Each table has between 800,000 and 2 million rows, and the queries are down to less than 2 seconds most of the time. )