How can FULLTEXT support Chinese search?

As per the title. Thanks very much.

See this:
http://mysql.com/doc/refman/5.0/en/cjk-faq-fulltext-searches.html

[B]Quote:[/B]
There is a simple problem with FULLTEXT: we need to know the end of a word. With Western writing this is rarely a problem because there are spaces between words. With Asian writing this is not the case. We could use half-good solutions, like saying that all Han characters represent words, or depending on (Japanese) changes from Katakana to Hiragana which are due to grammatical endings. But the only good solution requires a dictionary, and we haven't found a good open-source dictionary.

IMO, they are being lazy.

Speeple,

Thanks for the reply. If you know of a good parser for the Chinese/Japanese languages, let me know and I will pass it on to the MySQL development team. Note that MySQL can't include GPL code, as it is incompatible with the commercial license.

You can also try the Senna patches for MySQL full-text search:

[URL]http://qwik.jp/senna/install_en.html[/URL]

It supports full-text search for the Japanese and Chinese languages by using
[URL]http://mecab.sourceforge.jp/[/URL]

Sorry, I can't tell you much about them, as most of the documentation is in Japanese.

What I don’t understand is why MySQL isn’t shipped with PCRE support:
[URL]http://www.pcre.org/[/URL]

This would allow more powerful regex comparisons in MySQL (including Unicode support, which its REGEXP functions currently lack).

PCRE can also differentiate between Unicode letters, numbers and punctuation. This means rows of data could be pushed through the full-text indexer and only valid Unicode letters and numbers would be recorded (this includes support for Far East scripts, Arabic, etc.), e.g.:

function clean($input) {
    // Strip punctuation, control characters - everything except usable text:
    // keep only Unicode letters (\p{L}) and numbers (\p{N}); /u enables UTF-8 mode.
    return trim(preg_replace('/[^\p{L}\p{N}]+/u', ' ', $input));
}
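
For example, on a UTF-8 string (assuming a PHP build whose PCRE has Unicode property support; the input here is just an illustration):

echo clean("Hello, world! ★ 你好 42");
// prints: Hello world 你好 42
// Punctuation and the ★ symbol are stripped; letters (including the CJK
// ideographs) and digits are kept.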

However, the Unicode spec is free to use, so MySQL could implement their own library with less overhead, developed specifically for the database environment. They already have multibyte-safe functions, meaning they must be able to recognise individual Unicode characters - all you need to do is check back to the Unicode spec to see if each one is a valid letter/number and, if it's not, discard it when building the index.

In the current full-text indexer you can index Unicode characters such as: ★

You can then search for “★” and it will match the row!

It shouldn't be like this, because in the Unicode spec ★ is not a valid letter/number. Do a search for "★" on Google, Yahoo! or MSN and notice how they don't recognise it.

You can also search for spacing symbols from Japanese and Indic scripts… it shouldn't be like this.

Supporting this would allow much better international application development… and even REDUCE the full-text index size (performance gain).

I agree Perl-style regular expression support would be handy. There might be a licensing issue in this case, though.

But I guess you did not get the point about the Chinese language - in many cases no word separators are used at all. I.e., you will get text like "thisisthetextwhichneedstobeindexed".

How do Perl expressions help you split that into words?

PCRE is BSD-licensed; not sure if that fits into the MySQL ecosystem, though.

Reference the Unicode spec freely available here:
[URL]http://www.unicode.org/Public/UNIDATA/[/URL]

You can use this to determine the properties of a character.

Here’s what I’d do:

Firstly, strip all non-letter, non-number unicode characters. These characters have no place in a full-text index.

I would then build the full-text index analysing the character properties.

The Unicode spec provides RANGES for different scripts:
Code point range Block name Release
U+3400…U+4DB5 CJK Unified Ideographs Extension A 3.0

So, we've now determined that any character in the range U+3400 to U+4DB5 is a CJK character, and each such character must be treated much like a "word" in a Western script.
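
For illustration, a character's block can be checked like this in PHP (the helper name is mine, and the second range, U+4E00-U+9FFF, is not in the table above but covers the main CJK Unified Ideographs block):

function is_cjk_char($char) {
    // True if $char falls in CJK Unified Ideographs Extension A (U+3400-U+4DB5)
    // or the main CJK Unified Ideographs block (U+4E00-U+9FFF).
    return preg_match('/^[\x{3400}-\x{4DB5}\x{4E00}-\x{9FFF}]$/u', $char) === 1;
}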

Here's an FT engine for MSSQL which supports CJK via the methods I have described:
[URL]http://www.codeproject.com/useritems/dexleapftc.asp[/URL]

I quote:

[B]Quote:[/B]
Splitting words from a phrase is called "stemming". Most full-text engines use "stemming symbols" to stem Latin based languages such as English, French, Spanish, Italian and German. The stemming symbols include space,comma,period and others. For far east languages such as Chinese, Japanese and Korean (CJK), one character is a stemming symbol for itself.

The FTC uses the UNICODE range to do stemming. For any given string, the FTC handles it char by char. If the char is between 0x0041 and 0x005a, the FTC knows it is a meaningful Latin char. The FTC doesn’t check the next char until it hits any non-Latin UNICODE. As long as a non-Latin code gets hit, the FTC will build a word from the previous chars. Note: The range 0x0041-0x005a is just an example. Please check www.unicode.org to find out more details about UNICODE.

Hi,

I've done a test to try and show you what I mean, and how easy it is to achieve.

You’ll need a fairly recent version of PHP.

Works on my pretty standard installation of PHP at home (PHP 5.1.4).
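
Roughly, the idea is something like this (a simplified sketch rather than the exact script; UTF-8 input is assumed, and only the two CJK ranges mentioned above are covered):

function tokenize($text) {
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array();
    $word = '';
    foreach ($chars as $char) {
        if (preg_match('/^[\x{3400}-\x{4DB5}\x{4E00}-\x{9FFF}]$/u', $char)) {
            // Each CJK ideograph becomes a token of its own.
            if ($word !== '') { $tokens[] = $word; $word = ''; }
            $tokens[] = $char;
        } elseif (preg_match('/^[\p{L}\p{N}]$/u', $char)) {
            // Letters and digits from other scripts accumulate into a word.
            $word .= $char;
        } else {
            // Anything else (spaces, punctuation, symbols like ★) ends the current word.
            if ($word !== '') { $tokens[] = $word; $word = ''; }
        }
    }
    if ($word !== '') { $tokens[] = $word; }
    return $tokens;
}

// tokenize("MySQL支持中文吗?") yields: MySQL, 支, 持, 中, 文, 吗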

Speeple,

So you can use MySQL 5.1 and implement the same parsing in your UDF (MySQL 5.1 allows you to use plugins for full-text search to do the data processing).

Now, I still think you oversimplify things. I'm no expert in Chinese or Japanese, but as far as I remember their words map to sequences of characters rather than to single characters. Some simple words may be a single character, but others are not.

In this case, treating characters as keywords, you can find what you're looking for, but the results may contain a lot of garbage, especially with the frequency-based relevance that MySQL full-text search uses.

Thanks for the code. I hope it will be useful to the users.

Thanks Speeple and Peter :)

I may be oversimplifying things, but it IS an improvement over the current state of full-text search.

Most people are not going to implement a UDF - although personally I find the introduction of UDFs very attractive.

I think people would be VERY pleased if a system like the one described above were implemented to support CJK glyphs in MySQL full-text. Sure, it wouldn't be 100% perfect, but even Western-script support in MySQL FT isn't perfect.

I do not know enough about Japanese to be able to tell how functional it would be.

I know that if you did the same for Western languages, indexing letters as words, it would be horrible both for quality and search performance.

There is, however, no harm if such a UDF were provided. Few people may be able to implement a UDF, but someone who can could get it uploaded to forge.mysql.com for others to use :)

It’d be more functional than FT’s current state!

[B]Quote:[/B]
I know that if you did the same for Western languages, indexing letters as words, it would be horrible both for quality and search performance.

Yes, but they are mutually exclusive. Treat Western words as it already does, and CJK glyphs as I have demonstrated in the code.

Put it this way:

FT search in MySQL is currently USELESS for CJK.

Implement something to split the glyphs as in the code I have described, and FT in MySQL will be infinitely better than its current implementation with regard to CJK.

You did not understand me. I mean I do not know how characters are distributed in Japanese/Chinese and whether it makes any sense to treat them as words. For Western languages this would not work; for Chinese - I do not know.

Speaking about "bad is better than nothing", I do not fully agree.

People know MySQL full-text search does not support Chinese, so they can look for other solutions - such as Senna, writing their own UDF, using Sphinx, using Lucene, or using stored-procedure-based full-text search.

If it is supported in an ugly way, people may test it on a table with 10 rows, be satisfied with the results, and put it into production, which would turn into a complete disaster.

I agree it should not be implemented if the only way is a bad method.

Do you have any native Japanese/Chinese/Korean speakers who can give a better opinion?

I spoke to people at the MySQL Users Conference; they use Senna, which I mentioned, and it uses a special toolkit for Japanese stream processing. The simple method is to use series of characters, if I'm not mistaken - this is better than splitting by single characters if relevance takes it into account: the more characters that match, the higher the result ranks.
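
If "series of characters" here means character n-grams (my reading; bigrams are a common choice for CJK), a minimal sketch of that approach in PHP would be:

function bigrams($text) {
    // Emit overlapping two-character tokens from a UTF-8 string.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $tokens = array();
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $tokens[] = $chars[$i] . $chars[$i + 1];
    }
    return $tokens;
}

// bigrams("需要索引") yields 需要, 要索, 索引 - a query for "索引" matches its
// bigram directly, and the more bigrams a document shares with the query,
// the higher it ranks.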

In any case you're right. Whichever way you do it, you need a native speaker to evaluate the results.

Do you know how Senna performs in comparison to regular full-text searches? Is there much of an overhead?

I do not know yet.

We recently tried to test it on a Wikipedia dump (English version), but it could not index it in 24 hours, so we aborted the test.

The developers, however, told us it is a kind of bug in the text parser, which generates too large a dictionary and therefore requires massive amounts of memory.

If you really need speed, I would be looking at Sphinx and Lucene.
