averale at gmail
Dec 2, 2006, 1:41 PM
Post #3 of 4
2006/12/1, Marvin Humphrey <marvin [at] rectangular>:
> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
> > I need to work with documents written in mixed languages:
> > english+russian, english+french+russian, russian+japanese, etc.
> > But in KinoSearch I can select only one language for analyzer.
> > How can I index & search such documents?
> If you can divide up your documents into different fields by
> language, then you can specify an analyzer for each field.
That could be useful. But how can I _search_ across multiple languages?
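For illustration, the per-field idea can be sketched in plain Python (this is not the KinoSearch API; the field names and the trivial lowercase analyzers are made up): each language gets its own field and analyzer, the same analyzer is used at index time and query time, and a query is run against every field so hits from any language are collected.

```python
# Illustrative sketch only, not KinoSearch: one analyzer per
# language-specific field, applied at both index and query time.

def make_lowercase_analyzer():
    # Stand-in for a real language-aware analyzer (stemmer etc.).
    return lambda text: text.lower().split()

analyzers = {
    "body_en": make_lowercase_analyzer(),  # would be an English analyzer
    "body_ru": make_lowercase_analyzer(),  # would be a Russian analyzer
}

def index_doc(doc):
    # Analyze each field with that field's own analyzer.
    return {field: analyzers[field](text) for field, text in doc.items()}

def search(query, indexed_docs):
    # To search "in multiple languages": analyze the query once per
    # field and collect documents that match in any field.
    hits = []
    for i, doc in enumerate(indexed_docs):
        for field, analyzer in analyzers.items():
            if any(tok in doc.get(field, []) for tok in analyzer(query)):
                hits.append(i)
                break
    return hits
```

A real engine would merge scored hit lists per field rather than scan documents, but the shape of the answer is the same: one query, fanned out over per-language fields.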
> If not...
> well, looking at your language list, I'd say you're basically out of
> luck. It's not possible to create a Tokenizer which behaves sensibly
> when presented with text which might be either English, Japanese,
> French or Russian.
Why can't I use a simple $word_char_tokenizer for this set of languages?
A universal stemmer for mixed text is the real problem. I can separate
words written in Latin and Cyrillic characters and use a special stemmer
for the Russian words. But how can I separate English from French?
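The Latin/Cyrillic separation described above is easy to sketch (again in plain Python, outside KinoSearch; the tokens_by_script helper is hypothetical): tokenize on word characters, then route each token by its script so that, e.g., a Russian stemmer could be applied to the Cyrillic group. Note that this trick cannot tell English from French, since both are Latin-script.

```python
import re

# Word-character tokenizer plus a script test: Cyrillic letters live in
# the U+0400-U+04FF block, so a token made only of those characters can
# be routed to a Russian stemmer; everything else stays "latin".
WORD = re.compile(r"\w+", re.UNICODE)
CYRILLIC = re.compile(r"^[\u0400-\u04FF]+$")

def tokens_by_script(text):
    groups = {"cyrillic": [], "latin": []}
    for token in WORD.findall(text):
        key = "cyrillic" if CYRILLIC.match(token) else "latin"
        groups[key].append(token.lower())
    return groups
```

This only solves the stemmer-dispatch problem for scripts that don't overlap; for English vs. French one would need actual language identification, not character ranges.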
> Tokenizing Japanese is really, really hard
> anyway, and KinoSearch provides no native support for it.
Yes, tokenizing Japanese is hard, but possible - AFAIR dpsearch &
mnogosearch can index and search Japanese text. But that isn't a
critical point at the moment ;)