
averale at gmail
Dec 2, 2006, 1:41 PM
Post #3 of 4
2006/12/1, Marvin Humphrey <marvin [at] rectangular>:
>
> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
>
> > I need to work with documents written in mixed languages:
> > english+russian, english+french+russian, russian+japanese, etc.
> > But in KinoSearch I can select only one language for analyzer.
> > How can I index & search such documents?
>
> If you can divide up your documents into different fields by
> language, then you can specify an analyzer for each field.

That could be useful. But how can I _search_ in multiple languages?

> If not...
> well, looking at your language list, I'd say you're basically out of
> luck. It's not possible to create a Tokenizer which behaves sensibly
> when presented with text which might be either English, Japanese,
> French or Russian.

Why can't I use a simple $word_char_tokenizer for this set of languages?
The problem is a universal stemmer for mixed texts. I can separate words
into Latin and Cyrillic by their characters and use a special stemmer for
the Russian words. But how can I separate English from French?

> Tokenizing Japanese is really, really hard
> anyway, and KinoSearch provides no native support for it.

Yes, tokenizing Japanese is hard, but possible: afair dpsearch and
mnogosearch can index and search Japanese. But it isn't a critical
point at the moment ;)
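
To make the Latin/Cyrillic split concrete, here is a rough sketch of what I mean. It is in Python only for illustration, it is not KinoSearch code, and every name in it (WORD_RE, script_of, split_by_script) is made up for this example. It assumes that a plain "word character" tokenizer is enough for space-separated text and that a token's script can be guessed from the Unicode ranges of its characters.

```python
import re

# Unicode-aware "word character" tokenizer: runs of letters, digits, underscore.
WORD_RE = re.compile(r"\w+")
# Basic Cyrillic block, and ASCII letters plus accented Latin letters (assumed ranges).
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")
LATIN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]")


def script_of(token):
    """Guess the script of one token from the characters it contains."""
    has_cyrillic = bool(CYRILLIC_RE.search(token))
    has_latin = bool(LATIN_RE.search(token))
    if has_cyrillic and not has_latin:
        return "cyrillic"
    if has_latin and not has_cyrillic:
        return "latin"
    return "other"  # digits, mixed-script tokens, Japanese, ...


def split_by_script(text):
    """Tokenize lowercased text and group the tokens by guessed script."""
    buckets = {"cyrillic": [], "latin": [], "other": []}
    for token in WORD_RE.findall(text.lower()):
        buckets[script_of(token)].append(token)
    return buckets


if __name__ == "__main__":
    print(split_by_script("Hello мир, bonjour Россия 2006"))
```

The "cyrillic" bucket could then go to a Russian stemmer. The "latin" bucket still mixes English and French words, which is exactly the part I don't know how to separate. Japanese text has no spaces between words, so a whole run of it comes out as a single token in "other".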