
Mailing List Archive: kinosearch: discuss

multilanguage indexing and search

 

 



averale at gmail

Dec 1, 2006, 8:27 AM

Post #1 of 4 (321 views)
multilanguage indexing and search

Hi.

I need to work with documents written in mixed languages:
English+Russian, English+French+Russian, Russian+Japanese, etc.
But in KinoSearch I can select only one language for the analyzer.
How can I index & search such documents?


marvin at rectangular

Dec 1, 2006, 10:43 AM

Post #2 of 4 (308 views)
multilanguage indexing and search [In reply to]

On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:

> I need to work with documents written in mixed languages:
> English+Russian, English+French+Russian, Russian+Japanese, etc.
> But in KinoSearch I can select only one language for the analyzer.
> How can I index & search such documents?

If you can divide up your documents into different fields by
language, then you can specify an analyzer for each field. If not...
well, looking at your language list, I'd say you're basically out of
luck. It's not possible to create a Tokenizer which behaves sensibly
when presented with text which might be either English, Japanese,
French or Russian. Tokenizing Japanese is really, really hard
anyway, and KinoSearch provides no native support for it.
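
A minimal sketch of the per-field idea, written in plain Python rather than against the KinoSearch API. The field names (`body_en`, `body_ru`), the toy analyzers, and the dictionary-based index are all assumptions made for illustration; the point is only that each field is analyzed with its own function, and that a query can be run through every field's analyzer.

```python
import re

def analyze_english(text):
    # Lowercase, keep ASCII letter runs, strip a trailing "s" (toy stemming).
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze_russian(text):
    # Lowercase and keep Cyrillic letter runs; no stemming in this sketch.
    return re.findall(r"[а-яё]+", text.lower())

# Hypothetical field names; each field is bound to its own analyzer.
ANALYZERS = {"body_en": analyze_english, "body_ru": analyze_russian}

def index_document(index, doc_id, fields):
    # Store postings under (field, term) so terms from different
    # languages never collide.
    for field, text in fields.items():
        for term in ANALYZERS[field](text):
            index.setdefault((field, term), set()).add(doc_id)

def search(index, query):
    # Run the query through every field's analyzer and union the hits.
    hits = set()
    for field, analyze in ANALYZERS.items():
        for term in analyze(query):
            hits |= index.get((field, term), set())
    return hits

index = {}
index_document(index, 1, {"body_en": "Red cats", "body_ru": "Красные кошки"})
index_document(index, 2, {"body_en": "Blue dogs", "body_ru": "Синие собаки"})
```

With this setup, `search(index, "cats")` finds document 1 through the English field and `search(index, "собаки")` finds document 2 through the Russian field.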

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


averale at gmail

Dec 2, 2006, 1:41 PM

Post #3 of 4 (311 views)
multilanguage indexing and search [In reply to]

2006/12/1, Marvin Humphrey <marvin [at] rectangular>:
>
> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
>
> > I need to work with documents written in mixed languages:
> > English+Russian, English+French+Russian, Russian+Japanese, etc.
> > But in KinoSearch I can select only one language for the analyzer.
> > How can I index & search such documents?
>
> If you can divide up your documents into different fields by
> language, then you can specify an analyzer for each field.

That could be useful. But how can I _search_ across multiple languages?

> If not...
> well, looking at your language list, I'd say you're basically out of
> luck. It's not possible to create a Tokenizer which behaves sensibly
> when presented with text which might be either English, Japanese,
> French or Russian.

Why can't I use a simple $word_char_tokenizer for this set of languages?

A universal stemmer for mixed texts is the problem. I can separate words
written in Latin and Cyrillic characters and use a special stemmer for
the Russian words. But how can I separate English & French?
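
The Latin/Cyrillic split described above could look roughly like the following Python sketch. The two stemmers are deliberately crude placeholders (assumptions, not real stemming algorithms); only the routing by script is the point.

```python
import re

# Any character from the basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def stem_russian(word):
    # Placeholder: strip a few common endings, keeping at least 3 letters.
    for suffix in ("ами", "ов", "ы", "и", "а"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_latin(word):
    # Placeholder shared by English and French: drop a final "s".
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def analyze(text):
    # Tokenize on word characters, then pick a stemmer per token by script.
    terms = []
    for token in re.findall(r"\w+", text.lower()):
        if CYRILLIC.search(token):
            terms.append(stem_russian(token))
        else:
            terms.append(stem_latin(token))
    return terms
```

Routing per token means a single field can hold mixed Latin and Cyrillic text; it does not help with telling English from French, since both use the same script.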

> Tokenizing Japanese is really, really hard
> anyway, and KinoSearch provides no native support for it.

Yes, tokenizing Japanese is hard, but possible - AFAIR dpsearch &
mnogosearch can index and search Japanese. But it isn't a critical
point at the moment ;)


hugues at mazancourt

Dec 3, 2006, 9:43 AM

Post #4 of 4 (310 views)
multilanguage indexing and search [In reply to]

On Dec 2, 2006, at 22:23, Alex Aver wrote:

> 2006/12/1, Marvin Humphrey <marvin [at] rectangular>:
>>
>> On Dec 1, 2006, at 8:09 AM, Alex Aver wrote:
>>
> [...]
> Why can't I use a simple $word_char_tokenizer for this set of languages?
>
> A universal stemmer for mixed texts is the problem. I can separate words
> written in Latin and Cyrillic characters and use a special stemmer for
> the Russian words. But how can I separate English & French?

You don't necessarily need to. 80% of the job an English stemmer does is
to remove "s"/"es" at the end of a word, which also works fine for
French. The other rules (such as s/ed$//) won't hurt because they
don't match French words.
You can also add some French rules to your stemmer, such as
s/aux$/al/, which won't have any effect on English words.

In fact, the most important thing is that you use the *same* stemmer
for indexing and querying, whatever stemming it performs.
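
As a rough illustration in Python, a combined stemmer built from the rules mentioned above (s/aux$/al/, s/ed$//, and dropping a final "es" or "s") can be shared by indexing and querying. The minimum stem length of 3 and the in-memory index are assumptions added for this sketch.

```python
import re

def stem(word):
    # French rule: s/aux$/al/ (chevaux -> cheval).
    if word.endswith("aux"):
        return word[:-3] + "al"
    # English-style rules: drop "ed", "es" or "s", keeping at least 3 letters.
    for suffix in ("ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Runs of letters only (no digits or underscores), lowercased.
    return re.findall(r"[^\W\d_]+", text.lower())

def build_index(docs):
    # Map each stemmed term to the set of document ids containing it.
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(stem(token), set()).add(doc_id)
    return index

def search(index, query):
    # The query goes through the same tokenizer and stemmer as the documents.
    terms = [stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "Les chevaux", 2: "The horses walked", 3: "Un cheval"}
index = build_index(docs)
```

Here `search(index, "cheval")` matches documents 1 and 3, and `search(index, "walks")` matches document 2, because both sides pass through the same `stem` function.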

>
>> Tokenizing Japanese is really, really hard
>> anyway, and KinoSearch provides no native support for it.
>
> Yes, tokenizing Japanese is hard, but possible - AFAIR dpsearch &
> mnogosearch can index and search Japanese. But it isn't a critical
> point at the moment ;)

mnoGoSearch uses ChaSen, a free Japanese parser that has a Perl
front end. See
http://rpmfind.net/linux/RPM/suse/9.3/i386/suse/i586/perl-Text-ChaSen-2.3.3-97.i586.html
More generally, there are some pointers on analyzing Japanese here:
http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hoary/japanese/
Best,

Hugues
