Mailing List Archive: Lucene: Java-User

Hindi, diacritics and search results

osya_bender at hotmail

Jul 10, 2009, 12:10 PM

Post #1 of 5
Hindi, diacritics and search results

Hi All,



I'm using the default setup of Lucene (no custom analyzers configured) and
came across the following issue:

In Hindi, if a phrase contains a letter with a diacritic, Lucene will find
the phrase even if the search string contains that letter without the
diacritic.

Is this expected behavior? Maybe this is standard for all languages with
letters that have diacritics?

From a pure byte standpoint I can see the logic: the letter with the diacritic
takes 6 bytes (E0 A4 95 E0 A5 87) and the single letter takes 3 (E0 A4 95),
so if I search for *some_letter*, where some_letter has the code (E0 A4 95),
Lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that letter.
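
The byte sequences quoted above can be checked with a few lines of plain Java. This is only a sketch using the JDK, not Lucene: the class name DevanagariBytes and the helper utf8Hex are made up for illustration. It shows that E0 A4 95 decodes to U+0915 and E0 A5 87 to U+0947, and that Java does not classify the second code point as a letter, which may be relevant to how a letter-based tokenizer treats it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, JDK only; it does not use any Lucene API.
class DevanagariBytes {

    // Render the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String letter = "\u0915";            // base consonant, code point U+0915
        String withSign = "\u0915\u0947";    // same consonant followed by vowel sign U+0947

        System.out.println(utf8Hex(letter));    // E0 A4 95
        System.out.println(utf8Hex(withSign));  // E0 A4 95 E0 A5 87

        // The consonant is a letter; the vowel sign is a combining mark, not a letter.
        System.out.println(Character.isLetter('\u0915'));  // true
        System.out.println(Character.isLetter('\u0947'));  // false

        // The bare letter is a prefix of the letter-plus-sign sequence.
        System.out.println(withSign.startsWith(letter));   // true
    }
}
```

So the "diacritic" here is a separate code point that follows the letter, rather than a precomposed character.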



Any comments much appreciated.



Thanks.


rcmuir at gmail

Jul 10, 2009, 3:13 PM

Post #2 of 5
Re: Hindi, diacritics and search results

Which analyzer in particular are you using?

It's probably not doing what you want for Hindi. These "diacritics" are
important (vowels, etc.).





--
Robert Muir
rcmuir[at]gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


osya_bender at hotmail

Jul 10, 2009, 6:13 PM

Post #3 of 5
RE: Hindi, diacritics and search results

I'm using the default analyzer. Actually, it is the one that the Compass framework sets by default, but I assume it is the same one that Lucene would use by default.
Which one should I use?



rcmuir at gmail

Jul 10, 2009, 7:35 PM

Post #4 of 5
Re: Hindi, diacritics and search results

There is really no default in Lucene.

A good start for Hindi would be to try WhitespaceAnalyzer.
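
To illustrate why whitespace-based tokenization helps here, the following plain-Java sketch compares two splitting strategies. It is not Lucene code: the class TokenizeSketch and both methods are hypothetical, and the letter-run splitter is only an assumption about how a letter-based tokenizer might behave, not a confirmed description of the analyzer used in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, JDK only; neither method is a Lucene API.
class TokenizeSketch {

    // Split on runs of whitespace, keeping everything else together.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keep only maximal runs of characters for which Character.isLetter is true.
    // Combining vowel signs are not letters, so they end the current token.
    static List<String> letterRunTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Hindi words: U+0915 U+0947, then U+0932 U+093F U+090F.
        String text = "\u0915\u0947 \u0932\u093F\u090F";
        System.out.println(whitespaceTokens(text).size());  // 2
        System.out.println(letterRunTokens(text).size());   // 3
    }
}
```

With the letter-run strategy the first word is reduced to the bare consonant U+0915, so a query for that consonant alone would match it. With whitespace splitting the vowel sign stays attached and the two terms differ.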




--
Robert Muir
rcmuir[at]gmail.com



dioxide.software at gmail

Jul 13, 2009, 3:36 AM

Post #5 of 5
Re: Hindi, diacritics and search results

Apart from using WhitespaceAnalyzer, which tokenizes words based on
spaces, you can try writing a simple custom analyzer that does a bit more. I
did the following to handle Indic languages intermingled with English
content:

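import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
// WordDelimiterFilter is copied from Solr; adjust the package to wherever you put it.
import org.apache.solr.analysis.WordDelimiterFilter;

/**
 * Index-time analyzer for Indian-language text mixed with English.
 */
public class IndicAnalyzerIndex extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only.
        TokenStream ts = new WhitespaceTokenizer(reader);
        // WordDelimiterFilter(ts, generateWordParts, generateNumberParts,
        //                     catenateWords, catenateNumbers, catenateAll)
        //   generateWordParts:   if 1, generate parts of words: "PowerShot" => "Power" "Shot"
        //   generateNumberParts: if 1, generate number subwords: "500-42" => "500" "42"
        //   catenateWords:       if 1, catenate maximal runs of word parts: "wi-fi" => "wifi"
        //   catenateNumbers:     if 1, catenate maximal runs of number parts: "500-42" => "50042"
        //   catenateAll:         if 1, catenate all subword parts: "wi-fi-4000" => "wifi4000"
        ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
        // Lowercase before stop-word removal: the stop list is lowercase,
        // so a capitalized "The" would otherwise slip through.
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
        ts = new PorterStemFilter(ts);
        return ts;
    }
}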

The above is for indexing. For querying, you can use the following
values for the WordDelimiterFilter constructor, keeping everything else
the same:
ts = new WordDelimiterFilter(ts, 1, 1, 0, 0, 0);

I pulled the WordDelimiterFilter class from a Solr nightly build, as nothing
like it is available in Lucene, AFAIK.
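
For readers without the Solr class at hand, the splitting and catenation described in the parameter comments can be approximated in plain Java. This is a much-simplified sketch, not Solr's implementation: the class WordPartsSketch and its methods are hypothetical, it handles only delimiter and lower-to-upper case boundaries, and treating combining marks as word characters is an assumption made so that Indic words stay in one piece.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; this is not Solr's WordDelimiterFilter.
class WordPartsSketch {

    // Letters, digits and combining marks count as part of a word (assumption).
    static boolean isWordChar(char c) {
        int type = Character.getType(c);
        return Character.isLetterOrDigit(c)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Split a token at delimiters and at lower-to-upper case changes.
    static List<String> splitParts(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!isWordChar(c)) {
                flush(current, parts);   // a delimiter ends the current part
                continue;
            }
            boolean caseChange = current.length() > 0
                    && Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(c);
            if (caseChange) {
                flush(current, parts);
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    // Move the pending part, if any, into the result list.
    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Join all parts back together, as the "catenate" options do.
    static String catenate(List<String> parts) {
        return String.join("", parts);
    }

    public static void main(String[] args) {
        System.out.println(splitParts("PowerShot"));        // [Power, Shot]
        System.out.println(splitParts("500-42"));           // [500, 42]
        System.out.println(catenate(splitParts("wi-fi")));  // wifi
    }
}
```

In this sketch, indexing with catenation enabled would add "wifi" next to "wi" and "fi", while a query analyzed without catenation would produce only the parts; that mirrors the difference between the index-time and query-time constructor arguments above.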

In my case it's working perfectly fine for all Indian languages mixed with
English content. As you can see, for English it applies the usual processing
(stemming, stop-word removal, etc.). Try it out and do let us know if you face
any issues.

Thanks,
KK.
