
Mailing List Archive: Lucene: Java-User

a question for french analyzer

 

 



chris.lu at gmail

Jul 30, 2007, 11:06 AM

Post #1 of 7 (612 views)
a question for french analyzer

Hi,

I am not a French speaker, but I have a question about the
French analyzers:

Is there any analyzer that can map accented letters to their
unaccented counterparts (é,è,ê,ë -> e), so that

a search for "fenêtre" (= window) finds all docs with "fenêtre" or "fenetre"
and
a search for "fenetre" returns the same result, all docs with "fenêtre" or "fenetre"?

The current analyzers, Snowball-French and FrenchAnalyzer, don't have this feature.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Jul 30, 2007, 11:18 AM

Post #2 of 7 (574 views)
Re: a question for french analyzer

Gosh, I sure hope not, because that would mean that we rolled our
own for no good reason. We wound up just collapsing
the input stream by substituting plain old 'e' for all the accented
variants before indexing and before searching. Be *really* careful
what character set you're using.

Actually, we would have still had to roll our own because the
character mapping was...er...wonky <G>....

You have to store the data raw for display purposes if you want the
accents to show though...

Best
Erick
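
The roll-your-own approach described above can be sketched with nothing but the JDK. This is an illustration under stated assumptions, not the code Erick's team used: the class FoldedIndexSketch and the helper foldE are hypothetical names, tokenizing is a plain whitespace split, and only the four accented variants of 'e' named in the thread are handled. The raw text is kept for display, while the same folding is applied to tokens at index time and to the term at search time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index: stores raw text, indexes accent-folded tokens.
class FoldedIndexSketch {
    private final List<String> stored = new ArrayList<>();                   // raw text, kept for display
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();  // folded term -> doc ids

    // Replace the accented variants of 'e' (e-acute, e-grave, e-circumflex, e-diaeresis) with plain 'e'.
    static String foldE(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\u00e9': case '\u00e8': case '\u00ea': case '\u00eb':
                    out.append('e');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Index time: keep the original text, index the folded tokens.
    void add(String text) {
        int id = stored.size();
        stored.add(text);
        for (String token : text.split("\\s+")) {
            postings.computeIfAbsent(foldE(token), k -> new TreeSet<>()).add(id);
        }
    }

    // Search time: fold the query term the same way, return the raw stored text.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(foldE(term), new TreeSet<>())) {
            hits.add(stored.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        FoldedIndexSketch idx = new FoldedIndexSketch();
        idx.add("la fen\u00eatre ouverte");
        idx.add("la fenetre fermee");
        // Both searches list both documents, with accents intact in the stored text.
        System.out.println(idx.search("fenetre"));
        System.out.println(idx.search("fen\u00eatre"));
    }
}
```

Folding on only one side (index or query) would make the accented and unaccented forms stop matching each other, which is why the sketch routes both paths through the same helper.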



samir.abdou at unine

Jul 30, 2007, 11:18 AM

Post #3 of 7 (578 views)
RE: a question for french analyzer

Hi,

Take a look at the class ISOLatin1AccentFilter! Add it to your analyzer
and it should work!

Hope this will help,
Samir



chris.lu at gmail

Jul 30, 2007, 1:31 PM

Post #4 of 7 (563 views)
Re: a question for french analyzer

Hi, Samir,

Thanks a lot for this tip! It works great!



chris.lu at gmail

Jul 30, 2007, 1:36 PM

Post #5 of 7 (564 views)
Re: a question for french analyzer

Hi, Erick,

I added ISOLatin1AccentFilter to FrenchAnalyzer following Samir's tip,
and it works great! I think it's the right way to go. Problems
like "You have to store the data raw for display purposes if you want
the accents to show though" go away, since the Analyzer mechanism
already keeps the original text separate from the analyzed tokens.
And it's pretty easy to do!

However, are there any special cases you ran into? Not really knowing
French, I only tested one word, "fenêtre", and it is analyzed into
"fenetre".



renaud.waldura at library

Jul 30, 2007, 2:25 PM

Post #6 of 7 (558 views)
RE: a question for french analyzer

Being a French speaker, I will mention the following special cases:

- "plus ça change" -> "plus ca change"
- "œuf" -> "oeuf"
- "lætitia" -> "laetitia"

But I just looked, and it looks like ISOLatin1AccentFilter handles these.
Better test to be sure...

--Renaud
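
One way to check special cases like these without Lucene on the classpath is a small JDK-only folding routine. The sketch below is not how ISOLatin1AccentFilter is implemented; it swaps in java.text.Normalizer, and the class name AccentFoldCheck and method fold are hypothetical. It assumes that decomposing to NFD and dropping combining marks is an acceptable stand-in for accent removal. The ligatures œ and æ have no Unicode decomposition, so they are mapped by hand.

```java
import java.text.Normalizer;

// Hypothetical helper for testing accent folding on the cases listed above.
class AccentFoldCheck {
    static String fold(String s) {
        // NFD splits an accented letter into its base letter plus a combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Drop the combining marks left behind by the decomposition.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Ligatures do not decompose, so replace them explicitly.
        return stripped
                .replace("\u0153", "oe").replace("\u0152", "OE")
                .replace("\u00e6", "ae").replace("\u00c6", "AE");
    }

    public static void main(String[] args) {
        System.out.println(fold("plus \u00e7a change")); // plus ca change
        System.out.println(fold("\u0153uf"));            // oeuf
        System.out.println(fold("l\u00e6titia"));        // laetitia
    }
}
```

Running the same inputs through the real filter and comparing against this output is a quick way to confirm the behaviour rather than assume it.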




erickerickson at gmail

Jul 30, 2007, 5:13 PM

Post #7 of 7 (554 views)
Re: a question for french analyzer

<<<However, is there any special case that you have?>>>

Yes, the character set we use is, as I remember,
MARC-8. Which I don't think is the ISOLatin....,
but since I didn't know about that filter when we had our problem,
I didn't even look. Oh well, smarter/braver/lazier next time <G>...

Which is why I love this list, I find things like this and look
smarter next time something similar comes up <G>.

Thanks
Erick

