
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.

 

 



jira at apache

Jul 1, 2009, 1:50 AM

Post #1 of 3 (338 views)

[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725972#action_12725972 ]

Rob ten Hove commented on LUCENE-1373:
--------------------------------------

Is it possible that the same bug causes a property whose value ends in "Type", such as "InputFileType", not to be indexed when the OS language is Dutch? I have two installations of Alfresco 3 Labs, both with Lucene 2.1.0 auto-installed and with exactly the same installation options (English as the Alfresco language). Apart from the hardware, the main difference is the OS language: both run Windows XP with SP2, but one is English and the other Dutch. On the Dutch OS, three properties with values ending in "Type" could not be found, whereas they can be found on the English installation.

> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
> ------------------------------------------------------------------------------
>
> Key: LUCENE-1373
> URL: https://issues.apache.org/jira/browse/LUCENE-1373
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis, contrib/analyzers
> Affects Versions: 2.3.2
> Reporter: Mark Lassau
> Priority: Minor
> Attachments: LUCENE-1373.patch
>
>
> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).
> Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
> StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately the default behaviour is still to be buggy.
> Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(
> I refer to:
> * BrazilianAnalyzer
> * CzechAnalyzer
> * DutchAnalyzer
> * FrenchAnalyzer
> * GermanAnalyzer
> * GreekAnalyzer
> * ThaiAnalyzer
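
To make the description above concrete, here is a minimal, self-contained sketch of how a too-permissive acronym rule can swallow a host name that ends in a dot. The three regular expressions and the class name AcronymSketch are illustrative assumptions; they are simplified stand-ins and not the grammar StandardTokenizer actually uses.

```java
import java.util.regex.Pattern;

public class AcronymSketch {
    // Assumed "loose" rule: any alphanumeric runs, each followed by a dot.
    static final Pattern LOOSE_ACRONYM =
            Pattern.compile("[A-Za-z0-9]+\\.([A-Za-z0-9]+\\.)+");
    // Assumed "strict" rule: single letters, each followed by a dot (e.g. "I.B.M.").
    static final Pattern STRICT_ACRONYM =
            Pattern.compile("[A-Za-z]\\.([A-Za-z]\\.)+");
    // Assumed host rule: alphanumeric runs separated by dots, no trailing dot.
    static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    /** Classifies a whole string using either the strict or the loose acronym rule. */
    static String classify(String text, boolean strict) {
        Pattern acronym = strict ? STRICT_ACRONYM : LOOSE_ACRONYM;
        if (acronym.matcher(text).matches()) {
            return "ACRONYM";
        }
        // Treat a trailing dot as sentence punctuation before testing for a host.
        String trimmed = text.endsWith(".")
                ? text.substring(0, text.length() - 1)
                : text;
        if (HOST.matcher(trimmed).matches()) {
            return "HOST";
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        String[] samples = {"www.apache.org.", "I.B.M.", "InputFileType"};
        for (String s : samples) {
            System.out.println(s + " loose=" + classify(s, false)
                    + " strict=" + classify(s, true));
        }
    }
}
```

Under these assumed rules, the loose pattern labels "www.apache.org." an acronym while the strict one falls through to the host rule. A string with no dot, such as "InputFileType", matches neither acronym rule.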

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Jul 5, 2009, 9:45 PM

Post #2 of 3 (283 views)

[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727390#action_12727390 ]

Mark Lassau commented on LUCENE-1373:
-------------------------------------

@Rob
This issue is about how Lucene parses ACRONYM tokens, which must contain a dot (e.g. "I.B.M."), so your problem is certainly not exactly the same.

Whether it is related to some other issue with Lucene analysers for different languages is not clear.
It depends on the workings of your application, and I would suggest you contact the Alfresco developers with this question.



jira at apache

Jul 9, 2009, 4:09 AM

Post #3 of 3 (269 views)

[ https://issues.apache.org/jira/browse/LUCENE-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729181#action_12729181 ]

Rob ten Hove commented on LUCENE-1373:
--------------------------------------

@Mark, thanks for your reply to my question. So far, the developers who worked on the application I was talking about have been able to find a workaround. One thing is certain: the token analyzer mistreats the content, whether it is an acronym or just plain text. It seems to interpret the content of database elements a bit too much rather than just treating it as plain content.

