Mailing List Archive: Lucene: Java-Dev

[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0


jira at apache

Nov 24, 2009, 7:01 AM

Post #1 of 8 (251 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ]

Simon Willnauer edited comment on LUCENE-2094 at 11/24/09 3:00 PM:
-------------------------------------------------------------------

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
The JDK version of isLowerCase(char), for instance, casts to int and calls the overloaded method.
{code}
public static boolean isLowerCase(char ch) {
    return isLowerCase((int)ch);
}
{code}

That is the case all over the place as far as I can see.
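
One way to check this claim without reading the JDK source is to compare the two overloads over every value a char can hold. The sketch below is only an illustration; the class and method names are made up for this example, and it covers just isLowerCase and toLowerCase rather than every Character property.

```java
// Sketch: check that the char and int overloads of Character agree
// for every value a char can hold (the whole BMP, surrogates included).
final class CharOverloadCheck {

    /** Returns the number of char values where the two overloads disagree. */
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = Character.MIN_VALUE; cp <= Character.MAX_VALUE; cp++) {
            char ch = (char) cp;
            boolean sameIsLower = Character.isLowerCase(ch) == Character.isLowerCase(cp);
            boolean sameToLower = Character.toLowerCase(ch) == Character.toLowerCase(cp);
            if (!sameIsLower || !sameToLower) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

If the char overloads simply delegate to the int overloads, the count should be zero.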


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] containing Unicode 4 chars that is in the set cannot be retrieved in "ignorecase" mode.
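
The kind of mismatch described in the issue can be reproduced outside Lucene. The following is a minimal sketch, not the CharArraySet code: it assumes the problem comes from lowercasing one UTF-16 char at a time, and it uses the Deseret letters U+10400 / U+10428 as an example of a supplementary character with a case mapping. The class and method names are invented for illustration.

```java
final class SupplementaryLowerCase {

    /** Lowercases one UTF-16 code unit at a time; surrogates pass through unchanged. */
    static String lowerByChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    /** Lowercases one code point at a time, so a surrogate pair is mapped as a unit. */
    static String lowerByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // DESERET CAPITAL LETTER LONG I
        String lower = new String(Character.toChars(0x10428)); // DESERET SMALL LETTER LONG I
        System.out.println(lowerByChar(upper).equals(lower));      // false
        System.out.println(lowerByCodePoint(upper).equals(lower)); // true
    }
}
```

A set that normalizes its keys with the char-based variant would store or look up the uppercase form unchanged, so a case-insensitive lookup for such a term would miss.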

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 7:09 AM

Post #2 of 8 (242 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ]

Robert Muir edited comment on LUCENE-2094 at 11/24/09 3:09 PM:
---------------------------------------------------------------

Simon, yeah, I just checked.
Behind the scenes, all the properties are stored as int.
We shouldn't use any char-based methods on the pretense that they will buy us better performance.
That would just make the code ugly and probably slower.

Slower meaning the "if" itself in the LowerCaseFilter patch; it can now be removed.
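
The patch itself is not shown in this thread, so the following is only a guess at the shape of a branch-free loop: it assumes the "if" chose between a char path and a code-point path, and it lowercases a char[] buffer in place using only the code-point methods. The class name is hypothetical.

```java
final class InPlaceLowerCase {

    /**
     * Lowercases buffer[0..length) in place, one code point at a time.
     * Assumes the lowercase form occupies as many chars as the original,
     * which this sketch takes for granted for simple case mappings.
     */
    static void toLowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int cp = Character.codePointAt(buffer, i, length);
            // toChars writes 1 or 2 chars at index i and returns how many it wrote.
            i += Character.toChars(Character.toLowerCase(cp), buffer, i);
        }
    }
}
```

There is no separate case for BMP characters here; the same two calls handle both single chars and surrogate pairs.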





jira at apache

Nov 29, 2009, 4:56 AM

Post #3 of 8 (211 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ]

Uwe Schindler edited comment on LUCENE-2094 at 11/29/09 12:56 PM:
------------------------------------------------------------------

Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead, use only matchVersion everywhere and eliminate the enablePosIncr setting altogether.

was (Author: thetaphi):
Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead, use the matchVersion everywhere.



jira at apache

Nov 30, 2009, 10:57 AM

Post #4 of 8 (200 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783789#action_12783789 ]

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 6:55 PM:
-----------------------------------------------------------------

bq. Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Solr should make Version a property of all factories and create all Filters/Parsers using that flag. That's why we implemented Version (to get rid of all these strange boolean flags). Just use Version.valueOf(property) and use the result to create your filters. It is now implemented everywhere in Lucene Core and Contrib. (Version.valueOf() would not work in 2.9, because Version extends Parameter there, but in 3.0 it's an enum.)
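
As a rough sketch of the valueOf idea, the example below resolves a configuration string to an enum constant and derives a default from it. It deliberately uses a stand-in enum named MatchVersion rather than Lucene's Version class, and the trimming and case folding of the property value are assumptions added for this example, not something the thread prescribes.

```java
import java.util.Locale;

final class VersionProperty {

    /** Stand-in for a version enum; the constant names here are illustrative. */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    /** Resolves a configuration value such as "LUCENE_29" to an enum constant. */
    static MatchVersion parse(String property) {
        // Enum.valueOf throws IllegalArgumentException for unknown names.
        return MatchVersion.valueOf(property.trim().toUpperCase(Locale.ROOT));
    }

    /** Mirrors the rule stated above: enabled when created with 2.9 or later. */
    static boolean enablePositionIncrements(MatchVersion v) {
        return v.compareTo(MatchVersion.LUCENE_29) >= 0;
    }
}
```

A factory could call parse once on its configured property and hand the result to every filter or parser it creates, so that no component needs its own boolean flag.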

was (Author: thetaphi):
bq. Why? What undesirable things happen if the QueryParser has enablePositionIncrements(true) with a StopFilter that doesn't produce gaps?

We coupled it to Version in 2.9. If you create the StopFilter with Version.LUCENE_29 it is enabled. If you pass this version to QP, it's enabled, too. Very simple?

Solr should make Version a property of all factories and create all Filters/Parsers using that flag. That's why we implemented Version (to get rid of all these strange boolean flags). Just use Version.valueOf(property) and use the result to create your filters. It is now implemented everywhere in Lucene Core and Contrib.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] containing Unicode 4 chars that is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 11:26 AM

Post #5 of 8 (200 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783806#action_12783806 ]

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 7:25 PM:
-----------------------------------------------------------------

Yes, it causes problems. If you have an old index without posincr, the query parser would produce queries that do not work (we had this issue in 2.9.1 shortly before release, one of the reasons why it was delayed).

The version flag is for backwards compatibility. If you do not reindex with a new Version constant, you should use the old version constant everywhere, and things will play happily together. Even Solr users will have old indexes, and for them there should be a property to specify the version constant (using the valueOf of enums). Solr should then create all components that require a version (and since 3.0 *all* analyzers need this) using this property. Then everything will play wonderfully together (analyzers, query parser and so on).

The Highlighter also had a problem with it (the same issue as the QP problem in pre-2.9.1)!
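
To make the index/query mismatch concrete, here is a toy model of stop-word removal with and without position gaps. It is not Lucene code: the stop-word list, the whitespace tokenization, and all names are assumptions chosen only to show how the positions diverge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PositionGaps {

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "in", "a"));

    /**
     * Returns the position of each non-stop word. With gaps preserved, removed
     * stop words still advance the position counter; otherwise they do not.
     */
    static List<Integer> positions(String text, boolean preserveGaps) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        int skipped = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(token)) {
                skipped++;
                continue;
            }
            position += 1 + (preserveGaps ? skipped : 0);
            skipped = 0;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "cat in the hat";
        System.out.println(positions(text, true));  // [0, 3]
        System.out.println(positions(text, false)); // [0, 1]
    }
}
```

If the index was built without gaps ("hat" at position 1) but the query side preserves them (expecting "hat" three positions after "cat"), an exact phrase match cannot succeed, which is why both sides need the same setting.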




jira at apache

Nov 30, 2009, 12:16 PM

Post #6 of 8 (200 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783829#action_12783829 ]

Yonik Seeley edited comment on LUCENE-2094 at 11/30/09 8:15 PM:
----------------------------------------------------------------

bq. Yes, it causes problems. If you have an old index without posincr, the query parser would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default? Seems like that's where the back compat break should have been addressed (and it was)... water under the bridge at this point though.

was (Author: yseeley [at] gmail):
bq. Yes, it causes problems. If you have an old index without posincr, the query parser would produce queries that do not work

Oh, wait, is this because things like StandardAnalyzer changed the default? Seems like that's where the back compat break should have been addressed... water under the bridge at this point though.




jira at apache

Nov 30, 2009, 1:53 PM

Post #7 of 8 (200 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783861#action_12783861 ]

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 9:52 PM:
-----------------------------------------------------------------

Committed revision: 885592

I am keeping this open for further discussion. The Version ctor param is now everywhere, and it is better than giving a boolean to *every* analyzer that uses StopFilter. That was the reason for creating the Version constants in 2.9.

bq. So I think that's the question... is it a bug or a feature?

It is a bug. Everybody should update the code and raise the version constant to 31.

was (Author: thetaphi):
Committed revision: 885592

I am keeping this open for further discussion. The Version ctor param is now everywhere, and it is better than giving a boolean to *every* analyzer that uses StopFilter. That was the reason for creating the Version constants in 2.9.

bq. So I think that's the question... is it a bug or a feature? Everybody should update the code and raise the version constant to 31

It is a bug.



jira at apache

Nov 30, 2009, 3:08 PM

Post #8 of 8 (202 views)
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783926#action_12783926 ]

Uwe Schindler edited comment on LUCENE-2094 at 11/30/09 11:08 PM:
------------------------------------------------------------------

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works properly?

Yes, I tested this before 2.9.1 (one reason why you had to respin).

QueryParser also still has the get/set for posIncr but also takes the matchVersion. Here it is the other way round: the ctor takes its default from Version, and you can change it later through a setter (which is still not deprecated and is available in 3.0).

In my opinion we should go that way (which is against Robert's opinion). The ctor taking two booleans is very bad...
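
The pattern being argued for can be sketched as follows. This is a hypothetical class, not Lucene's QueryParser or StopFilter; the enum, the class name, and the 2.9 cut-off in the constructor are assumptions taken from the discussion in this thread.

```java
final class VersionedDefault {

    /** Stand-in for a version enum; not Lucene's Version class. */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Hypothetical parser-like class used only to show the pattern. */
    static final class Parser {
        private boolean enablePositionIncrements;

        Parser(MatchVersion matchVersion) {
            // The default is derived from the version constant...
            this.enablePositionIncrements = matchVersion.compareTo(MatchVersion.LUCENE_29) >= 0;
        }

        // ...and a setter lets callers override it afterwards.
        void setEnablePositionIncrements(boolean enable) {
            this.enablePositionIncrements = enable;
        }

        boolean getEnablePositionIncrements() {
            return enablePositionIncrements;
        }
    }
}
```

Compared with a constructor that takes several booleans, the call site only names a version, and the rare caller who needs something different says so explicitly through the setter.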


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a result, a String / char[] containing Unicode 4 chars that is in the set cannot be retrieved in "ignorecase" mode.
