
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

 

 



jira at apache

Nov 24, 2009, 6:29 AM

Post #1 of 82 (874 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi Simon, at a glance your patch is OK.

I wonder, though, whether we should improve both this patch and the LowerCaseFilter patch in the same way.
I have two ideas that might make that easier. I am very inconsistent with these things myself, so I guess we can try to make them consistent.

1.
{code}
for (int i = 0; i < len; i++) {
  // old, char-based check: if (Character.toLowerCase(text1[off+i]) != text2[i])
  final int codePointAt = Character.codePointAt(text1, off + i);
  if (Character.toLowerCase(codePointAt) != Character.codePointAt(text2, i))
    return false;
  // a supplementary code point occupies two chars, so skip the low surrogate
  if (codePointAt >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
    ++i;
  }
}
{code}

I wonder if instead loops like this should look like
{code}
for (int i =0; i < len; ) {
...
i += Character.charCount(codepoint);
}
{code}

2. I wonder if we should even add an if (supplementary) for things like lowercasing.
toLowerCase(ch) and toLowerCase(int) are most likely the same code anyway,
so we could just make the code easier to read.
{code}
for (int i = 0; i < len; ) {
i += Character.toChars(arr, ...
Character.toLowerCase(
Character.codePointAt(...)))
}
{code}
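The two ideas above can be sketched as runnable Java. This is only an illustration under stated assumptions, not the code from the attached patch: the class name `CodePointCaseSketch` and both method names are hypothetical, `text2` is assumed to be already lowercased, and the in-place variant assumes that a lowercased code point needs the same number of chars as the original.

```java
final class CodePointCaseSketch {

  /**
   * Idea 1: compare text1[off, off+len) against the already lowercased text2,
   * advancing by Character.charCount instead of a special-case "++i".
   */
  static boolean equalsIgnoreCase(char[] text1, int off, int len, char[] text2) {
    if (len != text2.length) {
      return false;
    }
    final int limit = off + len;
    for (int i = 0; i < len; ) {
      // The limit stops a trailing high surrogate from pairing with a char outside the range.
      final int codePoint = Character.codePointAt(text1, off + i, limit);
      if (Character.toLowerCase(codePoint) != Character.codePointAt(text2, i)) {
        return false;
      }
      i += Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
    }
    return true;
  }

  /**
   * Idea 2: lowercase buffer[off, off+len) in place with no "if (supplementary)" branch.
   * Assumes the lowercased code point occupies as many chars as the original.
   */
  static void toLowerCase(char[] buffer, int off, int len) {
    final int limit = off + len;
    for (int i = off; i < limit; ) {
      final int lower = Character.toLowerCase(Character.codePointAt(buffer, i, limit));
      // toChars writes one or two chars at index i and returns how many it wrote.
      i += Character.toChars(lower, buffer, i);
    }
  }

  public static void main(String[] args) {
    char[] word = "\uD801\uDC00AB".toCharArray(); // U+10400 followed by "AB"
    toLowerCase(word, 0, word.length);
    System.out.println(equalsIgnoreCase("\uD801\uDC00AB".toCharArray(), 0, 4, word));
  }
}
```

Both loops treat BMP and supplementary characters through the same int-based calls, which is the readability gain being discussed.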


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet does lowercasing if created with the corresponding flag. As a result, String / char[] values containing Unicode 4 chars that are in the set cannot be retrieved in "ignorecase" mode.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 6:41 AM

Post #2 of 82 (847 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781959#action_12781959 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

Robert, I tried to make it consistent with the LowerCaseFilter issue, but I would vote +1 for both! This makes it much cleaner, but we need to change the LowerCaseFilter one too!
I will quickly change my patch.



jira at apache

Nov 24, 2009, 6:41 AM

Post #3 of 82 (845 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781958#action_12781958 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]).



jira at apache

Nov 24, 2009, 6:43 AM

Post #4 of 82 (845 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781960#action_12781960 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Maybe we put this into UnicodeUtils (handling of toLowerCase etc for char[]).
I think calling those 3 methods should be fine without a utils method. We will see how it goes; by the "end" of this whole issue I might change my mind.

simon



jira at apache

Nov 24, 2009, 6:45 AM

Post #5 of 82 (848 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781962#action_12781962 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon definitely, it is not a problem with your patch...
Thinking we can fix both to be clean.

btw, I have no idea if there is any performance difference between doing things this way.




jira at apache

Nov 24, 2009, 6:49 AM

Post #6 of 82 (846 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781964#action_12781964 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. btw, I have no idea if there is any performance difference between doing things this way.
The change to charCount is pretty much the same as the if statement - this at least would not kill any performance.
The increment by 2 should also not be an issue. it is slightly slower than a ++ but this will be fine I guess.



jira at apache

Nov 24, 2009, 6:53 AM

Post #7 of 82 (844 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781965#action_12781965 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

simon yeah,

I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
In trunk ICU there are not even char-based methods; it is all int, where it's a trie lookup with a special fast-path array for linear access to Latin-1.




jira at apache

Nov 24, 2009, 7:01 AM

Post #8 of 82 (846 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781969#action_12781969 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I guess what I don't know, is if in the JDK Character.foo(int) is the same underlying stuff as Character.foo(char)
The JDK version of toLowerCase(char), for instance, casts to int and calls the overloaded method. isLowerCase(char) follows the same pattern:
  public static boolean isLowerCase(char ch) {
    return isLowerCase((int)ch);
  }

That is the case all over the place as far as I can see.
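A quick way to check this observation on a given JDK is to compare the two overloads over the whole BMP. The sketch below is only an assumption-laden illustration: the class and method names are hypothetical, and it verifies observable behaviour rather than how any particular JDK implements the char overload.

```java
final class CharVsIntOverloads {

  /** Counts BMP values for which toLowerCase(char) and toLowerCase(int) disagree. */
  static int countMismatches() {
    int mismatches = 0;
    for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
      final char viaChar = Character.toLowerCase((char) c);
      final int viaInt = Character.toLowerCase(c);
      if (viaChar != viaInt) {
        mismatches++;
      }
    }
    return mismatches;
  }

  public static void main(String[] args) {
    System.out.println("mismatches: " + countMismatches());
  }
}
```

If the count is zero, switching a loop from the char overload to the int overload does not change results for BMP input; it only adds correct handling of supplementary code points.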



jira at apache

Nov 24, 2009, 7:05 AM

Post #9 of 82 (845 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781971#action_12781971 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon, yeah, I just checked.
All the properties are stored as int behind the scenes.
We shouldn't use any char-based methods on the pretense that they will buy us faster performance.
It will just make the code ugly and probably slower.



jira at apache

Nov 24, 2009, 7:29 AM

Post #10 of 82 (846 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781980#action_12781980 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

question of the day - should we use Version or not :)





jira at apache

Nov 24, 2009, 7:33 AM

Post #11 of 82 (846 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781983#action_12781983 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API.



jira at apache

Nov 24, 2009, 7:57 AM

Post #12 of 82 (845 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781994#action_12781994 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. It would not hurt, the Set is only used for analyzers that all take a version param... It is not really a public API.
So the thing here is that lowercasing for supplementary characters only applies to a handful of chars, see this link http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[%3ACase_Sensitive%3DTrue%3A]%26[^[\u0000-\uFFFF]]]&esc=on
Those characters are from the Deseret Alphabet (mormons), which means we would be introducing a "pain in the neck" Version flag into CharArraySet for about 40 chars which would be broken?! I don't see this here! Nothing personal against the Deseret Alphabet or anyone who is using it, but this seems a bit too much of a hassle, and it would make the code very ugly.

simon
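To make the affected case concrete, here is a small sketch of how char-based and code-point-based lowercasing differ on a Deseret capital letter. It uses U+10400 (DESERET CAPITAL LETTER LONG I), whose lowercase form is U+10428; the class and method names are hypothetical and the code is not taken from Lucene.

```java
final class DeseretLowercaseDemo {

  /** Lowercases one char at a time; surrogate chars pass through unchanged. */
  static String lowerPerChar(String s) {
    final char[] chars = s.toCharArray();
    for (int i = 0; i < chars.length; i++) {
      chars[i] = Character.toLowerCase(chars[i]);
    }
    return new String(chars);
  }

  /** Lowercases one code point at a time, so surrogate pairs are handled as a unit. */
  static String lowerPerCodePoint(String s) {
    final StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      final int codePoint = s.codePointAt(i);
      sb.appendCodePoint(Character.toLowerCase(codePoint));
      i += Character.charCount(codePoint);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    final String capital = new String(Character.toChars(0x10400));
    System.out.println(lowerPerChar(capital).equals(capital));                      // true: nothing changed
    System.out.println(Integer.toHexString(lowerPerCodePoint(capital).codePointAt(0))); // 10428
  }
}
```

For BMP-only text the two methods behave the same, which is why only this small group of characters is affected by the change.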





jira at apache

Nov 24, 2009, 7:59 AM

Post #13 of 82 (848 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781995#action_12781995 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Another option would be to list a backwards break in the changes:

if you are indexing the Deseret language, you should reindex.

We could remove the Version from LowerCaseFilter this way, too.
If you are indexing this language, things weren't working right before, so you surely wrote your own filters...?!



jira at apache

Nov 24, 2009, 8:01 AM

Post #14 of 82 (856 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781998#action_12781998 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

I would also break compat in LowerCaseFilter and bring out a large NOTE that if you index mormon you need to reindex.



jira at apache

Nov 24, 2009, 8:23 AM

Post #15 of 82 (815 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782007#action_12782007 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

+1 for breaking backwards compatibility for these chars. From the web: there are only 4 books written in this charset (the books of mormon, see [http://en.wikipedia.org/wiki/Deseret_alphabet], [http://www.omniglot.com/writing/deseret.htm]), so it is rather rare. People affected by this will for sure have their own analyzers.



jira at apache

Nov 24, 2009, 8:25 AM

Post #16 of 82 (815 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782009#action_12782009 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Simon, yeah. It's tricky, you know, like many suppl. char issues.

Even if we provide perfect backwards compatibility with what 3.0 did, if you care about these languages, you *WANT* to reindex, because stuff wasn't working at all before.
And if you really care, you weren't using any of Lucene's analysis components anyway (except maybe WhitespaceTokenizer).
For example, StandardAnalyzer currently discards these characters anyway.

But we don't want to screw over CJK users where things might have been "mostly" working before, either.
In this case, CJK is completely unaffected, so I think we should not use Version here or in any other lowercasing fixes, including LowerCaseFilter itself.




jira at apache

Nov 28, 2009, 12:55 PM

Post #17 of 82 (721 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783282#action_12783282 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Why do you use Version.LUCENE_CURRENT for all predefined stop word sets? (OK, they do not need a match version, because they are already lowercased.)

In my opinion the whole thing is only needed for CharArraySets that are not already lowercased. So is there any CharArraySet in Lucene with predefined stop words that is not lowercased?

How about deprecating lowercasing altogether and requiring stop lists to be lowercased before they are added to a CharArraySet? For the current hard-coded sets, it's no problem. And all File/Reader/... params to analyzers with lowercase could be deprecated, and the user told to use the new ones, which need already lowercased stop word sets.



jira at apache

Nov 29, 2009, 4:02 AM

Post #18 of 82 (712 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783392#action_12783392 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. Why do you use Version.LUCENE_CURRENT for all predefined stop word sets (ok, they do not need a match version, because they are already lowercased).

1. They do not ignore case at all, so the version will not affect those sets.
2. They are private and we have full control over the sets. They are all lowercased (as you figured correctly) and none of them contains any supplementary character.
3. They are static and private, so passing any user-supplied version is not feasible.

bq. In my opinion the whole stuff is only needed for chararrayssets, which are not already lowercased. So is there any chararrayset in lucene with predefined stop-words, that is not lowercased)?
Either way, whether the set is lowercased or not, the lowercasing is also applied to the values checked against the set.



jira at apache

Nov 29, 2009, 4:20 AM

Post #19 of 82 (713 views)
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783393#action_12783393 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

bq. Either way, whether the set is lowercased or not, the lowercasing is also applied to the values checked against the set.

If the LowerCaseFilter is applied before the stop words, there is no need for ignore-case checking.



jira at apache

Nov 29, 2009, 4:44 AM

Post #20 of 82 (710 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783394#action_12783394 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. If the LowerCaseFilter is applied before the stop words, there is no need for ignore-case checking.

No doubt! :) But if you do not want your terms to be lowercased, yet do not care whether "The" has an uppercase "T", you want this behaviour. Either way, we need the Version somehow to preserve backwards compatibility.

We should rather think about breaking backwards compatibility for this particular language (Deseret), but we have no idea what will happen with Unicode in the future. It's tough.
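
The use case described above (terms keep their case, stop words match regardless of case) could be sketched roughly as follows. This is an illustration only, not StopFilter's implementation; the class, the stop set, and the method name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class IgnoreCaseStopSketch {

    // Hypothetical stop set, stored lowercased.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "a", "an"));

    /** Drops stop words case-insensitively; the surviving terms keep their original case. */
    static List<String> removeStopWords(List<String> terms) {
        List<String> kept = new ArrayList<String>();
        for (String term : terms) {
            // Only the lookup key is lowercased, never the emitted term.
            if (!STOP_WORDS.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

Here `String.toLowerCase` already works on code points; the issue under discussion is that the char[]-based lookup in CharArraySet has to do the same.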





jira at apache

Nov 29, 2009, 4:46 AM

Post #21 of 82 (711 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783395#action_12783395 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi Simon,

One thing I noticed is with this patch we get:
{code}
public StopFilter(Version matchVersion, boolean enablePositionIncrements, TokenStream input, Set<?> stopWords, boolean ignoreCase)
{code}

I know this is really not related to what you are doing here, but I wonder if instead StopFilter should look like this:
{code}
public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, boolean ignoreCase)
{code}

and use matchVersion to determine enablePositionIncrements.

I think it's already weird how you create a StopFilter: you have to pass Version to a static method, getEnablePositionIncrementsVersionDefault. I don't think the user should have to pass Version twice:
{code}
new StopFilter(Version.WHATEVER, StopFilter.getEnablePositionIncrementsVersionDefault(Version.WHATEVER), ...)
{code}

I guess I think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity.
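
A rough sketch of what "driven off a single Version argument" might look like. Everything here is a hypothetical stand-in rather than Lucene's actual StopFilter or Version, and the 2.9 cutoff for the position-increment default is an assumption made for illustration.

```java
final class VersionDrivenStopFilterSketch {

    /** Hypothetical stand-in for Lucene's Version class. */
    enum Version {
        LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31;

        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    final boolean enablePositionIncrements;
    final boolean ignoreCase;

    /** Preferred form: the position-increment default is derived from the version. */
    VersionDrivenStopFilterSketch(Version matchVersion, boolean ignoreCase) {
        // Assumption: position increments are enabled by default from 2.9 onwards.
        this.enablePositionIncrements = matchVersion.onOrAfter(Version.LUCENE_29);
        this.ignoreCase = ignoreCase;
    }

    /** Older form with the explicit boolean, kept only as a deprecated overload. */
    @Deprecated
    VersionDrivenStopFilterSketch(boolean enablePositionIncrements, boolean ignoreCase) {
        this.enablePositionIncrements = enablePositionIncrements;
        this.ignoreCase = ignoreCase;
    }
}
```

The caller then passes the version exactly once, instead of once to the constructor and once to a static helper.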




jira at apache

Nov 29, 2009, 4:54 AM

Post #22 of 82 (709 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ]

Uwe Schindler commented on LUCENE-2094:
---------------------------------------

Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method, or deprecate it and no longer use it in the code. Instead, use matchVersion everywhere.



jira at apache

Nov 29, 2009, 5:00 AM

Post #23 of 82 (712 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783399#action_12783399 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Uwe, yeah, that is what I was thinking.
I guess I think an alternate ctor that allows explicit control of this with a boolean is ok,
but I think if you want the "defaults" it should just be with Version.

This really doesn't have a lot to do with Simon's patch but it becomes noticeable now.



jira at apache

Nov 29, 2009, 5:24 AM

Post #24 of 82 (711 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783402#action_12783402 ]

Michael McCandless commented on LUCENE-2094:
--------------------------------------------

bq. I guess I think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity.

OK, I agree, let's also push Version down into StopFilter (to get posIncr setting).



jira at apache

Nov 29, 2009, 6:09 AM

Post #25 of 82 (711 views)
Permalink
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783406#action_12783406 ]

Simon Willnauer commented on LUCENE-2094:
-----------------------------------------

bq. I guess I think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity.

This is one thing I thought about too. I did not change it, to keep the noise in the patch as low as possible, but if we want to do it we can do it in this patch too.

The question of whether we want to drop backwards compatibility and simply update CharArraySet to Unicode 4.0 seems more important. But IMO, if we push Version into StopFilter, we can also make CharArraySet use Version.

Thoughts?
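
For illustration, a CharArraySet that takes a Version could gate its lowercasing roughly like this. This is a sketch under assumptions, not the patch: the class, the enum, and the choice of 3.1 as the cutover (taken from the issue's "Fix For" field) are all hypothetical.

```java
final class VersionedLowerCaseSketch {

    /** Hypothetical stand-in for Lucene's Version class. */
    enum Version {
        LUCENE_30, LUCENE_31;

        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    /** Returns a lowercased copy of text, using the behaviour that matches matchVersion. */
    static char[] lowerCase(Version matchVersion, char[] text) {
        char[] out = text.clone();
        if (matchVersion.onOrAfter(Version.LUCENE_31)) {
            // New behaviour: lowercase whole code points (assumes the char count per code point is unchanged).
            for (int i = 0; i < out.length; ) {
                i += Character.toChars(
                    Character.toLowerCase(Character.codePointAt(out, i)), out, i);
            }
        } else {
            // Old behaviour, kept for backwards compatibility: lowercase char by char.
            for (int i = 0; i < out.length; i++) {
                out[i] = Character.toLowerCase(out[i]);
            }
        }
        return out;
    }
}
```

Indexes built with the old behaviour keep matching when the older version is passed, while new code gets correct handling of supplementary characters such as Deseret.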

