
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

 

 



jira at apache

Nov 16, 2009, 9:29 AM

Post #1 of 21 (965 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778401#action_12778401 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, if you have a moment maybe you can review this one for me?

> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
> Key: LUCENE-2069
> URL: https://issues.apache.org/jira/browse/LUCENE-2069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly.
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 16, 2009, 12:22 PM

Post #2 of 21 (945 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778504#action_12778504 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

Robert, I assume you did use those weird chars in the test on purpose - I wonder if there are some "real" codepoints that we could use in the test?

The code looks good to me; this is the way to go for char lowercasing with Unicode 4.0.



jira at apache

Nov 16, 2009, 12:26 PM

Post #3 of 21 (932 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778508#action_12778508 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!
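
To make the point concrete, here is a minimal sketch of why supplementary characters need code-point-aware lowercasing. It is not the attached patch: the class and method names are made up for illustration, and the loop assumes the lowercase form occupies as many chars as the original. The sample character is U+10400 (DESERET CAPITAL LETTER LONG I), which is stored as a surrogate pair.

```java
final class SupplementaryLowerCase {
    /** Lowercases buffer[0..length) in place, one UTF-16 code unit at a time. */
    static void lowerCaseByChar(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            // Surrogates have no case mapping of their own, so pairs pass through unchanged.
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
    }

    /** Lowercases buffer[0..length) in place, one code point at a time. */
    static void lowerCaseByCodePoint(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int codePoint = Character.codePointAt(buffer, i, length);
            // Assumption: the lowercase form needs as many chars as the original.
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }

    public static void main(String[] args) {
        // "A" followed by U+10400, written as its surrogate pair.
        char[] byChar = "A\uD801\uDC00".toCharArray();
        char[] byCodePoint = byChar.clone();

        lowerCaseByChar(byChar, byChar.length);
        lowerCaseByCodePoint(byCodePoint, byCodePoint.length);

        // Char-based loop leaves the capital letter alone: prints 10400
        System.out.println(Integer.toHexString(Character.codePointAt(byChar, 1)));
        // Code-point loop maps it to U+10428: prints 10428
        System.out.println(Integer.toHexString(Character.codePointAt(byCodePoint, 1)));
    }
}
```

Both loops lowercase the leading "A"; only the code-point loop touches the Deseret letter.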



jira at apache

Nov 16, 2009, 12:30 PM

Post #4 of 21 (931 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778509#action_12778509 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

We might need a CHANGES.txt entry here too?!



jira at apache

Nov 16, 2009, 12:34 PM

Post #5 of 21 (929 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778510#action_12778510 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, yes, see LUCENE-1689.
This is my question of the day: how do we handle this? In a way it is a backwards-compatibility break, but honestly it is a bugfix, because we should have supported Unicode 4.0 in Lucene 3.0, since that's the Unicode version of Java 5.



jira at apache

Nov 16, 2009, 12:38 PM

Post #6 of 21 (928 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778514#action_12778514 ]

Uwe Schindler commented on LUCENE-2069:
---------------------------------------

We can change it whenever we want; we only have to supply a matchVersion switch.
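
As a rough illustration of what such a switch could look like, here is a self-contained sketch. The `CompatVersion` enum and the `lowerCase` method are hypothetical stand-ins invented for this example; they are not Lucene's `Version` class or the filter's real signature. The idea is only that the caller's version constant selects between the old char-based behavior and the new code-point-based one.

```java
final class VersionedLowerCase {
    /** Hypothetical stand-in for a compatibility version constant. */
    enum CompatVersion {
        V_3_0, V_3_1;

        boolean onOrAfter(CompatVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Lowercases buffer[0..length) in place, using the behavior of the given version. */
    static void lowerCase(CompatVersion matchVersion, char[] buffer, int length) {
        if (matchVersion.onOrAfter(CompatVersion.V_3_1)) {
            // New behavior: walk code points so supplementary characters are lowercased.
            for (int i = 0; i < length; ) {
                int codePoint = Character.codePointAt(buffer, i, length);
                i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
            }
        } else {
            // Old behavior, kept so existing indexes keep matching: walk single chars.
            for (int i = 0; i < length; i++) {
                buffer[i] = Character.toLowerCase(buffer[i]);
            }
        }
    }

    public static void main(String[] args) {
        char[] oldStyle = "A\uD801\uDC00".toCharArray();
        char[] newStyle = oldStyle.clone();
        lowerCase(CompatVersion.V_3_0, oldStyle, oldStyle.length);
        lowerCase(CompatVersion.V_3_1, newStyle, newStyle.length);
        System.out.println(Integer.toHexString(Character.codePointAt(oldStyle, 1)));
        System.out.println(Integer.toHexString(Character.codePointAt(newStyle, 1)));
    }
}
```

Note that the two branches duplicate the loop, which is the kind of extra code the switch brings with it.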



jira at apache

Nov 16, 2009, 12:42 PM

Post #7 of 21 (927 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778515#action_12778515 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Uwe, it is true that we can use matchVersion for all of this, and I will help.

But see my comment on LUCENE-1689 (since I feel it affects all of these issues): it will result in a lot of code complexity. Just a warning.



jira at apache

Nov 17, 2009, 1:46 PM

Post #8 of 21 (911 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779156#action_12779156 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

If you want my vote, it is that we treat issues like this as bugs and not do all this Version stuff.

I supplied this patch (22KB versus 2KB) to show how even the smallest issue creates more complexity.
Also, read the javadocs for what Version does; it reads just like a bug:
* As of 3.1, supplementary characters are properly lowercased.

I mean, honestly, it's not like we provided a back compat mechanism for 3.0, where this behavior changed for lots of contrib code that uses String-based methods such as String.toLowerCase (they return different results on JRE5 than on JRE4).

But we can go either way; it doesn't matter to me.

> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
> Key: LUCENE-2069
> URL: https://issues.apache.org/jira/browse/LUCENE-2069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly.
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)



jira at apache

Nov 17, 2009, 1:50 PM

Post #9 of 21 (912 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779160#action_12779160 ]

Mark Miller commented on LUCENE-2069:
-------------------------------------

But we try and maintain index back compatibility with bugs too? We don't want terms to be lost in an index.

But it depends, as always - if something has long been a problem and broken, then perhaps it doesn't make sense to bend over backwards about it now. We just have to look at everything, put the priority on making life best for users while balancing that somewhat against dev/maintenance headaches, and come to a consensus - easy! :)



jira at apache

Nov 17, 2009, 1:54 PM

Post #10 of 21 (913 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779164#action_12779164 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues! :)

It doesn't matter to me; I am just presenting both alternatives. All I want is for us to make a decision.




jira at apache

Nov 17, 2009, 10:04 PM

Post #11 of 21 (900 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779319#action_12779319 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

bq. Simon, those "weird" chars are indeed real codepoints that have lowercasing behavior in Unicode 4.0!
That's what I guessed :D otherwise it would not work :). I was just wondering if there are some more expressive ones out there.

bq. Mark, true, well give me some consensus so when 3.0 is released, we can start attacking these issues!
+1



jira at apache

Nov 18, 2009, 7:50 AM

Post #12 of 21 (883 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779499#action_12779499 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

bq. But we try and maintain index back compatibility with bugs too?

Mark, you are right. The Version description says this: "Match settings and bugs in Lucene's 3.0 release."
I guess we should at least try; I think we can do it.




jira at apache

Nov 25, 2009, 4:34 AM

Post #13 of 21 (739 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782393#action_12782393 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

By the way, this also works for CharArraySet - that way we can easily implement it with Version without duplicating any code. Readable, clean and compatible.

I will update the CharArraySet patch once I get comments on this.
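
The patch itself is not reproduced in this thread, so the following is only a guess at the general shape of such a helper, with every name (`CodePointSupport`, `select`, and so on) invented for illustration. It shows how hiding the "what is a code point here?" decision behind one small abstraction lets a single lowercasing loop serve both the old and the new behavior.

```java
abstract class CodePointSupport {
    /** Returns the code point starting at chars[offset], reading no further than limit. */
    abstract int codePointAt(char[] chars, int offset, int limit);

    /** Supplementary-aware behavior: surrogate pairs are decoded into one code point. */
    static final CodePointSupport CODE_POINTS = new CodePointSupport() {
        @Override
        int codePointAt(char[] chars, int offset, int limit) {
            return Character.codePointAt(chars, offset, limit);
        }
    };

    /** Legacy behavior: every char is treated as its own code point. */
    static final CodePointSupport LEGACY_CHARS = new CodePointSupport() {
        @Override
        int codePointAt(char[] chars, int offset, int limit) {
            return chars[offset];
        }
    };

    /** Hypothetical selector; a real one would presumably take a version constant. */
    static CodePointSupport select(boolean supplementaryAware) {
        return supplementaryAware ? CODE_POINTS : LEGACY_CHARS;
    }

    /** The single loop shared by both behaviors. */
    static void lowerCase(CodePointSupport support, char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int codePoint = support.codePointAt(buffer, i, length);
            // A lone surrogate lowercases to itself and is written back as one char.
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }
}
```

With the legacy instance the loop degrades to the old per-char lowercasing, so no second code path is needed in the caller.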

> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
> Key: LUCENE-2069
> URL: https://issues.apache.org/jira/browse/LUCENE-2069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly.
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)



jira at apache

Nov 25, 2009, 5:42 AM

Post #14 of 21 (742 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782432#action_12782432 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Hi Simon, this is a cool idea!

I need to think this through. Can you think of other places (non-lowercasing) where we could use this?
Even if we can only use it there, I think it might still be a good idea, to keep things simple.

If we decide to use it, I do think we should mark the class deprecated and intended only for Lucene back compat purposes.




jira at apache

Nov 25, 2009, 5:56 AM

Post #15 of 21 (741 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782439#action_12782439 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, I took a quick look at the contrib analyzers, for example.
This utility class could make back compat easier for a lot of the code, e.g. Unicode block calculations in the CJK code, Greek diacritic/lowercase folding in the Greek code, ...
I think we should go this route.




jira at apache

Nov 25, 2009, 6:35 AM

Post #16 of 21 (739 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782449#action_12782449 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

I also found some others like
BrazilianStemmer
ChineseTokenizer
FrenchStemmer
DutchStemmer

and many more... +1 for this from my side.
As this seems to be fundamental, we should try to get it in sooner rather than later so we can get the rest going.

simon



jira at apache

Nov 25, 2009, 1:18 PM

Post #17 of 21 (723 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782623#action_12782623 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

damn we have to use the limit form of codePointAt, just to be sure.

if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...


> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
> Key: LUCENE-2069
> URL: https://issues.apache.org/jira/browse/LUCENE-2069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly.
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)



jira at apache

Nov 25, 2009, 1:52 PM

Post #18 of 21 (723 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782637#action_12782637 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

bq. damn we have to use the limit form of codePointAt, just to be sure.
No, we don't - at least not in this particular case.

bq. if term text truly ends with unpaired lead surrogate, codePointAt could pair it with leftover trash trail surrogate from a previous token...

If this rare situation occurs, the term length will still prevent the changed trail surrogate from being part of the token. This adds a super tiny overhead, but I guess we can simply ignore it.
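
A small sketch of the situation being debated, under the assumption that the term buffer is reused between tokens and may hold stale chars past the current term length. The class and method names are invented for the example. It shows both halves of the argument: the two-argument `Character.codePointAt` does pair the lead surrogate with the stale trail surrogate, but the chars inside the term length come out the same.

```java
final class StaleSurrogateDemo {
    /** The loop under discussion, using codePointAt without a limit. */
    static void lowerCaseNoLimit(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            int codePoint = Character.codePointAt(buffer, i);
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }

    public static void main(String[] args) {
        // Reused term buffer: the current term is "A" plus an unpaired lead
        // surrogate (length 2); index 2 still holds a trail surrogate left
        // behind by a previous, longer token.
        char[] buffer = {'A', '\uD801', '\uDC00'};
        int length = 2;

        // Without a limit the lead surrogate pairs with the stale trail: prints 10400
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 1)));
        // With the limit only the unpaired lead surrogate is seen: prints d801
        System.out.println(Integer.toHexString(Character.codePointAt(buffer, 1, length)));

        lowerCaseNoLimit(buffer, length);

        // The stale slot past the term was rewritten: prints dc28
        System.out.println(Integer.toHexString(buffer[2]));
        // The term itself, buffer[0..length), is "a" plus the same lead surrogate: prints true
        System.out.println(buffer[0] == 'a' && buffer[1] == '\uD801');
    }
}
```

In this example the lead surrogate is unchanged because the uppercase and lowercase letters share it; only the char beyond the term length is modified.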



jira at apache

Nov 25, 2009, 2:08 PM

Post #19 of 21 (721 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782649#action_12782649 ]

Simon Willnauer commented on LUCENE-2069:
-----------------------------------------

If we are really worried about it, I would suggest something like the following just above the loop:
{code}
// clear any stale char just past the term, if the buffer has room for one
if (buffer.length > length)
  buffer[length] = 0x00;
{code}





jira at apache

Nov 27, 2009, 2:51 AM

Post #20 of 21 (680 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783094#action_12783094 ]

Uwe Schindler commented on LUCENE-2069:
---------------------------------------

Looks good, +1 to commit!

> fix LowerCaseFilter for unicode 4.0
> -----------------------------------
>
> Key: LUCENE-2069
> URL: https://issues.apache.org/jira/browse/LUCENE-2069
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch, LUCENE-2069.patch
>
>
> lowercase suppl. characters correctly.
> this only fixes the filter, the LowerCaseTokenizer is part of a more complex issue (CharTokenizer)



jira at apache

Nov 27, 2009, 12:10 PM

Post #21 of 21 (672 views)
Permalink
[jira] Commented: (LUCENE-2069) fix LowerCaseFilter for unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783158#action_12783158 ]

Robert Muir commented on LUCENE-2069:
-------------------------------------

Simon, you are right, there is no problem.

Maybe for other things in the future we will need codePointAt() with the limit param; we could just add it to CharacterUtils if/when we need it.

