Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

 

 



jira at apache

Nov 24, 2009, 6:21 AM

Post #1 of 10 (455 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.txt

This patch contains a test case and a fixed CharArraySet. It does not yet use Version to preserve backwards compatibility; I am posting it to start the discussion of how we should handle this particular case.
Using Version would not be much of an issue, since all Analyzers that use a CharArraySet already have the Version class available.
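
For readers unfamiliar with the failure mode, here is a minimal sketch using only JDK calls; it is not the code from the patch. It assumes that ignore-case handling lowercases each char on its own, and the class and method names are made up for illustration. U+10400 (a Deseret capital letter) is stored as a surrogate pair in UTF-16, and surrogates have no case mapping, so the per-char result never reaches the real lowercase form U+10428.

```java
import java.util.Arrays;

final class PerCharLowercaseDemo {
    // U+10400 DESERET CAPITAL LETTER LONG I, a supplementary code point.
    static final int UPPER = 0x10400;

    /** Lowercases every UTF-16 code unit on its own, ignoring surrogate pairs. */
    static char[] lowerPerChar(char[] text) {
        char[] out = text.clone();
        for (int i = 0; i < out.length; i++) {
            out[i] = Character.toLowerCase(out[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        char[] upper = Character.toChars(UPPER);
        char[] lower = Character.toChars(Character.toLowerCase(UPPER));
        // The per-char result is still identical to the uppercase input...
        System.out.println(Arrays.equals(lowerPerChar(upper), upper));
        // ...and therefore differs from the correctly lowercased form.
        System.out.println(Arrays.equals(lowerPerChar(upper), lower));
    }
}
```

Running main prints true and then false: the "lowercased" entry is still the uppercase spelling, so a lookup that expects the lowercase form misses.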


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 6:47 AM

Post #2 of 10 (437 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.txt

Changed the loop to use Character.charCount().
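
A loop of this kind might look like the following sketch. It is not taken from the attached patch; the class and method names are hypothetical, and only the JDK methods Character.codePointAt, Character.toLowerCase(int) and Character.charCount are relied on. The loop advances by one or two chars depending on the width of the code point it has just read.

```java
final class CodePointLowercase {
    /** Returns the lowercased form of text[offset, offset + len), one code point at a time. */
    static String toLowerCase(char[] text, int offset, int len) {
        final int limit = offset + len;
        StringBuilder sb = new StringBuilder(len);
        for (int i = offset; i < limit; ) {
            // Reads a full surrogate pair if one starts at i and fits below limit.
            int cp = Character.codePointAt(text, i, limit);
            sb.appendCodePoint(Character.toLowerCase(cp));
            // charCount is 2 for supplementary code points, 1 otherwise.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "A" + new String(Character.toChars(0x10400)) + "B";
        char[] chars = in.toCharArray();
        String out = toLowerCase(chars, 0, chars.length);
        // Print the code points in hex so the result is readable on any console.
        out.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```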

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 24, 2009, 7:11 AM

Post #3 of 10 (437 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.txt

Added some more tests, including single high-surrogate chars.
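
To illustrate why unpaired high surrogates are worth testing, here is a small JDK-only sketch; it is not one of the tests in the patch, and the names are invented for the example. When a high surrogate is not followed by a low surrogate, String.codePointAt returns the surrogate itself as a one-char code point, so a code-point-aware loop should pass it through unchanged and still lowercase the surrounding characters.

```java
final class LoneSurrogateDemo {
    // High surrogate of U+10400, used here without its low-surrogate partner.
    static final char HIGH = (char) 0xD801;

    /** Lowercases s one code point at a time; unpaired surrogates are copied as-is. */
    static String lower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String atEnd = "AB" + HIGH;          // high surrogate is the last char
        String inMiddle = "A" + HIGH + "B";  // high surrogate followed by a non-surrogate
        System.out.println(lower(atEnd).equals("ab" + HIGH));
        System.out.println(lower(inMiddle).equals("a" + HIGH + "b"));
    }
}
```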

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 28, 2009, 11:45 AM

Post #4 of 10 (406 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.patch

This patch uses CharacterUtils and Version to preserve backwards compatibility. It has grown into a very large patch and changes a lot of code in core too. I'm not sure this is the best way to go given the limited use case: only Deseret has upper / lowercase pairs that are not in the BMP. Yet this could change in the future, who knows. Going this way, we could also get rid of the deprecated methods a little quicker...

From a backwards-compatibility policy perspective, we should do it that way.
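
The general idea of switching behaviour on a version constant can be sketched as follows. This is an illustration only: CompatVersion, Lowercaser and forVersion are hypothetical names invented for the example and are not Lucene's Version or CharacterUtils API. Callers that ask for the older version keep the old per-char behaviour, while newer callers get code-point-aware lowercasing.

```java
final class VersionedLowercase {
    /** Hypothetical stand-in for a compatibility version; not Lucene's Version class. */
    enum CompatVersion { V_30, V_31 }

    /** Hypothetical strategy interface for the two lowercasing behaviours. */
    interface Lowercaser {
        String lower(String s);
    }

    /** Picks the old per-char behaviour before V_31 and the code point behaviour from V_31 on. */
    static Lowercaser forVersion(CompatVersion matchVersion) {
        if (matchVersion.compareTo(CompatVersion.V_31) >= 0) {
            return VersionedLowercase::perCodePoint;
        }
        return VersionedLowercase::perChar;
    }

    static String perChar(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            out[i] = Character.toLowerCase(out[i]);
        }
        return new String(out);
    }

    static String perCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(forVersion(CompatVersion.V_30).lower(upper).equals(upper));
        System.out.println(forVersion(CompatVersion.V_31).lower(upper).equals(lower));
    }
}
```

The cost of this design, as the comment above notes, is that the version has to be threaded through every caller that constructs the set.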



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 1:55 AM

Post #5 of 10 (389 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.patch

I updated the patch to use Version in StopFilter. This seems reasonable, though.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 9:07 AM

Post #6 of 10 (383 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.patch

I hope we have made it with this patch - I don't want to keep it growing.
I fixed a problem with limits in CharArraySet (equals / getHashCode), which is also why CharacterUtils now has a codePointAt(char[], offset, limit) method.
This patch also moves Version into StopFilter but exposes an expert ctor to set the posInc manually.
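
Why a limit matters can be shown with the JDK's own Character.codePointAt(char[], int, int), used here as a stand-in for the CharacterUtils method mentioned above. The hash function below is a hypothetical illustration, not CharArraySet's actual code. If a slice ends on a high surrogate and the backing array happens to hold a low surrogate right after it, reading without a limit would pull in a char that is not part of the slice.

```java
final class LimitAwareHash {
    /** Hashes the lowercased code points of text[offset, offset + len) without reading past the slice. */
    static int hashLower(char[] text, int offset, int len) {
        final int limit = offset + len;
        int code = 0;
        for (int i = offset; i < limit; ) {
            // The limit argument stops a trailing high surrogate from pairing
            // with a low surrogate that lies outside the slice.
            int cp = Character.codePointAt(text, i, limit);
            code = code * 31 + Character.toLowerCase(cp);
            i += Character.charCount(cp);
        }
        return code;
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x10400);
        char[] buffer = {'A', pair[0], pair[1]}; // backing array holds the whole pair
        char[] exact = {'a', pair[0]};           // the same two-char slice, lowercased, on its own
        // Both calls hash a two-char slice ending in an unpaired high surrogate.
        System.out.println(hashLower(buffer, 0, 2) == hashLower(exact, 0, 2));
    }
}
```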

happy reviewing

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 11:36 AM

Post #7 of 10 (386 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.patch

Changed the StopFilter(..,posInc,..) ctor to private for convenience.



> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 12:00 PM

Post #8 of 10 (383 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2094:
------------------------------------

Attachment: LUCENE-2094.patch

Updated the patch to trunk - Uwe is committing heavily.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 1:57 PM

Post #9 of 10 (384 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2094:
----------------------------------

Affects Version/s: (was: 3.0.1)
(was: 2.9.2)
(was: 2.9.1)
(was: 3.1)
(was: 2.4.2)
(was: 2.4.1)
(was: 2.3.3)
(was: 2.3.2)
(was: 2.3.1)
(was: 2.9)
(was: 2.4)
(was: 2.3)
(was: 2.2)
(was: 2.1)
(was: 2.0.0)
(was: 1.9)

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.



jira at apache

Nov 30, 2009, 2:31 PM

Post #10 of 10 (381 views)
[jira] Updated: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0 [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2094:
--------------------------------

Attachment: LUCENE-2094.patch

Attached is my proposal mentioned in the comments above.

> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
> Key: LUCENE-2094
> URL: https://issues.apache.org/jira/browse/LUCENE-2094
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Simon Willnauer
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet performs lowercasing if created with the corresponding flag. As a result, a String / char[] that contains Unicode 4 characters and is in the set cannot be retrieved in "ignorecase" mode.

