
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language

 

 



jira at apache

Dec 1, 2009, 12:41 PM

Post #1 of 41 (565 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784407#action_12784407 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Hi Ahmet, this patch is looking very nice, thank you!

I have some minor suggestions:
* Can we use hex notation (and maybe constants too) for the special case?
* You can use assertTokenStreamContents here (it is in the base test case) to simplify your test; it works like assertAnalyzesTo, but on a TokenStream.

I will let others comment on where this belongs (maybe contrib?)
Wherever it is, I would like to use it in snowball contrib also.
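
As a rough illustration of the first suggestion, the special cases could be spelled out as named hex constants. This is only a sketch and not the code from the attached patch; the class name `TurkishCaseSketch` and its method are made up for the example, and the input is assumed to be in NFC.

```java
final class TurkishCaseSketch {
    // Named constants in hex notation for the Turkish special cases.
    static final int LATIN_CAPITAL_LETTER_I = 0x0049;                // 'I'
    static final int LATIN_CAPITAL_LETTER_I_WITH_DOT_ABOVE = 0x0130; // dotted capital I
    static final int LATIN_SMALL_LETTER_I = 0x0069;                  // 'i'
    static final int LATIN_SMALL_LETTER_DOTLESS_I = 0x0131;          // dotless small i

    /** Lowercases one code point, applying Turkish rules to the two capital I letters. */
    static int toLowerCase(int codePoint) {
        switch (codePoint) {
            case LATIN_CAPITAL_LETTER_I:
                return LATIN_SMALL_LETTER_DOTLESS_I;
            case LATIN_CAPITAL_LETTER_I_WITH_DOT_ABOVE:
                return LATIN_SMALL_LETTER_I;
            default:
                // Everything else falls back to the JDK's simple case mapping.
                return Character.toLowerCase(codePoint);
        }
    }
}
```

Keeping the code points in constants makes the two special cases readable without relying on how the source file is encoded.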


> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Priority: Minor
> Attachments: LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I (U+0131).
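
The behaviour described in the issue can be reproduced with the plain JDK. The following is a small sketch, not part of the patch; `TurkishIDemo` and its method names are invented for the illustration. It contrasts locale-independent, per-character lowercasing with the JDK's Turkish locale rules.

```java
import java.util.Locale;

final class TurkishIDemo {
    /** Locale-independent lowercasing, one char at a time, via Character.toLowerCase. */
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    /** Lowercasing with the JDK's rules for the Turkish locale. */
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }
}
```

With these helpers, `lowerPerChar("ISPARTA")` yields "isparta", while `lowerTurkish("ISPARTA")` yields the Turkish form that starts with U+0131.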

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 1, 2009, 12:47 PM

Post #2 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784410#action_12784410 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

Looks cool, it even uses the new CharacterUtils API.
+1 for using assertTokenStreamContents.

> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I (U+0131).



jira at apache

Dec 1, 2009, 12:51 PM

Post #3 of 41 (544 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784414#action_12784414 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

I have one comment: this will not work correctly on text that is not in NFC. Uppercase I with dot can be represented as \u0130 (as you handle it), but also in decomposed form as \u0049 + \u0307. Technically there can also be other characters in between...

After finding a regular I (\u0049), we could search ahead for COMBINING DOT ABOVE (ignoring any nonspacing marks, format characters and such along the way) and handle this case differently.

But non-NFC text doesn't work correctly throughout most of Lucene's analysis components as it is now anyway, so I don't think we should worry about it right now. Maybe we could add a comment for the future, though.
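
The look-ahead idea could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch and not a complete implementation of the Unicode rules: it skips nonspacing marks and format characters by general category only and ignores combining classes. The class and method names are hypothetical.

```java
final class DecomposedDotSketch {
    static final int CAPITAL_I = 0x0049;
    static final int CAPITAL_I_WITH_DOT = 0x0130;
    static final int DOTLESS_I = 0x0131;
    static final int COMBINING_DOT_ABOVE = 0x0307;

    /** Turkish-style lowercasing that also accepts the decomposed form I + U+0307. */
    static String toLowerCase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == CAPITAL_I_WITH_DOT) {
                out.append('i');
            } else if (cp == CAPITAL_I) {
                int dotAt = findDotAbove(s, i);
                if (dotAt < 0) {
                    out.append((char) DOTLESS_I);  // plain I becomes dotless i
                } else {
                    out.append('i');               // I + dot above becomes i
                    out.append(s, i, dotAt);       // keep any marks in between
                    i = dotAt + 1;                 // drop U+0307 (a single char)
                }
            } else {
                out.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        return out.toString();
    }

    /** Index of a following U+0307, skipping marks and format chars; -1 if none. */
    static int findDotAbove(String s, int from) {
        int j = from;
        while (j < s.length()) {
            int cp = s.codePointAt(j);
            if (cp == COMBINING_DOT_ABOVE) {
                return j;
            }
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK && type != Character.FORMAT) {
                return -1;
            }
            j += Character.charCount(cp);
        }
        return -1;
    }
}
```

The scan stops at the first character that is neither a nonspacing mark nor a format character, so base letters after the I are never consumed by the look-ahead.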




jira at apache

Dec 1, 2009, 12:57 PM

Post #4 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784418#action_12784418 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

As this is a new lowercase filter, shouldn't the default matchVersion be removed completely? Other filters have deprecated the no-matchVersion variant, and this one does too. A new class should not have deprecated parts, so it should be removed.



jira at apache

Dec 1, 2009, 1:01 PM

Post #5 of 41 (543 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784421#action_12784421 ]

DM Smith commented on LUCENE-2102:
----------------------------------

bq. but non-NFC text doesn't work correctly throughout most of lucene's analysis components as it is now anyway, so I don't think we should worry about it right now. Maybe we could add a comment for the future though.

It might be good to note the NFC (NFKC?) requirement in the JavaDoc.

Maybe it's just me, but I think it is critical to normalize the input to Lucene for both indexing and searching. Unless an NFCNormalizingFilter is added to Lucene, I think it is the responsibility of the caller.
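
For callers who take on that responsibility, the JDK's `java.text.Normalizer` can bring text into NFC before it reaches the analyzer. A minimal sketch follows; the wrapper class `NfcBeforeAnalysis` is a hypothetical name used only here.

```java
import java.text.Normalizer;

final class NfcBeforeAnalysis {
    /** Returns the NFC form of the text, so I + U+0307 arrives as the single char U+0130. */
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

The same call would have to be applied to both the indexed text and the query text for the two sides to match.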



jira at apache

Dec 1, 2009, 1:05 PM

Post #6 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784424#action_12784424 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

bq. Maybe it's just me, but I think it is critical to normalize the input to Lucene for both indexing and searching. Unless an NFCNormalizingFilter is added to Lucene, I think it is the responsibility of the caller.

Yeah, I think it's critical too.

bq. It might be good to note the NFC (NFKC?) requirement in the JavaDoc.

Yeah, or maybe just a hint in the comments (because this is an exceptionally tricky case).
This same problem also applies to ASCIIFoldingFilter, pretty much all of the analyzers, etc.




jira at apache

Dec 1, 2009, 1:05 PM

Post #7 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784423#action_12784423 ]

DM Smith commented on LUCENE-2102:
----------------------------------

For new classes, would it be helpful to add @since to the class JavaDoc?



jira at apache

Dec 1, 2009, 1:09 PM

Post #8 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784428#action_12784428 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

bq. Unless a NFCNormalizingFilter is added to Lucene, I think it is the responsibility of the caller.

btw DM, if you are interested, I inserted a long discussion about unicode normalization and how it interacts with Lucene tokenstreams in general in the javadoc header of ICUNormalizationFilter for LUCENE-1488. (please comment over there if you have suggestions or thoughts on it)




jira at apache

Dec 1, 2009, 1:21 PM

Post #9 of 41 (543 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784437#action_12784437 ]

Simon Willnauer commented on LUCENE-2102:
-----------------------------------------

There is no need to use CharacterUtils here. You can use Character.codePointAt() directly. This is a new class and does not need to preserve any backwards compatibility. I agree with Uwe, the Version should go away in this patch.

One more thing: this patch seems to be in core. I do not see any reason why it should be in core. We should move it to contrib, as it serves such a specific use case.
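
Using `Character.codePointAt()` directly on a term buffer could look roughly like this. It is a sketch with a hypothetical class name, not the patch itself, and it assumes the lowercase mapping of a code point occupies the same number of chars as the original, so the buffer can be rewritten in place.

```java
final class CodePointLowerCase {
    /** Lowercases the first 'length' chars of the buffer in place, one code point at a time. */
    static void toLowerCase(char[] buffer, int length) {
        for (int i = 0; i < length; ) {
            // Reads a full code point, including surrogate pairs.
            int codePoint = Character.codePointAt(buffer, i, length);
            // toChars writes the result at index i and returns the number of chars written.
            i += Character.toChars(Character.toLowerCase(codePoint), buffer, i);
        }
    }
}
```

Working on code points rather than single chars means supplementary characters are lowercased as a unit instead of being split into two surrogates.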





jira at apache

Dec 1, 2009, 1:23 PM

Post #10 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784438#action_12784438 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Simon, I would rather see this in contrib also.

Would there be opposition to making contrib/snowball depend upon contrib/analyzers so the SnowballAnalyzer can use this filter instead of lowercase filter for the Turkish case? (based upon Version, of course)?




jira at apache

Dec 1, 2009, 1:31 PM

Post #11 of 41 (543 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784445#action_12784445 ]

Simon Willnauer commented on LUCENE-2102:
-----------------------------------------

bq. Would there be opposition to making contrib/snowball depend upon contrib/analyzers so the SnowballAnalyzer can use this filter instead of lowercase filter for the Turkish case? (based upon Version, of course)?

I think we can arrange something like that. Since we factored out Smart-cn, the jar has a reasonable size, so this won't be an issue. Maybe we should think about moving snowball into analyzers/snowball, just an idea.
Anyway, this is somewhat unrelated to this particular patch, but still worth considering.



jira at apache

Dec 1, 2009, 1:33 PM

Post #12 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784447#action_12784447 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

I don't think it's really unrelated; I think it's a consideration towards where we put this.

The Turkish analyzer happens to be in contrib/snowball, and that's what really needs this for Turkish search. (Although I agree this filter could be useful on its own.)



jira at apache

Dec 1, 2009, 2:07 PM

Post #13 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784467#action_12784467 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Ahmet, hi, I think you might have accidentally left the old (duplicate) test in there, the one that does not use assertTokenStreamContents?


> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I (U+0131).



jira at apache

Dec 1, 2009, 2:11 PM

Post #14 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784469#action_12784469 ]

Ahmet Arslan commented on LUCENE-2102:
--------------------------------------

I kept the old test method and added a new one. Should I remove the old one?



jira at apache

Dec 1, 2009, 2:17 PM

Post #15 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784471#action_12784471 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Ahmet, I think so. They both test the same functionality, but the second test is less code and, in my opinion, better. assertTokenStreamContents does some additional checks: it clears attributes in between, it calls .end(), things like that.




jira at apache

Dec 1, 2009, 2:19 PM

Post #16 of 41 (542 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784472#action_12784472 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

One other possibility is to resolve the problem in a completely different way: you could wrap a MappingCharFilter around the input reader in the Analyzer and just add a replacement for this one char:
[http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html]

This would be a very easy fix without code duplication. You just change the input before tokenization. And it's already in Lucene core; just plug it into the analyzer's tokenStream() or reusableTokenStream() method as a wrapper around the Reader param.

This would also be very easy for the other analyzers that have problems with rare chars. It can also be used to remove chars entirely or to replace them with longer sequences like ä -> ae (for German).
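
The general idea of rewriting characters in the Reader before tokenization can be sketched with the plain JDK. The class below is a hypothetical stand-in built on `java.io.FilterReader`, not Lucene's MappingCharFilter; it maps a single char to another and does no offset correction.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class CharMappingReader extends FilterReader {
    private final char from;
    private final char to;

    CharMappingReader(Reader in, char from, char to) {
        super(in);
        this.from = from;
        this.to = to;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == from ? to : c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (cbuf[i] == from) {
                cbuf[i] = to;
            }
        }
        return n;
    }

    /** Convenience helper: runs the text through the mapping reader and collects the result. */
    static String map(String text, char from, char to) {
        try (Reader reader = new CharMappingReader(new StringReader(text), from, to)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[64];
            for (int n; (n = reader.read(buf, 0, buf.length)) != -1; ) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `CharMappingReader.map("ISPARTA", 'I', '\u0131')` replaces the capital I with U+0131 and leaves the remaining letters for a regular lowercasing step.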

> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I (U+0131).



jira at apache

Dec 1, 2009, 2:23 PM

Post #17 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784476#action_12784476 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

bq. One other possibility is to resolve the problem in a completely different way: you could wrap a MappingCharFilter around the input reader in the Analyzer and just add a replacement for this one char:

Uwe, but this is inflexible. If we want to make this filter support Turkish lowercasing in the future for all of Unicode, not just the NFC composed form, we cannot do it with MappingCharFilter. Again, I don't think we should fix this now, but in the future I think we might want to.



jira at apache

Dec 1, 2009, 2:27 PM

Post #18 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784478#action_12784478 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

The patch's TurkishLowerCaseFilter is just as inflexible as that. The idea is just a replacement for the current patch (and it is even a little bit more universal, because you can change the chars to map).



jira at apache

Dec 1, 2009, 2:29 PM

Post #19 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784479#action_12784479 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

bq. test that does not use assertTokenStreamContents is removed.

Thanks Ahmet, in my opinion this is good, we just have to figure out where to place it.

My vote is for contrib/analyzers/common/tr for now.



jira at apache

Dec 1, 2009, 2:31 PM

Post #20 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784480#action_12784480 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

bq. The patch's TurkishLowerCaseFilter is just as inflexible as that. The idea is just a replacement for the current patch (and it is even a little bit more universal, because you can change the chars to map).

Uwe, this is not true. With a TokenFilter, I can use Version to apply the logic I mentioned above:
bq. after finding a regular I (\u0049) we could search ahead for COMBINING DOT ABOVE (ignoring any nonspacing marks and format and such along the way), and handle this differently.

You cannot do this with MappingCharFilter, or rather, you could, but there would be millions of mappings for this one character. I could later patch this filter with Version and some lookahead based on Unicode properties if I wanted to improve it.

> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i' however in Turkish alphabet lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.



jira at apache

Dec 1, 2009, 2:35 PM

Post #21 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784483#action_12784483 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

If I replace this code from Ahmet's test

{code}
public class TestTurkishLowerCaseFilter extends BaseTokenStreamTestCase {

  public void testTurkishLowerCaseFilter() throws Exception {
    TokenStream stream = new WhitespaceTokenizer(
        new StringReader("\u0130STANBUL \u0130ZM\u0130R ISPARTA"));
    TurkishLowerCaseFilter filter = new TurkishLowerCaseFilter(Version.LUCENE_30, stream);
    assertTokenStreamContents(filter, new String[] {"istanbul", "izmir", "\u0131sparta",});
  }

}
{code}

with the following, not even a new class or anything else is needed:

{code}
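import java.io.StringReader;

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class TestTurkishLowerCaseFilter extends BaseTokenStreamTestCase {

  // Map LATIN CAPITAL LETTER I (U+0049) to LATIN SMALL LETTER DOTLESS I (U+0131)
  // before tokenizing, so the stock LowerCaseFilter never sees a plain 'I'.
  static final NormalizeCharMap map = new NormalizeCharMap();
  static {
    map.add("\u0049", "\u0131");
  }

  public void testTurkishLowerCaseFilter() throws Exception {
    TokenStream stream = new WhitespaceTokenizer(
        new MappingCharFilter(map,
            new StringReader("\u0130STANBUL \u0130ZM\u0130R ISPARTA")));
    // The stock LowerCaseFilter handles everything else.
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_30, stream);
    assertTokenStreamContents(filter, new String[] {"istanbul", "izmir", "\u0131sparta",});
  }

}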
{code}

It just works.



jira at apache

Dec 1, 2009, 2:39 PM

Post #22 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784485#action_12784485 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

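Uwe, I don't think you understand what I am saying.

If my text is instead I\u0307STANBUL (decomposed: a plain I followed by COMBINING DOT ABOVE, U+0307) versus your \u0130STANBUL (the precomposed U+0130), it will not work.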
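
The two spellings render identically but differ in their code points, which is easy to check with the JDK alone. The sketch below is only an illustration under stated assumptions: the class `DecomposedDottedI` and its helpers are hypothetical, and `mapThenLowerCase` is a plain-Java stand-in for "one-character mapping, then per-character lowercasing", not Lucene's MappingCharFilter or LowerCaseFilter.

```java
import java.text.Normalizer;

class DecomposedDottedI {

    /** Formats each UTF-16 unit of s as U+XXXX, so invisible differences show up. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Stand-in for a single-character I mapping followed by per-char lowercasing. */
    static String mapThenLowerCase(String s) {
        String mapped = s.replace('I', '\u0131'); // the one-character mapping
        StringBuilder sb = new StringBuilder(mapped.length());
        for (int i = 0; i < mapped.length(); i++) {
            sb.append(Character.toLowerCase(mapped.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u0130STANBUL";
        // NFD splits U+0130 into a plain I plus COMBINING DOT ABOVE.
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(codeUnits(precomposed.substring(0, 1)));
        System.out.println(codeUnits(decomposed.substring(0, 2)));

        // The precomposed input lowercases cleanly; the decomposed input ends
        // up as dotless i followed by a stray combining dot.
        System.out.println(codeUnits(mapThenLowerCase(precomposed)));
        System.out.println(codeUnits(mapThenLowerCase(decomposed)));
    }
}
```

With the decomposed input, the plain I is mapped to dotless ı and the combining dot is left behind, so the resulting token no longer matches "istanbul".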



jira at apache

Dec 1, 2009, 2:41 PM

Post #23 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784487#action_12784487 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

bq. Uwe, this is not true. With a token filter, I can use Version to apply the logic I mentioned above

I am talking about *this* patch, not any later version! I suggest not applying this patch at all and, for now, telling users to use the helper construct above until we have ICU in core or whatever (sorry for the missing \u in the mapping target, which should read "\u0131"; I do not want to edit again...)



jira at apache

Dec 1, 2009, 2:43 PM

Post #24 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784489#action_12784489 ]

Robert Muir commented on LUCENE-2102:
-------------------------------------

Uwe, I am talking about this patch too. It is simple and can be extended in the future to handle such things.
Your MappingCharFilter approach cannot, and I don't see us ever having ICU in core, even though I would love such a thing.

Additionally, it will make it easier to fix SnowballAnalyzer, which is currently *broken for the Turkish language* because it uses the wrong lowercasing.
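
For reference, the lowercasing mismatch itself can be reproduced with the JDK alone. This is a small sketch, assuming the JDK's Turkish locale rules as the reference behaviour; the class `TurkishCaseDemo` and its helpers are hypothetical names, and `lowerPerChar` only imitates a locale-independent, per-character lowercase step.

```java
import java.util.Locale;

class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /** Locale-independent lowercasing, one char at a time. */
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    /** Lowercasing with the JDK's Turkish locale rules. */
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // Per-char lowercasing turns I into a dotted i.
        System.out.println(lowerPerChar("ISPARTA"));
        // Turkish rules turn I into dotless i (U+0131).
        System.out.println(lowerTurkish("ISPARTA"));
    }
}
```

A stemmer that expects Turkish-lowercased input would therefore see a different first letter for any word starting with a plain I.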




jira at apache

Dec 1, 2009, 2:53 PM

Post #25 of 41 (541 views)
Permalink
[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784496#action_12784496 ]

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

Robert: I understand your problem, but it affects LowerCaseFilter in general and is not specific to the Turkish lowercase filter. If you have decomposed characters, even LowerCaseFilter would fail for *all* languages (even German, if you compose ä out of a and two dots). In Germany, really nobody composes characters this way; I do not know how this is in Turkey, but the last time I was there, they just used the simplest precomposed chars (like Germans do). And for that, this filter works and is a quick fix.

But I give up now.

