Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

 

 



jira at apache

Nov 16, 2009, 1:40 PM

Post #1 of 17 (838 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778559#action_12778559 ]

Simon Willnauer commented on LUCENE-2074:
-----------------------------------------

This might be the wrong place to mention it, but I feel bad about this whole Version Enum. It has become a pest spread all over the code. Lucene code is beginning to look like the C++ Boost library, where you see more preprocessor statements than template code. We should really try hard to find different solutions than spreading Version all over the place.
I know this is a hard problem, but I want to make sure that we do not spread it out into every corner of the code. The Version thing is already annoying enough in Contrib/analyzer.


> Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
> -------------------------------------------------------------------------------
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 is used as matchVersion.
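
The selection described above could look roughly like the following. This is only a minimal sketch, not the attached patch: `MatchVersion`, `Scanner`, `LegacyScanner`, `RegeneratedScanner` and `scannerFor` are hypothetical names invented for illustration, and the two scanner classes merely stand in for the old and the regenerated JFlex output.

```java
public class VersionedScannerSketch {

    /** Hypothetical stand-in for a version enum; declaration order is release order. */
    enum MatchVersion {
        LUCENE_24, LUCENE_29, LUCENE_30;

        /** True if this version is the same as, or newer than, the given one. */
        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    /** Hypothetical common interface over both generated scanners. */
    interface Scanner {
        String describe();
    }

    /** Stands in for the scanner that was generated under Java 1.4. */
    static final class LegacyScanner implements Scanner {
        @Override
        public String describe() {
            return "legacy";
        }
    }

    /** Stands in for the scanner regenerated under Java 5. */
    static final class RegeneratedScanner implements Scanner {
        @Override
        public String describe() {
            return "regenerated";
        }
    }

    /** Picks the scanner once, so the version check stays in a single place. */
    static Scanner scannerFor(MatchVersion matchVersion) {
        return matchVersion.onOrAfter(MatchVersion.LUCENE_30)
                ? new RegeneratedScanner()
                : new LegacyScanner();
    }

    public static void main(String[] args) {
        System.out.println(scannerFor(MatchVersion.LUCENE_29).describe());
        System.out.println(scannerFor(MatchVersion.LUCENE_30).describe());
    }
}
```

Keeping the check inside one factory method means callers that pass an older match version keep the old tokenization, while the rest of the tokenizer code only sees the interface.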

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 16, 2009, 1:42 PM

Post #2 of 17 (813 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778562#action_12778562 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

For this one it's not new, it was there before my patch :-)



jira at apache

Nov 16, 2009, 1:46 PM

Post #3 of 17 (812 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778569#action_12778569 ]

Robert Muir commented on LUCENE-2074:
-------------------------------------

I am anti-Version too in a lot of ways. I worry it will spread everywhere and make things a mess, and maybe we can come up with more creative solutions to get rid of it in the future.

But I think Uwe's patch is ok, Version was actually created with StandardTokenizer in mind I believe...

> Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
> -------------------------------------------------------------------------------
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: jflexwarning.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 is used as matchVersion.



jira at apache

Nov 16, 2009, 1:46 PM

Post #4 of 17 (815 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778570#action_12778570 ]

Simon Willnauer commented on LUCENE-2074:
-----------------------------------------

bq. For this one it's not new, it was there before my patch
This is just the latest issue I found related to that stuff. The version member was there already but the conditional was introduced. Don't get me wrong I just wanna make sure we are not using it as a general purpose conditional! This is going to be a nightmare otherwise. I would only use it if there is NO other way at all.

simon




jira at apache

Nov 16, 2009, 1:48 PM

Post #5 of 17 (812 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778571#action_12778571 ]

Mark Miller commented on LUCENE-2074:
-------------------------------------

{quote} We should really try hard to find different solutions than spreading Version all over the place.
I know this is a hard problem, but I want to make sure that we do not spread it out into every corner of the code. The Version thing is already annoying enough in Contrib/analyzer. {quote}

The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.



jira at apache

Nov 16, 2009, 1:50 PM

Post #6 of 17 (813 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778574#action_12778574 ]

Simon Willnauer commented on LUCENE-2074:
-----------------------------------------

Nothing against the patch! I just used this issue as a channel, and I agree there would have been a better place for it.

simon

> Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
> -------------------------------------------------------------------------------
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.0
>
> Attachments: jflexwarning.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 is used as matchVersion.



jira at apache

Nov 16, 2009, 1:52 PM

Post #7 of 17 (812 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778575#action_12778575 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

I added the warning to my patch! Thanks. What do you think about the patch, should we go that way?



jira at apache

Nov 16, 2009, 1:57 PM

Post #8 of 17 (814 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778579#action_12778579 ]

Robert Muir commented on LUCENE-2074:
-------------------------------------

Uwe, also, just checking: I don't know JavaCC at all. Does it use Unicode properties? We have a lot of query parsers out there...



jira at apache

Nov 16, 2009, 1:59 PM

Post #9 of 17 (814 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778580#action_12778580 ]

Simon Willnauer commented on LUCENE-2074:
-----------------------------------------

bq. The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.

Afaik in the contrib/analyzers package this is only used because of StandardAnalyzer and StopFilter. It seems like overkill. But again, I should not do any issue hijacking here. The thing is, even if you come up with a better solution, this will most likely stay forever - just like all the IFDEF stuff in Boost :(



jira at apache

Nov 16, 2009, 2:01 PM

Post #10 of 17 (815 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778582#action_12778582 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

bq. Uwe, also, just checking, i don't know javacc at all, does it use unicode properties? We have a lot of queryparsers out there...

I do not know it either :)

The only query parser using JFlex is the new one. And the new one should normally use no Unicode properties. Can you check the JFlex file? All other query parsers use JavaCC.



jira at apache

Nov 16, 2009, 2:03 PM

Post #11 of 17 (814 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778583#action_12778583 ]

Michael McCandless commented on LUCENE-2074:
--------------------------------------------

bq. I feel bad about this whole Version Enum

I think this is simply a sign of 1) Lucene's maturity, and 2) that we
take back compat seriously. I actually think we don't yet use it
enough...

EG, LUCENE-1255 was one nasty bug, that we at first fixed, but then
rolled back, because of the back-compat break. Then it was
rediscovered and opened again, as LUCENE-1542, when we decided it was
nasty enough to just fix it and put an entry in CHANGES that you
hopefully will read.

But it really is a back-compat break, in that apps could quite easily
be relying on the buggy behavior. I think that bug would have been a
good reason to add Version to IW.

Fixing invalid acronyms in StandardAnalyzer, but then leaving it
broken by default, was the original "inspiration" for Version. We
shouldn't ever fix a bug, but then be forced to leave the bug in
place due to back compat.

Version lets us fix bugs, change defaults for the better, etc., w/o
compromising on our back compat policy. It's an important
tool...

bq. The problem is, these are the hard backwards compat situations that it was created for - the whole analyzer package was/is bound to have lots of Version stuff.

Right, I think Version will especially find its way into changes that
alter what's indexed (analyzers, bugs like LUCENE-1255, etc.).



jira at apache

Nov 16, 2009, 2:07 PM

Post #12 of 17 (812 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778585#action_12778585 ]

Robert Muir commented on LUCENE-2074:
-------------------------------------

Well, the WikipediaTokenizer at least is similar to StandardTokenizer, except that it does not use Unicode properties; it uses hardcoded ranges instead.

So the behavior won't change from 2.9.x, but it won't be Unicode 4 either. I don't know if we should worry about this?



jira at apache

Nov 16, 2009, 2:17 PM

Post #13 of 17 (816 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778598#action_12778598 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

It uses hardcoded char ranges, so the parser is not JDK-dependent. Let's keep it as it is for now. MediaWiki itself is not Unicode-conformant, because it's written in PHP, and PHP only gets Unicode in 6.0 *lol* (says a PHP core committer named U.S. from Bremen in Germany...)
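
To illustrate the difference between Unicode properties and hardcoded ranges, here is a small runtime analogy in plain Java. It is a sketch under assumptions, not code from the WikipediaTokenizer grammar: the thread is about which JDK runs JFlex when the scanner is generated, whereas this example only shows the same contrast at runtime, and the ranges in `isLetterByRange` are hypothetical and deliberately narrow.

```java
public class LetterCheckSketch {

    /**
     * JDK-dependent: Character.isLetter consults the Unicode tables that ship
     * with the running Java version, so results can differ between JDKs for
     * characters added in later Unicode releases.
     */
    static boolean isLetterByProperty(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /**
     * JDK-independent: hypothetical hardcoded ranges covering only Basic Latin
     * letters and the Latin-1 letters (excluding the multiplication and
     * division signs). The answer never changes with the JDK, but anything
     * outside the table is rejected.
     */
    static boolean isLetterByRange(int codePoint) {
        return (codePoint >= 'A' && codePoint <= 'Z')
            || (codePoint >= 'a' && codePoint <= 'z')
            || (codePoint >= 0x00C0 && codePoint <= 0x00FF
                && codePoint != 0x00D7 && codePoint != 0x00F7);
    }

    public static void main(String[] args) {
        int cyrillicZhe = 0x0416; // a letter that lies outside the hardcoded table
        System.out.println(isLetterByProperty(cyrillicZhe));
        System.out.println(isLetterByRange(cyrillicZhe));
    }
}
```

The trade-off matches the one discussed here: hardcoded ranges give stable behavior across Java versions, at the cost of not tracking newer Unicode data.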



jira at apache

Nov 16, 2009, 2:41 PM

Post #14 of 17 (787 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778614#action_12778614 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

Should we fix this for 3.0 or not?
The current JFlex file in trunk/lucene_30 is generated by Java 1.4 (I verified), so it does not break. So we could wait for 3.1 and provide a new StandardTokenizer with Unicode 5 support there.




jira at apache

Nov 16, 2009, 2:45 PM

Post #15 of 17 (787 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778615#action_12778615 ]

Robert Muir commented on LUCENE-2074:
-------------------------------------

Uwe, we could fix this in 3.1 (but I think we should commit the warning no matter what!)

If we commit for 3.0, then it's still not really correct for Unicode 4.
In my opinion, it would be better to wait for 3.1 and use this interface you built, along with a new version of JFlex for much better Unicode support?




jira at apache

Nov 17, 2009, 2:06 AM

Post #16 of 17 (775 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778799#action_12778799 ]

Uwe Schindler commented on LUCENE-2074:
---------------------------------------

Do we agree that this patch should wait for 3.1? The JFlex parser is already OK and backwards compatible in 3.0, so there is no need to do anything. In 3.1, together with the other Unicode changes, we will update StandardTokenizer with Version.LUCENE_31?

> Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
> -------------------------------------------------------------------------------
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.0
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.0, 3.1
>
> Attachments: jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion.



jira at apache

Nov 17, 2009, 5:20 AM

Post #17 of 17 (765 views)
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778858#action_12778858 ]

Mark Miller commented on LUCENE-2074:
-------------------------------------

+1 here

