
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api

 

 



jira at apache

Nov 22, 2009, 1:33 PM

Post #1 of 24 (1014 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781222#action_12781222 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Spinoff from LUCENE-1606.

> convert automaton to char[] based processing and TermRef / TermsEnum api
> ------------------------------------------------------------------------
>
> Key: LUCENE-2090
> URL: https://issues.apache.org/jira/browse/LUCENE-2090
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
>
> The automaton processing is currently done with String, mostly because TermEnum is based on String.
> It is easy to change the processing to work with char[], since this is what is used behind the scenes anyway.
> In general, I think we should make sure char[]-based processing is exposed in the automaton package anyway, for things like pattern-based tokenizers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 22, 2009, 2:02 PM

Post #2 of 24 (1000 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781233#action_12781233 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Michael, here is one idea that isn't too crazy.

Separately, I think we should make it convenient for an MTQ (MultiTermQuery) to get a char[]; this should not change.

However, let's consider this:
{code}
/**
 * Returns true if the given string is accepted by this automaton.
 */
public boolean run(String s) {
  int p = initial;
  int l = s.length();
  for (int i = 0; i < l; i++) {
    p = step(p, s.charAt(i));
    if (p == -1) return false;
  }
  return accept[p];
}
{code}

Checking a string is really just stepping through it one char at a time.
Would 'incremental, one char at a time' conversion actually help, or do you think it would just be slower?

Conceptually, this isn't much different from using a Reader with Java I/O, at a much smaller scale.
I am not familiar with decoding performance, but I thought I would mention this, just in case there is a way to do it cleanly.
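
As a rough illustration of the char[]-based variant being discussed, the sketch below runs the same stepping loop over a slice of a char[] rather than a String. The class `CharArrayRunSketch`, its ASCII-only transition table and the `endsWithN()` factory are hypothetical names made up for this example; this is not the automaton package's actual API.

```java
/** Toy DFA over ASCII; a hypothetical stand-in, not the real RunAutomaton. */
final class CharArrayRunSketch {
    private final int initial;
    private final boolean[] accept;
    private final int[][] transitions; // transitions[state][c]; -1 means "no transition"

    CharArrayRunSketch(int initial, boolean[] accept, int[][] transitions) {
        this.initial = initial;
        this.accept = accept;
        this.transitions = transitions;
    }

    /** Returns the next state, or -1 if there is no transition (non-ASCII is rejected here). */
    int step(int state, char c) {
        return c < 128 ? transitions[state][c] : -1;
    }

    /** Same loop as run(String), but over a slice of a char[], so no String is needed. */
    boolean run(char[] s, int offset, int length) {
        int p = initial;
        final int end = offset + length;
        for (int i = offset; i < end; i++) {
            p = step(p, s[i]);
            if (p == -1) return false;
        }
        return accept[p];
    }

    /** Builds a two-state DFA for the wildcard *N (any ASCII string ending in 'N'). */
    static CharArrayRunSketch endsWithN() {
        int[][] t = new int[2][128];
        for (int state = 0; state < 2; state++) {
            java.util.Arrays.fill(t[state], 0); // any other char leads back to state 0
            t[state]['N'] = 1;                  // 'N' always leads to the accepting state
        }
        return new CharArrayRunSketch(0, new boolean[] {false, true}, t);
    }
}
```

With this shape, a caller that already holds a reusable char[] buffer can test a term with `dfa.run(buf, offset, length)` without allocating a String per term.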




jira at apache

Nov 22, 2009, 3:29 PM

Post #3 of 24 (987 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781249#action_12781249 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

I changed only the accept(final TermRef term) method from Mike's flex patch of this enum to use char[] instead of String.
I did not modify the "smart" part; it's more complex, but will probably help the ????NNN case.

The results change significantly for the *N case (I used my old benchmark, just because it was already set up in my Eclipse):
||Pattern||Iter||AvgHits||AvgMS (String)||AvgMS (char[])||
|N?N?N?N|10|1000.0|36.2|34.9|
|?NNNNNN|10|10.0|4.9|5.1|
|??NNNNN|10|100.0|8.0|11.5|
|???NNNN|10|1000.0|35.4|34.0|
|????NNN|10|10000.0|250.9|230.9|
|NN??NNN|10|100.0|9.1|5.0|
|NN?N*|10|10000.0|8.3|7.5|
|?NN*|10|100000.0|63.5|28.7|
|*N|10|1000000.0|3027.8|1922.7|
|NNNNN??|10|100.0|3.7|3.7|



jira at apache

Nov 23, 2009, 1:30 AM

Post #4 of 24 (981 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781344#action_12781344 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. would 'incremental, one char at a time' conversion actually help, or do you think it would just be slower?

I like this idea! Is it worth exploring a Reader-like interface from UnicodeUtil? Is this a hotspot in automaton's processing? I.e., could we save much conversion by only doing it on demand?
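
To make the "on demand" idea concrete, here is a minimal sketch that decodes UTF-8 one code point at a time and feeds each resulting UTF-16 char to a DFA, stopping at the first dead state so the rest of the term is never decoded. Everything here (`IncrementalDecodeSketch`, the `Stepper` interface) is hypothetical and assumes well-formed UTF-8 input; it is not UnicodeUtil's API, and whether it is actually faster would need profiling.

```java
/** Hypothetical sketch of on-demand UTF-8 decoding while stepping a DFA. */
final class IncrementalDecodeSketch {
    /** Minimal DFA view: returns the next state, or -1 when there is no transition. */
    interface Stepper {
        int step(int state, char c);
    }

    /**
     * Decodes one code point at a time from well-formed UTF-8 and feeds the
     * resulting UTF-16 char(s) to the DFA, returning as soon as it rejects.
     */
    static boolean run(byte[] utf8, int offset, int length,
                       int initial, Stepper dfa, boolean[] accept) {
        int p = initial;
        int i = offset;
        final int end = offset + length;
        while (i < end) {
            int b = utf8[i++] & 0xFF;
            int cp;
            if (b < 0x80) {            // 1-byte sequence (ASCII)
                cp = b;
            } else if (b < 0xE0) {     // 2-byte sequence
                cp = ((b & 0x1F) << 6) | (utf8[i++] & 0x3F);
            } else if (b < 0xF0) {     // 3-byte sequence
                cp = ((b & 0x0F) << 12) | ((utf8[i++] & 0x3F) << 6) | (utf8[i++] & 0x3F);
            } else {                   // 4-byte sequence (supplementary code point)
                cp = ((b & 0x07) << 18) | ((utf8[i++] & 0x3F) << 12)
                        | ((utf8[i++] & 0x3F) << 6) | (utf8[i++] & 0x3F);
            }
            if (cp < 0x10000) {
                p = dfa.step(p, (char) cp);
            } else {
                // Supplementary code points reach the char-based DFA as a surrogate pair.
                p = dfa.step(p, Character.highSurrogate(cp));
                if (p != -1) p = dfa.step(p, Character.lowSurrogate(cp));
            }
            if (p == -1) return false; // remaining bytes are never decoded
        }
        return accept[p];
    }
}
```

For a term that is rejected on its first char, only one code point is decoded; a term that is accepted still pays for decoding every code point, so the saving depends on how early the DFA tends to reject.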



jira at apache

Nov 23, 2009, 3:15 AM

Post #5 of 24 (979 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781365#action_12781365 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Michael, I think I would have to profile things to determine this.
I guess it would be a close one, because strings in the term dictionary are pretty short.
Just an idea; I think moving all the code to char[] first would be the best for starters.



jira at apache

Nov 23, 2009, 3:08 PM

Post #6 of 24 (977 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781694#action_12781694 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Hi Mike, I think an easier win is perhaps to add an endsWith() byte[] comparison in TermRef.
(For now, I can use regular endsWith(), or run the machine backwards, or something like that.)

I can use this in "dumb mode", i.e. *N, where I know the first part of the machine is a loop.
For whatever reason, dumb mode checks "constant prefix" right now, which is useless: it will always be 0 in dumb mode.
Instead I should build "constant suffix" in dumb mode. This would be much more useful for a quick comparison.
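
A byte-level endsWith() of the kind proposed here could look roughly like the sketch below. `ByteSliceSketch` is a hypothetical stand-in with assumed fields (`bytes`, `offset`, `length`); it is not the actual TermRef class from the flex branch, and the two methods deliberately use separate loops rather than sharing code.

```java
/** Hypothetical byte-slice helper illustrating the idea; not the actual TermRef. */
final class ByteSliceSketch {
    final byte[] bytes;
    final int offset;
    final int length;

    ByteSliceSketch(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    /** Convenience factory: wraps the UTF-8 encoding of a String. */
    static ByteSliceSketch of(String s) {
        byte[] b = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        return new ByteSliceSketch(b, 0, b.length);
    }

    /** True if this slice begins with the bytes of other. */
    boolean startsWith(ByteSliceSketch other) {
        if (length < other.length) return false;
        int i = offset;
        int j = other.offset;
        final int end = other.offset + other.length;
        while (j < end) {
            if (bytes[i++] != other.bytes[j++]) return false;
        }
        return true;
    }

    /** True if this slice ends with the bytes of other. */
    boolean endsWith(ByteSliceSketch other) {
        if (length < other.length) return false;
        int i = offset + length - other.length; // first byte of the candidate suffix
        int j = other.offset;
        final int end = other.offset + other.length;
        while (j < end) {
            if (bytes[i++] != other.bytes[j++]) return false;
        }
        return true;
    }
}
```

For a pattern such as *N, the constant suffix would be wrapped once (for example `ByteSliceSketch.of("N")`) and then compared against each enumerated term without any charset conversion.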



jira at apache

Nov 23, 2009, 4:18 PM

Post #7 of 24 (963 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781710#action_12781710 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

That sounds compelling -- you'd still do the full scan, but testing each term is much faster?



jira at apache

Nov 23, 2009, 4:46 PM

Post #8 of 24 (960 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781718#action_12781718 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Right, we could use constant suffix to stay with bytes.
For example, with *N in this test, 90% of the charset conversion of TermRefs disappears, because those terms can be eliminated by comparing bytes.




jira at apache

Nov 24, 2009, 10:14 AM

Post #9 of 24 (944 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782065#action_12782065 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, I implemented this common suffix, but only for dumb mode; it does not help smart mode.
So I got rid of common prefix entirely, as it's useless, and just replaced it.
I also take measures to ensure the suffix is well-formed UTF-8 :)

On my *N trunk tests it's now 5700/5800ms on average versus 6000ms, just using String.endsWith() before checking the DFA.
It's a consistent gain, so I think for really crappy worst-case wildcards and regular expressions,
we have a lot to gain by doing this with bytes, before converting to char[] and running against the DFA.

I guess since TermRef exposes all the bytes, I could implement endsWith myself in AutomatonTermsEnum in the future,
but it seems like it would be a nice complement to startsWith()?
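
The "check the suffix on bytes first, decode only the survivors" flow described above can be sketched as follows. `SuffixPrefilterSketch` and its `Matcher` interface are hypothetical; the matcher stands in for the DFA check, and the `decoded` counter exists only to make the saved conversions visible. This is an illustration of the idea, not the code in AutomatonTermsEnum.

```java
/** Hypothetical sketch: reject terms by suffix bytes before paying for decoding. */
final class SuffixPrefilterSketch {
    /** Stand-in for the DFA check that runs on the decoded term. */
    interface Matcher {
        boolean matches(String term);
    }

    private final byte[] suffix;   // common suffix of the pattern, encoded as UTF-8
    private final Matcher matcher;
    int decoded;                   // number of terms that actually got decoded

    SuffixPrefilterSketch(String commonSuffix, Matcher matcher) {
        this.suffix = commonSuffix.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        this.matcher = matcher;
    }

    boolean accept(byte[] termUtf8) {
        int start = termUtf8.length - suffix.length;
        if (start < 0) return false;
        for (int i = 0; i < suffix.length; i++) {
            // Cheap byte-level rejection: no charset conversion for this term.
            if (termUtf8[start + i] != suffix[i]) return false;
        }
        decoded++;
        String term = new String(termUtf8, java.nio.charset.StandardCharsets.UTF_8);
        return matcher.matches(term);
    }
}
```

The prefilter is only a necessary condition: a term that passes the suffix check still has to be run through the full matcher, so correctness does not depend on the suffix being the longest possible one.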




jira at apache

Nov 24, 2009, 10:36 AM

Post #10 of 24 (946 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782075#action_12782075 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

I guess now you have me starting to think about byte[] contains().
Because the real worst case, which I bet a lot of users do, is not things like *foobar but instead *foobar*!
In UTF-8 you can do such things safely; I would have to extract the "longest common constant sequence" from a DFA.
This might be more generally applicable.

commonSuffix is easy... at least it makes progress for now, even slightly later in trunk.

this could be a later improvement.
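
For illustration only, a byte[] contains() check might look like the naive search below. `BytesContainsSketch` is a hypothetical helper, and extracting the "longest common constant sequence" from a DFA is not shown. The search works at the byte level because a well-formed UTF-8 sequence for one character never matches part of another character's sequence.

```java
/** Hypothetical sketch of a byte-level substring test on UTF-8 terms. */
final class BytesContainsSketch {
    /** Naive search: index of the first occurrence of needle in haystack, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /** Encodes both arguments as UTF-8 and compares bytes only. */
    static boolean contains(String term, String constant) {
        java.nio.charset.Charset utf8 = java.nio.charset.StandardCharsets.UTF_8;
        return indexOf(term.getBytes(utf8), constant.getBytes(utf8)) >= 0;
    }
}
```

In a real enumeration the term would already be available as bytes, so only `indexOf` would run per term; the String-based `contains` wrapper is there just to keep the example easy to call.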




jira at apache

Nov 25, 2009, 2:28 AM

Post #11 of 24 (931 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782353#action_12782353 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Patch looks good, except I think I wouldn't factor startsWith/endsWith to share any code, to save the "+ pos" inside startsWith's loop?

{quote}
*N 1705.7ms avg -> 1195.4ms avg
*NNNNNN 1844.9ms avg -> 1192.3ms avg
{quote}

Whoa -- those are great results!

> Attachments: LUCENE-2090_TermRef_flex.patch


jira at apache

Nov 25, 2009, 4:14 AM

Post #12 of 24 (936 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782384#action_12782384 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. Patch looks good, except, I think I wouldn't factor startsWith/endsWith to share any code, to save the "+ pos" inside startsWith's loop?

Forgive my ignorance, but shouldn't the JRE hoist this constant addition to the array index out of the loop anyway?
I checked, and this is how Harmony etc. implement startsWith/endsWith, even for String...
(I will change it, just curious.)




jira at apache

Nov 25, 2009, 4:58 AM

Post #13 of 24 (932 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782401#action_12782401 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. shouldnt the JRE hoist this constant additive to the array index out anyway?

Maybe?

bq. alternative patch for if you do not trust your compiler

Thanks ;)

> Attachments: LUCENE-2090_TermRef_flex.patch, LUCENE-2090_TermRef_flex2.patch


jira at apache

Nov 25, 2009, 5:06 AM

Post #14 of 24 (934 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782409#action_12782409 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST, at which point AutomatonTermsEnum would be an intersection + walk of two FSTs. Because suffixes are also shared in the FST, you could more easily (and more efficiently) handle *XXX cases as well (it'd just be symmetric with the XXX* cases).



jira at apache

Nov 25, 2009, 5:08 AM

Post #15 of 24 (935 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782412#action_12782412 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

Maybe also make TermRef final in the patch?



jira at apache

Nov 25, 2009, 5:12 AM

Post #16 of 24 (930 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782414#action_12782414 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. BTW, we've discussed someday having a codec whose terms dict (or maybe just terms index) is represented as an FST

this would open up more opportunities.

bq. Maybe also make TermRef final in the patch?

ok



jira at apache

Nov 25, 2009, 8:07 AM

Post #17 of 24 (932 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782476#action_12782476 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. Mike, here TermRef is final also. This doesn't remove any flexibility does it?

I'd actually rather lock it down for now, and then only open up flexibility when/if we get there... patch looks good!

> Attachments: LUCENE-2090_TermRef_flex.patch, LUCENE-2090_TermRef_flex2.patch, LUCENE-2090_TermRef_flex3.patch


jira at apache

Nov 25, 2009, 8:11 AM

Post #18 of 24 (932 views)

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782479#action_12782479 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. I'd actually rather lock it down for now, and then only open up flexibility when/if we get there... patch looks good!

Ok, I will commit it.

Just as a side note (maybe I can add a comment if you need it): the existing startsWith() and the new endsWith() are correct against byte[] for any Unicode encoding form.
However, some other encodings (including alternate encodings someone might flex to) do not have properties such as non-overlap.

If someone were to implement a codec that stores the index in one of those other encodings, they would have to write significantly more complex code that is aware of character boundaries, depending on the properties of that encoding.
Oh yeah, and their sort order would be different, too... (I suppose we should also fix compareTerm here for UTF-16 ordering at some point?)
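
To make the non-overlap point concrete, here is a minimal sketch of byte-level prefix and suffix checks. It is not the TermRef code from the patch: the class and method names are made up for illustration, and the Shift_JIS bytes are only an example of an encoding in which a trail byte can coincide with a single-byte character.

```java
import java.nio.charset.StandardCharsets;

public class BytePrefixSuffix {
    /** True if term begins with exactly the bytes of prefix. */
    static boolean startsWith(byte[] term, byte[] prefix) {
        if (prefix.length > term.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (term[i] != prefix[i]) return false;
        }
        return true;
    }

    /** True if term ends with exactly the bytes of suffix. */
    static boolean endsWith(byte[] term, byte[] suffix) {
        int offset = term.length - suffix.length;
        if (offset < 0) return false;
        for (int i = 0; i < suffix.length; i++) {
            if (term[offset + i] != suffix[i]) return false;
        }
        return true;
    }

    /** Helper (hypothetical): encode a String as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8: lead bytes and continuation bytes use disjoint ranges,
        // so a byte-level match of a well-formed suffix is a real match.
        System.out.println(endsWith(utf8("caf\u00e9"), utf8("\u00e9"))); // true
        System.out.println(endsWith(utf8("caf\u00e9"), utf8("e")));      // false

        // Shift_JIS (illustrative): katakana "so" is encoded as 0x83 0x5C,
        // and 0x5C on its own is the backslash. The byte-level check
        // reports a suffix match although the text does not end in one.
        byte[] sjisSo = {(byte) 0x83, 0x5C};
        byte[] sjisBackslash = {0x5C};
        System.out.println(endsWith(sjisSo, sjisBackslash));             // true (false positive)
    }
}
```

The last case is the kind of false positive a codec for such an encoding would have to rule out by tracking character boundaries.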




jira at apache

Nov 25, 2009, 8:15 AM

Post #19 of 24 (931 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782482#action_12782482 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. I suppose we should also fix compareTerm here for UTF-16 ordering at some point?

Yes... I'm [slowly] working towards that.
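
For readers unfamiliar with the ordering issue: comparing terms as unsigned UTF-8 bytes gives code point order, while comparing Java chars gives UTF-16 code unit order. The two disagree only when a supplementary character meets a BMP character in the range U+E000 to U+FFFF. The sketch below shows one such pair; the class and method names are hypothetical and are not Lucene's compareTerm.

```java
import java.nio.charset.StandardCharsets;

public class TermOrderDemo {
    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    /** Compares two strings by their UTF-8 encodings (code point order). */
    static int compareUtf8(String a, String b) {
        return compareUnsignedBytes(a.getBytes(StandardCharsets.UTF_8),
                                    b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";        // U+FF61, one UTF-16 code unit, UTF-8 bytes EF BD A1
        String supp = "\uD800\uDC00"; // U+10000, a surrogate pair, UTF-8 bytes F0 90 80 80

        // UTF-16 code unit order: 0xFF61 > 0xD800, so bmp sorts after supp.
        System.out.println(Integer.signum(bmp.compareTo(supp)));     // 1
        // UTF-8 byte order: 0xEF < 0xF0, so bmp sorts before supp.
        System.out.println(Integer.signum(compareUtf8(bmp, supp)));  // -1
    }
}
```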



jira at apache

Nov 25, 2009, 8:45 AM

Post #20 of 24 (932 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782495#action_12782495 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

bq. Yes... I'm [slowly] working towards that.

Glad it is you working on it instead of me. If I wrote it, it would be very slow.

Committed revision 884190 for TermRef




jira at apache

Nov 25, 2009, 8:57 AM

Post #21 of 24 (932 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782503#action_12782503 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, by the way, looking at this code, I don't see a way to expose the UnicodeUtil / char[] functionality in a clean way via TermRef/FilteredTermsEnum.

Now that I see that most of the other enums survive with TermRef alone and don't need it, and that it's handy to have multiple TermRefs around in the same enum, I guess it doesn't make sense.

Also, I guess people in general aren't writing MultiTermQueries every day, so I think this is OK?
The rest of this issue should only involve the automaton code itself...




jira at apache

Nov 25, 2009, 11:17 AM

Post #22 of 24 (927 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782565#action_12782565 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

OK I'll first focus on making sure DW flushes in UTF-16 sort order...



jira at apache

Dec 7, 2009, 2:30 PM

Post #23 of 24 (526 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787175#action_12787175 ]

Robert Muir commented on LUCENE-2090:
-------------------------------------

Mike, I converted this to char[] api (see LUCENE-1606 for the patch).

In order for this to work, I needed to expose UnicodeUtil.nextUTF16ValidString(UTF16Result).
The code is not duplicated: the String-based method is just a wrapper for this. Take a look if you get a chance.
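
The "wrapper" arrangement can be sketched roughly as follows. This does not reproduce UnicodeUtil or UTF16Result: the class, the method name isWellFormed, and the surrogate check it performs are stand-ins chosen only to show a char[]-based core with a thin String overload on top.

```java
public class SurrogateCheck {
    /**
     * Core routine: works directly on a slice of a char[].
     * Returns true if the slice contains no unpaired surrogates.
     */
    static boolean isWellFormed(char[] text, int offset, int length) {
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            char c = text[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < end && Character.isLowSurrogate(text[i + 1])) {
                    i++; // skip the low half of a valid pair
                } else {
                    return false; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no high surrogate before it
            }
        }
        return true;
    }

    /** String overload: only converts and delegates, so the logic lives in one place. */
    static boolean isWellFormed(String s) {
        char[] chars = s.toCharArray();
        return isWellFormed(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("abc"));          // true
        System.out.println(isWellFormed("\uD800\uDC00")); // true
        System.out.println(isWellFormed("\uD800"));       // false
    }
}
```

Callers that already hold a char[] buffer can use the core method directly and avoid creating a String per term.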




jira at apache

Dec 8, 2009, 2:14 AM

Post #24 of 24 (522 views)
[jira] Commented: (LUCENE-2090) convert automaton to char[] based processing and TermRef / TermsEnum api [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787389#action_12787389 ]

Michael McCandless commented on LUCENE-2090:
--------------------------------------------

bq. Mike, I converted this to char[] api

Nice!

bq. the other thing I forgot, I think TermRef.copy(UTF8Result) would be handy... is there anywhere you could use this too?

That sounds reasonable -- maybe just add it? Or... we could also deprecate UTF8Result, entirely, replacing it w/ TermRef...? Hmmm.

