Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

 

 

First page Previous page 1 2 3 4 5 6 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 19, 2009, 8:36 PM

Post #1 of 131 (1294 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780430#action_12780430 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it :)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 8:52 PM

Post #2 of 131 (1247 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780432#action_12780432 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

Mark, some notes on size. jarring up the full source code (no paring) is 81KB.
in practice, the jar file is larger because it contains some 'precompiled DFAs' for certain things like Unicode blocks, XML types... are these really needed?

see here for a list of what I mean: http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html
I enabled these in the patch (they could be easily disabled): an example of how they are used in a regexp is like this: <Arabic>* (match 0 or more arabic characters)

if a user really wanted them, they can load them themselves, you can also create custom ones and use a DataTypesAutomatonProvider to register them for some name:
Example: your users want to be able to use <make> or <model> inside their regexps, you can register <make> and <model> to match to some DFA you make yourself.
its really a nice mechanism, but I don't think we need all the precompiled ones?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 8:56 PM

Post #3 of 131 (1246 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780433#action_12780433 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

Right, i created a FuzzyQuery that builds in the 'naive' method. The problem is that for large strings this exponential-time naive mechanism creates a rather large NFA, and the NFA->DFA conversion is very slow.
Once the DFA is built, actually running it on a term dictionary is fast :)
So the slow part has nothing to do with lucene at all.

So we just need to build these DFAs in an efficient way:
We show how to compute, for any fixed bound n and any input word W , a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:04 PM

Post #4 of 131 (1246 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780439#action_12780439 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow.

but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
I didnt implement this here though yet.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:10 PM

Post #5 of 131 (1247 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780440#action_12780440 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

{code}
By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow.
{code}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to the current WCQ that with this?

{quote}
but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
I didnt implement this here though yet.
{quote}

Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:16 PM

Post #6 of 131 (1247 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780441#action_12780441 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

I don't think there is any issue. by implementing WildcardQuery with the DFA, leading ? is no longer a problem,
i mean depending on your term dictionary if you do something stupid like ???????abacadaba it probably wont be that fast.

I spent a lot of time with the worst-case regex, wildcards to ensure performance is at least as good as the other alternatives.
There is only one exception, the leading * wildcard is a bit slower with a DFA than if you ran it with actual WildcardQuery (less than 5% in my tests)
Because of this, currently this patch rewrites this very special case to a standard WildcardQuery.

bq. Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed order for the leading */.* case, except then this reversing buys you faster wildcard AND regex queries :)


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:22 PM

Post #7 of 131 (1247 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780445#action_12780445 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think - though you've taken care of that anyway - so sounds like gold to me. I didn't expect this to solve leading wildcard issues, so no loss to me.

bq. No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries

bummer :) Does it make sense to implement here though? Isn't the ReverseStringFilter enough if a user wants to go this route? Solr's support for this is fairly good, but I don't think it needs to be as 'built in' for Lucene?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:24 PM

Post #8 of 131 (1246 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780447#action_12780447 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading * wildcards on 100M+ term dictionary.
I found out what was acceptable was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:37 PM

Post #9 of 131 (1245 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780452#action_12780452 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

That is a cool tradeoff to be able to make.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 10:44 PM

Post #10 of 131 (1236 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780471#action_12780471 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. That is a cool tradeoff to be able to make.

Mark, yes. I guess someone could implement the DFA-reversing if they wanted to, to enable leading .* regex support with ReverseStringFilter.
you can still use this Wildcard impl with ReverseStringFilter just like the core Wildcard impl, because its just so easy to reverse a wildcard string.

but you don't want to try to reverse a regular expression! that would be hairy. easier to reverse a DFA.

but even without this, there are tons of workarounds, like the tradeoff i mentioned earlier.
also, another one that might not be apparent is that its only the leading .* that is a problem, depending on corpus of course.

[a-z].*abacadaba will avoid visiting terms that start with 1,2,3 or are in chinese, etc, which might be a nice improvement.
of course if all your terms start with a-z, then its gonna be the same as entering .*abacadaba, and be bad.

all depends on how selective the regular expression is wrt your terms.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 6:45 AM

Post #11 of 131 (1219 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780575#action_12780575 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

Separately, my quickly 'pared' automaton library is now 53KB jar (14 java files, some just simple POJO)
Do you have a target size I should shoot for?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:05 AM

Post #12 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780584#action_12780584 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

bq. I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

bq. Do you have a target size I should shoot for?

As small as possible ;) But I don't personally have any issue adding 53k to the core jar for this goodness. Guess we will have to see what others say - but its a low percentage of the current 1.1 MB, and pretty sweet functionality.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:15 AM

Post #13 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780589#action_12780589 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected 2.9.0 scorer bug? :)

bq. As small as possible

ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:20 AM

Post #14 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780591#action_12780591 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

bq. i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected 2.9.0 scorer bug?

If they caught that, they wouldn't catch another :) How do you want to improve them? I'll help test and write tests - we can make something much more intensive if you'd like, and then just put a flag to tone it down for normal test running.

bq. ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.

Right - don't go further than makes sense - even 53k -> 20k - I don't think it really matters that much. So really I meant, as small as makes reasonable sense.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:28 AM

Post #15 of 131 (1221 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780594#action_12780594 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. How do you want to improve them?

well for one, i test all the rewrite methods and boosts here.
ok these are also now fixed as of 3.0 in core wildcard tests also (LUCENE-1951), but those were two 'buglets', just an example.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:34 AM

Post #16 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780599#action_12780599 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

Point taken - the tests are not perfect. They never are. But it doesn't stop us chugging along :) We can always write more tests and trunk tends to get quite a work out if you put changes in towards the beginning of a dev cycle. Bugs are inevitable, in trunk or contrib. But I don't think the wildcard impl will get much exposure in contrib anyway - its not wired into the queryparser, and it won't come with a sign saying check this out. Users will still use the standard wildcardquery - and I want to see it improved. We can work out the patch, work out the tests, and then decided its not good enough - or perhaps another committer will look and decided that. I'd still love to put it in core myself.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 7:38 AM

Post #17 of 131 (1219 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780602#action_12780602 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Mark, ok. I will supply a new patch with no lib dependency, instead it includes the pared Automaton code in one pkg.
this compiles to about a 48KB jar right now. Reducing it more would involve sacrificing readability or useful stuff.

Example: keeping the "Matcher" would be useful if you want to use this for a really fast 'PatternTokenizer', but not needed here.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 8:28 AM

Post #18 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780629#action_12780629 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

bq. I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?

Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?

bq. but then what should AutomatonRegexQuery be called, we already have RegexQuery?

Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.

bq. Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).

Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.

bq. should we rename the pkg?

I think so - perhaps util.brics? No need for dk I don't think.

bq. sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?

One large patch is fine with me - my IDE will make short work of groking it :)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 8:34 AM

Post #19 of 131 (1218 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780633#action_12780633 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?
We could just remove the .rewrite(). it is only that very special case, for leading *, where the existing WildcardQuery logic is slightly faster (< 5%).
I was actually surprised the wildcardquery logic beats a DFA, i guess something to be said for that hairy logic :)

bq. Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.
I do not know, all regex is not created equal. This one has different syntax and stuff from the other impl's.
Any other ideas? Obviously the name RegexpQuery, with a p, is available

bq. Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.
ok, this is easy, i already have NOTICE in the patch. i was sure all files from brics have their license header also.

bq. I think so - perhaps util.brics? No need for dk I don't think.
o.a.l.util.brics?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 8:50 AM

Post #20 of 131 (1227 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780643#action_12780643 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

bq. We could just remove the .rewrite(). it is only that very special case, for leading *, where the existing WildcardQuery logic is slightly faster (< 5%).

Agreed - not worth the extra code for speeding up such a horrible case by 5%.

bq. Any other ideas?

I'd rather change the contrib names and let core have the good name :) We can start with RegexpQuery I think.

bq. o.a.l.util.brics?

Thats my best thought at the moment.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 9:04 AM

Post #21 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780654#action_12780654 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

OK we have the start of a plan, only one final nit I am worried about.

I pared away the 'built-in named automata':
* example <Lu> (uppercase letter, from Unicode)
* example <QName> (from XML)

if we keep the original pkg name, a user can have these just by adding brics.jar into their path.
They would just pass new DatatypesAutomatonProvider() to the constructor of RegexpQuery, done.

if we rename the pkg, this will not work because the DataTypesAutomatonProvider from the jar file implements dk.brics.automaton.AutomatonProvider,
not o.a.l.util.brics.AutomatonProvider.

alternatively, we could rename the pkg, but I could restore perhaps a subset of these datatypes, maybe without all the xml ones, just the basic unicode categories and stuff?
This would cost a little space though. Here is the list:
http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html#get%28java.lang.String%29

i ask this question because personally i don't use any of these built-ins, but users might want them? what do you think?


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 9:08 AM

Post #22 of 131 (1221 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780655#action_12780655 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

On the way hand I'd say, well lets not rename the package then - its not that important. But these things could get out of sync anyway, so I'm not sure its worth it to try and maintain some sort of compatibility. If these features are useful enough, we could end up adding them later. Your call though. Personally, I'd think we start just by adding the essentials and build from there as makes sense.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 9:14 AM

Post #23 of 131 (1220 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780659#action_12780659 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Mark, ok. In that case I will not include these, and rename the pkg as you suggest.
These default named automata are not enabled by default in the library anyway if you use the RegExp() default constructor.
the (renamed and pared) api is still extensible, if you want to create named automata to use in your regular expressions, you just implement the very simple interface DatatypesAutomatonProvider.
then you pass this to the constructor of RegexpQuery


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 10:12 AM

Post #24 of 131 (1221 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780684#action_12780684 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

bq. Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think

Mark, the old WildcardTermEnum is public, so we must keep it around for a while anyway.
I can use it for this case, so we don't lose this 5% in the special case :)

Might be worth deprecating this old WildcardTermEnum still though, just because its code to be maintained, hardly used except for this purpose.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 11:59 AM

Post #25 of 131 (1215 views)
Permalink
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780738#action_12780738 ]

Mark Miller commented on LUCENE-1606:
-------------------------------------

Nice! Resulting jar is still just 1.0 MB. Looks great on a quick look through. I'll go over more thoroughly when I get a chance.

As far as testing, one of the simple things we can try is generating random wildcard strings against a large corpus and auto comparing the results of the old and the new.

+1 on the automaton name for the util package.

I'd almost still prefer RegexQuery - its contrib vs core and different packages - I hate to lose out on the better name. Though thats a bit subjective ;)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

First page Previous page 1 2 3 4 5 6 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.