Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

 

 

First page Previous page 1 2 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 20, 2009, 8:18 AM

Post #1 of 41 (1319 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606_nodep.patch

attached is an alternate patch with no library dependency (LUCENE-1606_nodep.patch)
instead it imports 'pared-down' automaton source code (compiles to 48KB jar)
it is still setup in contrib regex because...

Mark: some practical questions, I'd like to create a patch that integrates it nicely into core, just so we can see what it would look like.
Thoughts on class names and pkg names?
* I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?
* but then what should AutomatonRegexQuery be called, we already have RegexQuery :)

Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).
should we rename the pkg?

sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 11:21 AM

Post #2 of 41 (1281 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

Mark, I think this patch is ok, all tests pass etc.
Can you take a look and let me know your thoughts?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 4:57 PM

Post #3 of 41 (1279 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606.patch

Hi Robert,

here is my patch. The WildCard and RegExp test querys still pass. I also added a test for the deprec TermEnum (just a simple MTQ that returns it is used and should produce same results as WildcardQuery).

The AutomatonTermEnum now switches between smart(R) and non-smart mode using your detection algorithm. termCompare now handles both cases.
next() just calls super in the linear case (so it behaves like a normal FilteredTermEnum) and uses the smart(R) code in all other cases.

I will go to bed now, tell me if you like it.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 4:26 AM

Post #4 of 41 (1265 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606.patch

Some cleanups and a more consistent endEnum handling. Also added Javadocs explaining smart and linear mode.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 4:40 AM

Post #5 of 41 (1266 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606.patch

Again some updates, moved the '*' and '?' constants also to WildcardQuery and use them in switch. Also added better deprecation messeg to WildcardTermEnum.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 6:18 AM

Post #6 of 41 (1268 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Component/s: (was: contrib/*)
Search

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 7:23 AM

Post #7 of 41 (1273 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

fix the ΓΈ in NOTICE, cleanup some unused imports, etc.

now that Uwe fixed a performance bug in the dumb enum (it would never set endEnum=true but instead false),
I will dig up my old performance tests and see how we are looking.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 9:42 AM

Post #8 of 41 (1265 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

this patch fixes a bug i introduced when i removed recursion.
the wildcard tests do not detect it... told you i didnt trust them :)

I will add a test for this, although it was an obvious mistake on my part.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 10:20 AM

Post #9 of 41 (1261 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: BenchWildcard.java

attached is benchmark, which generates random wildcard queries.
it builds an index of 10million docs, each with a term from 0-10million.
it will fill a pattern such as N?N?N?N? with random digits, substituting a random digit for N.

||Pattern||Iter||AvgHits||AvgMS (old)||AvgMS (new)||
|N?N?N?N|10|1000.0|288.6|38.5|
|?NNNNNN|10|10.0|2453.1|6.4|
|??NNNNN|10|100.0|2484.2|10.1|
|???NNNN|10|1000.0|2821.3|47.8|
|????NNN|10|10000.0|2346.9|299.8|
|NN??NNN|10|100.0|34.8|6.3|
|NN?N*|10|10000.0|26.5|9.4|
|?NN*|10|100000.0|2009.0|73.5|
|*N|10|1000000.0|6837.4|6087.9|
|NNNNN??|10|100.0|1.9|2.3|

i would like to incorporate part of this logic into the junit tests, on maybe a smaller index, because its how i found the recursion bug.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 22, 2009, 10:17 AM

Post #10 of 41 (1232 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1606:
---------------------------------------

Attachment: LUCENE-1606-flex.patch

First cut @ cutting over to flex API attached -- note that this
applies to the flex branch, not trunk!

I made some small changes to the benchmarker: use constant score
filter mode, and print the min (not avg) time (less noise).

Also, I ported the AutomatonTermEnum to the flex API, so this is now a
better measure ("flex on flex") of what future perf will be. It's
possible there's a bug here, though TestWildcard passes.

I still need to investigate why "non-flex on non-flex" and "non-flex
on flex" perform worse.

I ran like this:

java -server -Xmx1g -Xms1g BenchWildcard

java is 1.6.0_14 64 bit, on OpenSolaris.

Results (msec is min of 10 runs each);

||Pattern||ITrunk (min msec)||(Flex (min msec)||
|N?N?N?N0.0|13|18|
|?NNNNNN|1|3|
|??NNNNN|4|6|
|???NNNN|23|28|
|????NNN|210|170|
|NN??NNN|3|3|
|NN?N*|7|4|
|?NN*|62|30|
|*N|4332|2576|
|NNNNN??|1|1|

Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest...


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 22, 2009, 3:39 PM

Post #11 of 41 (1224 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606-flex.patch

Mike, here is an update to your flex patch. I restored back two tests that disappeared (TestRegexp, etc)
Also, I converted the enum to use char[] as an experiment. i posted the results on LUCENE-2090.
this is just a hack, it stores the UTF16Result in AutomatonEnum
I figured i would pass it back in case, just in the case you wanted to play more.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 4:22 AM

Post #12 of 41 (1180 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

updated patch:
* don't seek to high surrogates, instead tack on \uDC00. this still works for trunk, but also with flex branch.
* don't use a high surrogate prefix, instead truncate. this isn't being used at all, i would rather use 'constant suffix'
* add tests that will break if lucene's sort order is not UTF-16 (or if automaton is not adjusted to the new sort order)
* add another enum constructor, where you can specify smart or dumb mode yourself
* regexp javadoc note
* add wordage to LICENSE, not just NOTICE


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 4:26 AM

Post #13 of 41 (1181 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: (was: LUCENE-1606.patch)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 4:28 AM

Post #14 of 41 (1183 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

sorry, my ide added a @author tag. i need to look to see where to turn this @author generation off for eclipse.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 10:46 AM

Post #15 of 41 (1177 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

this patch removes constant prefix, as its only used in dumb mode, and in dumb mode there is no constant prefix.
instead its replaced with constant suffix, which speeds up comparisons.

constant suffix is implemented naively as reversing the DFA, taking its constant prefix, then reversing that.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 2:20 PM

Post #16 of 41 (1171 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: (was: LUCENE-1606.patch)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 2:20 PM

Post #17 of 41 (1176 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

in this patch, if the automaton is finite, always seek.
if its infinite, keep reading terms sequentially until a term fails (then seek)

it seems to be the best of both worlds, and makes perfect sense to me.

only thing that has me nervous is that SpecialOperations.isFinite() is defined recursively... will have to look into maybe trying to write this method iteratively, in case someone builds some monster automaton from a 2 page regexp or something like that.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 2:20 PM

Post #18 of 41 (1170 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

sorry, wrong file. getting lost in iterations of this patch.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 2, 2009, 2:22 PM

Post #19 of 41 (920 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

here is an updated patch. In my opinion all that is needed is to add the random testing, and code reorganization (but no algorithmic/feature changes), maybe better docs too.

This patch has:
* more tests, especially surrogate handling
* some more automaton paring
* the seek positions and common suffix are always valid UTF-8 strings
* backtracking is aware of U+FFFF, so terms can have it.

This patch works correctly on both trunk and flex, although I don't have an included test for U+FFFF since I can't put it in a trunk index.

I'm not too terribly happy about the way the unicode handling works here, its correct but could be coded better.
The code I wrote is not the easiest to read and suggestions welcome :)
But it is all localized to one method, cleanupPosition(). This is defined as

{code}
* if the seek position cannot be converted to valid UTF-8,
* then return the next valid String (in UTF-16 sort order) that
* can be converted to valid UTF-8.
{code}

if you have ideas on how to make this nicer I am happy to hear them.



> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 4, 2009, 6:52 AM

Post #20 of 41 (902 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606.patch

I added random testing for wildcards and regexps.
Don't know what else needs to be done here, please review if you can.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 4, 2009, 10:12 AM

Post #21 of 41 (901 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-1606:
--------------------------------

Attachment: LUCENE-1606-flex.patch

attached is a port of the latest trunk patch to flex branch, for experimenting or whatever.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 7:07 AM

Post #22 of 41 (881 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606-flex.patch

Here a flex patch for automaton. It contains LUCENE-2110, as soon as 2110 is committed I will upload a new patch. But its hard to differentiate between all modified files.

Robert: Can you do performance tests with the old and new flex patch, I do not want to commit 2110 before.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 7:17 AM

Post #23 of 41 (882 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606-flex.patch

New patch, there was a lost private field. Also changed the nextSeekTerm method to be more straigtForward.

Robert: Sorry, it would be better to test this one *g*

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 7:17 AM

Post #24 of 41 (881 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: (was: LUCENE-1606-flex.patch)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 11:52 AM

Post #25 of 41 (876 views)
Permalink
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

Attachment: LUCENE-1606-flex.patch

An update with the changed nextSeekTerm() semantics from LUCENE-2110.

Robert: Can you test performance again and compare with old?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

First page Previous page 1 2 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.