Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 19, 2009, 8:48 PM

Post #1 of 22 (1116 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex)

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780430#action_12780430 ]

Mark Miller edited comment on LUCENE-1606 at 11/20/09 4:46 AM:
---------------------------------------------------------------

So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it :)

You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

was (Author: markrmiller [at] gmail):
So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it :)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:10 PM

Post #2 of 22 (1084 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780440#action_12780440 ]

Mark Miller edited comment on LUCENE-1606 at 11/20/09 5:10 AM:
---------------------------------------------------------------

{quote}
By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow.
{quote}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

{quote}
but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
I didnt implement this here though yet.
{quote}

Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

was (Author: markrmiller [at] gmail):
{code}
By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
in the same manner, regex of .abacadaba is fast, wildcard of .*abacadaba is still slow.
{code}

No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to the current WCQ that with this?

{quote}
but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
I didnt implement this here though yet.
{quote}

Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:18 PM

Post #3 of 22 (1094 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780441#action_12780441 ]

Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:16 AM:
---------------------------------------------------------------

bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

I don't think there is any issue. by implementing WildcardQuery with the DFA, leading ? is no longer a problem,
i mean depending on your term dictionary if you do something stupid like ???????abacadaba it probably wont be that fast.

I spent a lot of time with the worst-case regex, wildcards to ensure performance is at least as good as the other alternatives.
There is only one exception, the leading * wildcard is a bit slower with a DFA than if you ran it with actual WildcardQuery (less than 5% in my tests)
Because of this, currently this patch rewrites this very special case to a standard WildcardQuery.

bq. Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries :)


was (Author: rcmuir):
bq. No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

I don't think there is any issue. by implementing WildcardQuery with the DFA, leading ? is no longer a problem,
i mean depending on your term dictionary if you do something stupid like ???????abacadaba it probably wont be that fast.

I spent a lot of time with the worst-case regex, wildcards to ensure performance is at least as good as the other alternatives.
There is only one exception, the leading * wildcard is a bit slower with a DFA than if you ran it with actual WildcardQuery (less than 5% in my tests)
Because of this, currently this patch rewrites this very special case to a standard WildcardQuery.

bq. Now that sounds interesting - now sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed order for the leading */.* case, except then this reversing buys you faster wildcard AND regex queries :)


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 19, 2009, 9:32 PM

Post #4 of 22 (1084 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780447#action_12780447 ]

Robert Muir edited comment on LUCENE-1606 at 11/20/09 5:31 AM:
---------------------------------------------------------------

bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading * wildcards on 100M+ term dictionary.
I found out what was acceptable (clarification: to these specific users/system) was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.

was (Author: rcmuir):
bq. Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading * wildcards on 100M+ term dictionary.
I found out what was acceptable was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this.
This performed very well, because it can still avoid comparing many terms.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 8:28 AM

Post #5 of 22 (1065 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780623#action_12780623 ]

Robert Muir edited comment on LUCENE-1606 at 11/20/09 4:27 PM:
---------------------------------------------------------------

attached is an alternate patch with no library dependency (LUCENE-1606_nodep.patch)
instead it imports 'pared-down' automaton source code (compiles to 48KB jar)
it is still setup in contrib regex because...

Mark: some practical questions, I'd like to create a patch that integrates it nicely into core, just so we can see what it would look like.
Thoughts on class names and pkg names?
* I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?
* but then what should AutomatonRegexQuery be called, we already have RegexQuery :)

Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).
should we rename the pkg?

sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?
also, i did not remove any tests, for example, TestWildcardQuery already exists, so the test here is just duplication, i just might add a test or 2 to the existing TestWildcardQuery


was (Author: rcmuir):
attached is an alternate patch with no library dependency (LUCENE-1606_nodep.patch)
instead it imports 'pared-down' automaton source code (compiles to 48KB jar)
it is still setup in contrib regex because...

Mark: some practical questions, I'd like to create a patch that integrates it nicely into core, just so we can see what it would look like.
Thoughts on class names and pkg names?
* I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?
* but then what should AutomatonRegexQuery be called, we already have RegexQuery :)

Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).
should we rename the pkg?

sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 2:47 PM

Post #6 of 22 (1070 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780801#action_12780801 ]

Uwe Schindler edited comment on LUCENE-1606 at 11/20/09 10:45 PM:
------------------------------------------------------------------

bq. Uwe, i looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

That was my idea!

bq. This is where all the logic really is anyway.

We should simply add a test for this method and everything else is the WildCardEnum. The good thing of subclassing it is, that one has a more performat class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitely mark it as "dead code".

The good thing: the method is final (this is what I see from yor fragment) - so nobody was able to override it to change the behaviour of the enum, so nothing can break.

I would go this way.

was (Author: thetaphi):
bq. Uwe, i looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

That was my idea!

bq. This is where all the logic really is anyway.

We should simply add a test for this method and everything else is the WildCardEnum. The good thing of subclassing it is, that one has a more performat class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitely mark it as "dead code".

I would go this way.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 20, 2009, 3:35 PM

Post #7 of 22 (1067 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780833#action_12780833 ]

Robert Muir edited comment on LUCENE-1606 at 11/20/09 11:33 PM:
----------------------------------------------------------------

bq. That would make the enum ugly... But would work.

This is why i did not do it (i tried and it was ugly), i did not want to make a complicated enum ugly!
I'll try to think of how this can be done without it being so ugly.

edit, by the way Uwe, if you are bored and want to take a stab at this :) You know your way around multitermquery better than me.

was (Author: rcmuir):
bq. That would make the enum ugly... But would work.

This is why i did not do it (i tried and it was ugly), i did not want to make a complicated enum ugly!
I'll try to think of how this can be done without it being so ugly.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 11:48 AM

Post #8 of 22 (1041 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781037#action_12781037 ]

Robert Muir edited comment on LUCENE-1606 at 11/21/09 7:47 PM:
---------------------------------------------------------------

Mark, one last comment. I want to mention this impl is largely unoptimized (from a code perspective, but the algorithm is better)
I think you see that from the NNNNN?? being 2.3ms on average versus 1.9, not that I am sure that isnt just a random hiccup.

So I want to incorporate the logic of some of this benchmark into the tests, so that we can improve the actual code impl. to speed up cases like that.
While i focus on the scalability, i know a lot of people have small indexes and maybe lots of qps and I don't want to slow them down.

Some of this is easy, for example we make State.getSortedTransitionArray public, so we don't have to convert from arrays to lists to arrays and such, for no good reason.


was (Author: rcmuir):
Mark, one last comment. I want to mention this impl is largely unoptimized (from a code perspective, but the algorithm is better)
I think you see that from the NNNNN?? being 2.3ms on average versus 1.9, not that I am sure that isnt just a random hiccup.

So I want to incorporate the logic of some of this benchmark into the tests, so that we can improve the actual code impl. to speed up cases like that.
While i focus on the scalability, i know a lot of people have small indexes and maybe lots of qps and I don't want to slow them down.

Some of this is easy, for example we make Transition.getSortedTransitionArray public, so we don't have to convert from arrays to lists to arrays and such, for no good reason.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 22, 2009, 9:07 AM

Post #9 of 22 (1029 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781162#action_12781162 ]

Robert Muir edited comment on LUCENE-1606 at 11/22/09 5:06 PM:
---------------------------------------------------------------

bq. If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

from a glance, it appears to me that both the syntax and functionality of our two contrib impls (java.util and jakarta), are very different.

here is one example. Java.util supports reluctant {m,n} closures, jakarta does not, it says this right in the javadocs.
http://jakarta.apache.org/regexp/apidocs/org/apache/regexp/RE.html

Should RE support reluctant {m,n} closures (does anyone care)?
But it supports reluctant versus greedy for other operators.

in automaton, this concept of reluctance versus greedy, does not even exist, as spelled out on their page:
The * operator is mathematically the Kleene star operator (i.e. we don't have greedy/reluctant/possesive variants).
http://www.brics.dk/automaton/faq.html

this is an example, where all 3 are different... i guess i kinda assumed everyone was aware that all these regex packages are very different.

was (Author: rcmuir):
bq. If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

from a glance, it appears to me that both the syntax and functionality of our two contrib impls (java.util and jakarta), are very different.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 23, 2009, 7:38 AM

Post #10 of 22 (1007 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781431#action_12781431 ]

Robert Muir edited comment on LUCENE-1606 at 11/23/09 3:38 PM:
---------------------------------------------------------------

Mike, just one comment here.

I am definitely willing to do refactoring here to support this byte[] scheme if necessary, I don't want to give the wrong impression. I think i already have an issue here related to UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just writing a Comparator.

edit: pretty sure this exists. If someone has say, both data from Arabic Presentation forms and Chinese text outside the BMP in the index, the "smart" enumerator will unknowningly skip right past the Arabic Presentation forms block, because it sorts after the lead surrogate in UTF-16 order, but before the entire codepoint in UTF-8/UTF-32 order. I have not experienced this in practice, because i normalize my text so i don't have stuff in Arabic Presentation forms block :) I can fix this, but i would like to see what the approach is for the flex branch, as its sufficiently compex that I would rather not fix it twice.

I am just concerned about other similar applications outside of lucene, or some already in lucene core itself!


was (Author: rcmuir):
Mike, just one comment here.

I am definitely willing to do refactoring here to support this byte[] scheme if necessary, I don't want to give the wrong impression. I think i already have an issue here related to UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just writing a Comparator.

I am just concerned about other similar applications outside of lucene, or some already in lucene core itself!


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 5:30 AM

Post #11 of 22 (966 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781914#action_12781914 ]

Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:30 PM:
---------------------------------------------------------------

i think there is one last problem with this for flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex the matches say abacadaba[\uFFFC\uFFFF]. in this case, the match on abacadaba\uFFFD will fail, it will try to seek to the "next" string, which is abacadaba\uFFFF, but the FFFF will get replaced by FFFD by the byte conversion, and we will loop.

mike i don't think this should be any back compat concern, unlike the high surrogate case which i think many CJK applications are probably doing to...


was (Author: rcmuir):
i think there is one last problem with this for flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex the matches say abacadaba[\uFFFC\uFFFE]. in this case, the match on abacadaba\uFFFD will fail, it will try to seek to the "next" string, which is abacadaba\uFFFE, but the FFFE will get replaced by FFFD by the byte conversion, and we will loop.

mike i don't think this should be any back compat concern, unlike the high surrogate case which i think many CJK applications are probably doing to...


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 5:45 AM

Post #12 of 22 (967 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781926#action_12781926 ]

Robert Muir edited comment on LUCENE-1606 at 11/24/09 1:44 PM:
---------------------------------------------------------------

bq. about the cleanupPrefix method: it is only used in the linear case to initially set the termenum. What happens if the nextString() method returs such a string ussed to seek the next enum?

look at the code to nextString() itself.
it uses cleanupPosition() which works differently.

when seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc, it cannot be converted into UTF-8 bytes. (and we can't use the \uDC00 trick, obviously, or startsWith() will return false when it should not)

was (Author: rcmuir):
bq. about the cleanupPrefix method: it is only used in the linear case to initially set the termenum. What happens if the nextString() method returs such a string ussed to seek the next enum?

look at the code to nextString() itself.
it uses cleanSeek() which works differently.

when seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
when using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc, it cannot be converted into UTF-8 bytes. (and we can't use the \uDC00 trick, obviously, or startsWith() will return false when it should not)

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 9:10 AM

Post #13 of 22 (964 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782026#action_12782026 ]

Uwe Schindler edited comment on LUCENE-1606 at 11/24/09 5:09 PM:
-----------------------------------------------------------------

Did he changed the FilteredTermEnum.next() loops? if yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes.

I am still thinking about an extension of FilteredTermEnum that works with these repositioning out of the box. But I have no good idea. The work in FilteredTerm*s*Enum is a good start, but may be extended, to also support something like a return value "JUMP_TO_NEXT_ENUM" and a mabstract method "nextEnum()" that returns null per default (no further enum).

was (Author: thetaphi):
Did he changed the FilteredTermEnum.next() loops? if yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 24, 2009, 2:06 PM

Post #14 of 22 (957 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782198#action_12782198 ]

Robert Muir edited comment on LUCENE-1606 at 11/24/09 10:04 PM:
----------------------------------------------------------------

benchmark results from mike's idea. I don't use any heuristic, just remove the extra 'next' to show the tradeoffs.

disclaimer: against trunk with LUCENE-2075

||Pattern||Iter||AvgHits||AvgMS||AvgMS (noNext)||
|N?N?N?N|10|1000.0|37.5|28.4|
|?NNNNNN|10|10.0|6.4|6.1|
|??NNNNN|10|100.0|9.6|9.2|
|???NNNN|10|1000.0|52.7|40.9|
|????NNN|10|10000.0|300.7|262.3|
|NN??NNN|10|100.0|4.9|4.1|
|NN?N*|10|10000.0|9.6|28.9|
|?NN*|10|100000.0|80.4|235.4|
|*N|10|1000000.0|3811.6|3747.5|
|*NNNNNN|10|10.0|2098.3|2221.9|
|NNNNN??|10|100.0|2.2|2.4|

Mike my gut feeling, which will require a lot more testing, is that if the automaton accepts a finite language (in the wildcard case, no *), we should not do the next() call.
but more benchmarking is needed, with more patterns, especially on flex branch to determine if this heuristic is best.


was (Author: rcmuir):
benchmark results from mike's idea. I don't use any heuristic, just remove the extra 'next' to show the tradeoffs.

||Pattern||Iter||AvgHits||AvgMS||AvgMS (noNext)||
|N?N?N?N|10|1000.0|37.5|28.4|
|?NNNNNN|10|10.0|6.4|6.1|
|??NNNNN|10|100.0|9.6|9.2|
|???NNNN|10|1000.0|52.7|40.9|
|????NNN|10|10000.0|300.7|262.3|
|NN??NNN|10|100.0|4.9|4.1|
|NN?N*|10|10000.0|9.6|28.9|
|?NN*|10|100000.0|80.4|235.4|
|*N|10|1000000.0|3811.6|3747.5|
|*NNNNNN|10|10.0|2098.3|2221.9|
|NNNNN??|10|100.0|2.2|2.4|

Mike my gut feeling, which will require a lot more testing, is that if the automaton accepts a finite language (in the wildcard case, no *), we should not do the next() call.
but more benchmarking is needed, with more patterns, especially on flex branch to determine if this heuristic is best.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 25, 2009, 8:27 AM

Post #15 of 22 (933 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782485#action_12782485 ]

Robert Muir edited comment on LUCENE-1606 at 11/25/09 4:26 PM:
---------------------------------------------------------------

bq. OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

I guess what I am saying is I think the latest patch, which uses the isFinite() property of the DFA to determine whether or not to seek itself versus trying next(), is the best for both trunk and flex, but for different reasons?

edit:
the different reasons being: seek is expensive in trunk, because of the SegmentTermEnum clone()
seek is "expensive" in flex, because doing a seek when we are in a loop entails unicode conversion, but next() avoids this with TermRef comparison.

It gives the best of all the scores from https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12782198&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12782198

In the future this could be refined, such that whether to try the extra next() or go ahead and seek() could instead be driven by whether or not we are 'ping-ponging against a loop', i.e. actually pingponging against a wildcard *, rather than computed up-front for the entire query. this can be determined from the state/transitions of the path being evaluated, but its not a one-liner!


was (Author: rcmuir):
bq. OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

I guess what I am saying is I think the latest patch, which uses the isFinite() property of the DFA to determine whether or not to seek itself versus trying next(), is the best for both trunk and flex, but for different reasons?
It gives the best of all the scores from https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12782198&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12782198

In the future this could be refined, such that whether to try the extra next() or go ahead and seek() could instead be driven by whether or not we are 'ping-ponging against a loop', i.e. actually pingponging against a wildcard *, rather than computed up-front for the entire query. this can be determined from the state/transitions of the path being evaluated, but its not a one-liner!


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 12:44 PM

Post #16 of 22 (767 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786489#action_12786489 ]

Mark Miller edited comment on LUCENE-1606 at 12/5/09 8:43 PM:
--------------------------------------------------------------

The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new WildcardQuery impl.

I'm using a 2 million doc english and 2 million doc french index. (wikipedia dumps)

Generating random queries - both random short strings built up from random unicode chars mixed with some random wildcards, and random english/french words from dictionaries, randomly chopped or not, with random wildcards injected. A whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of running).

The new impl is generally either a bit faster in these cases, or about the same - at worst (in general), I've seen it about .01s slower. When its faster, its offten > .1s faster (or more when a few '?' are involved).

On avg, I'd say the perf is about the same - where the new impl shines appears to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.

was (Author: markrmiller [at] gmail):
The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new WildcardQuery impl.

I'm using a 2 million doc english and 2 million doc french index.

Generating random queries - both random short strings built up from random unicode chars mixed with some random wildcards, and random english/french words from dictionaries, randomly chopped or not, with random wildcards injected. A whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of running).

The new impl is generally either a bit faster in these cases, or about the same - at worst (in general), I've seen it about .01s slower. When its faster, its offten > .1s faster (or more when a few '?' are involved).

On avg, I'd say the perf is about the same - where the new impl shines appears to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 3:07 PM

Post #17 of 22 (768 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786529#action_12786529 ]

Mark Miller edited comment on LUCENE-1606 at 12/5/09 11:06 PM:
---------------------------------------------------------------

bq. I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would have guess (wrongly it appears), that it would be more. One diff is that I think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

was (Author: markrmiller [at] gmail):
bq. I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would have guess (wrongly it appears), that it would be more. One diff is that I think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no loss of performance. I think the benches you have already done are probably plenty good for benefits testing.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 3:19 PM

Post #18 of 22 (766 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786529#action_12786529 ]

Mark Miller edited comment on LUCENE-1606 at 12/5/09 11:18 PM:
---------------------------------------------------------------

bq. I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would have guess (wrongly it appears), that it would be more. One diff is that I think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't add too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

was (Author: markrmiller [at] gmail):
bq. I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would have guess (wrongly it appears), that it would be more. One diff is that I think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't too many more in the scheme of things.

bq. More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 5:00 PM

Post #19 of 22 (771 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786551#action_12786551 ]

Mark Miller edited comment on LUCENE-1606 at 12/6/09 1:00 AM:
--------------------------------------------------------------

Sorry - haven't been paying a lot of attention to all of the Unicode issues/talk lately.

Could you briefly explain cleanupPosition? Whats the case where a seek position cannot be converted to UTF-8?

*edit*

Because of next string guesses that might not be valid UTF-8?

was (Author: markrmiller [at] gmail):
Sorry - haven't been paying a lot of attention to all of the Unicode issues/talk lately.

Could you briefly explain cleanupPosition? Whats the case where a seek position cannot be converted to UTF-8?

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 6:00 PM

Post #20 of 22 (771 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786557#action_12786557 ]

Robert Muir edited comment on LUCENE-1606 at 12/6/09 1:59 AM:
--------------------------------------------------------------

here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a trailing lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4/5:
// an unpaired low surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE0000 (the lowest possible "upper BMP")
unpaired L -> \uE000


was (Author: rcmuir):
here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and L with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a trailing lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4/5:
// an unpaired low surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE0000 (the lowest possible "upper BMP")
unpaired L -> \uE000


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 6:02 PM

Post #21 of 22 (767 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786557#action_12786557 ]

Robert Muir edited comment on LUCENE-1606 at 12/6/09 2:01 AM:
--------------------------------------------------------------

here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a trailing lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4/5:
// an unpaired low surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE000 (the lowest possible "upper BMP")
unpaired L -> \uE000

edit: sorry for the many edits :)


was (Author: rcmuir):
here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a trailing lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4/5:
// an unpaired low surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE0000 (the lowest possible "upper BMP")
unpaired L -> \uE000


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Dec 5, 2009, 6:06 PM

Post #22 of 22 (768 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-1606) Automaton Query/Filter (scalable regex) [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786557#action_12786557 ]

Robert Muir edited comment on LUCENE-1606 at 12/6/09 2:05 AM:
--------------------------------------------------------------

here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a final lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC00 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4:
// an unpaired trail surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE000 (the lowest possible "upper BMP")
unpaired L -> \uE000
case 5:
// this is just like case 4, its obviously illegal because the term starts with a trail surrogate.
// (because it is in initial position)
^L -> \uE000

edit: sorry for the many edits :)


was (Author: rcmuir):
here is an explanation of the cleanupPosition, and the cases it handles

Symbols:
A = \u0000 - \uD7FFF (lower BMP)
H = \uD800 - \uDBFFF (high/lead surrogates)
L = \uDC00 - \uDFFFF (low/trail surrogates)
Z = \uE000 - \uFFFF (upper BMP)

case 1:
// an illegal combination, where we have not yet enumerated into the supp. plane
// so we increment to H + \uDC00 (the lowest possible trail surrogate)
HA -> H \uDC00
HH -> H \uDC00
case 2:
// an illegal combination where we have already enumerated the supp. plane
// we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
HZ -> \uE000
case 3:
// an illegal combination where we have a trailing lead surrogate.
// we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate)
// this is just like case 1, except in final position.
H$ -> H \uDC00
case 4/5:
// an unpaired low surrogate. this is invalid when not preceded by lead surrogate
// (and if there was one, the above rules would have dealt with it already)
// in this case we have to bump to \uE000 (the lowest possible "upper BMP")
unpaired L -> \uE000

edit: sorry for the many edits :)


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
> Key: LUCENE-1606
> URL: https://issues.apache.org/jira/browse/LUCENE-1606
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
> 1. lexicography/etc on large text corpora
> 2. looking for things such as urls where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
> The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:
>
> 1. Look at the portion that is OK (did not enter a reject state in the DFA)
> 2. Generate the next possible String and seek to that.
> the Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.