Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

 

 



jira at apache

Nov 6, 2009, 10:48 AM

Post #1 of 37 (1280 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774364#action_12774364 ]

Uwe Schindler commented on LUCENE-2039:
---------------------------------------

Wrrrr brrrr grrrr gnarf

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch
>
>
> Since the early days the standard query parser has been limited to the queries living in core; adding other queries or extending the parser in any way always forced people to change the grammar file and regenerate. Even if you change the grammar you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. Eventually you had to live with all the limitations the current parser has, like tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome the limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that basically prevents the parser from interpreting any of the characters enclosed in the new special characters. I chose the forward slash '/' as the delimiter so that everything in between two forward slashes is basically escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token even if they contain other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself or simply a ctor call for a regex query. The interface remains quite simple but makes the parser extensible in an easy way compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that would not be that much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and have this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
> ...
> }
> {code}
> which I would like better as it would be more consistent with the idea of the query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.
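
The extension hook described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patch: `ParserExtension` and `SlashTokenSplitter` are hypothetical names, a plain `String` stands in for a Lucene `Query`, a slash only opens an opaque token at the start of a token, and escaped slashes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook: receives the raw text found between two '/' delimiters. */
interface ParserExtension {
    // A real extension would return a Lucene Query; a String keeps the sketch standalone.
    String parse(String rawText);
}

final class SlashTokenSplitter {
    /**
     * Splits a query string into tokens. Text between two '/' characters is kept
     * as one token (whitespace and special characters included) and handed to the
     * extension untouched; everything else is split on whitespace.
     */
    static List<String> tokenize(String query, ParserExtension extension) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '/') {
                // Opaque token: find the closing slash and pass the content through.
                int end = query.indexOf('/', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated '/' at position " + i);
                }
                tokens.add(extension.parse(query.substring(i + 1, end)));
                i = end + 1;
            } else {
                // Ordinary token: runs until the next whitespace character.
                int start = i;
                while (i < query.length() && !Character.isWhitespace(query.charAt(i))) {
                    i++;
                }
                tokens.add(query.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        ParserExtension regex = raw -> "regex(" + raw + ")";
        System.out.println(tokenize("title:foo /[a-z]+ b*/ bar", regex));
    }
}
```

Because the slash is only special at the start of a token in this sketch, a term such as `path/to/file` stays a single ordinary token; a grammar that treats '/' as special everywhere would behave differently for such input.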

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 12:01 AM

Post #2 of 37 (1225 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774568#action_12774568 ]

Grant Ingersoll commented on LUCENE-2039:
-----------------------------------------

The new QP framework is not proven out and doesn't have very many people using it and is still in contrib. This extension allows for a pretty simple way for people to add simple extensions to the current QP without having to do a whole lot of programming.



jira at apache

Nov 10, 2009, 11:19 AM

Post #3 of 37 (1207 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776023#action_12776023 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

I agree with Uwe.

I think we should implement this on the new queryparser using the opaque terms framework described in LUCENE-1823.

The current implementation of this patch will create backward-compatibility problems in the syntax for queries that use "/" characters; for example, file paths or URLs would be affected. If we are doing this, we should change the syntax to allow for opaque terms.

When we have support for opaque terms in the new queryparser, we can implement regex support with it.

Opaque terms is a framework for extending the queryparser syntax: it passes parts of the query to a smaller piece of parsing code (not a full parser) or to an analyzer, and allows extensions of the query syntax as needed without requiring changes to the Lucene code.



jira at apache

Nov 10, 2009, 11:47 AM

Post #4 of 37 (1206 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776043#action_12776043 ]

Grant Ingersoll commented on LUCENE-2039:
-----------------------------------------

I have a need for this in the Lucene Query Parser. It simply isn't practical for me to switch to using the contrib Query Parser as that would involve a fair amount of changes in the application. As for the back compat issue, I think we can work around that by having a flag set. I'll look into it a bit more.



jira at apache

Nov 10, 2009, 11:53 AM

Post #5 of 37 (1206 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776048#action_12776048 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

Regardless of which query parser, I think it would be nice to have regex support available in some query parser.

Doesn't the query parser now take Version as a required argument? Maybe the back compat issue could be solved with that?

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch


jira at apache

Nov 10, 2009, 11:55 AM

Post #6 of 37 (1207 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776050#action_12776050 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

I totally see your point, but on the other hand I really miss the option to extend the old-fashioned query parser. I do not see the new parser being THE Lucene query parser by now. Many, many people are using the JavaCC parser and will do so in the future. I possibly have another solution which preserves backwards compatibility and would support the query extension too.

The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped (that is already the case, though). To identify which extension should be used, we could utilize the field name and a pattern, so that users could plug in extensions mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)

That would not change anything in the parser as long as no extension is registered. No new character and no backwards compat issues.

Thoughts?
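
A rough sketch of the field-name mapping idea, assuming a registry that is consulted from a getFieldQuery()-style method. Everything here is hypothetical: `QueryExtension`, `ExtensionAwareParser` and `register` are illustrative names rather than Lucene API, and strings stand in for `Query` objects so the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical extension: builds a query from the quoted text of a matching field. */
interface QueryExtension {
    String build(String field, String quotedText);
}

final class ExtensionAwareParser {
    // Registration order is preserved; the first matching pattern wins.
    private final List<Map.Entry<Pattern, QueryExtension>> extensions = new ArrayList<>();

    /** Maps every field whose name matches the pattern to the given extension. */
    void register(String fieldNamePattern, QueryExtension extension) {
        extensions.add(Map.entry(Pattern.compile(fieldNamePattern), extension));
    }

    /** Stand-in for getFieldQuery(): receives the field and the text inside the quotes. */
    String getFieldQuery(String field, String quotedText) {
        for (Map.Entry<Pattern, QueryExtension> entry : extensions) {
            if (entry.getKey().matcher(field).matches()) {
                return entry.getValue().build(field, quotedText);
            }
        }
        // No extension registered for this field: behaviour is unchanged.
        return "PhraseQuery(" + field + ":\"" + quotedText + "\")";
    }

    public static void main(String[] args) {
        ExtensionAwareParser parser = new ExtensionAwareParser();
        parser.register("re_.*", (field, text) -> "RegexQuery(" + field + ":" + text + ")");
        System.out.println(parser.getFieldQuery("re_field", "^.*$"));
        System.out.println(parser.getFieldQuery("title", "sunny sky"));
    }
}
```

With no extension registered, every quoted string takes the default path, which is the backwards-compatibility property the comment is after.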



jira at apache

Nov 10, 2009, 12:15 PM

Post #7 of 37 (1208 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776057#action_12776057 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

> I think we can work around that by having a flag set. I'll look into it a bit more.

Grant, JavaCC only generates parsers; a flag is a semantic check. You need to do a lot more work to do those checks. The first step would be to build a tree using JJTree. Then you need to build the symbol table, and then you can traverse the tree to do your checks.

One solution would be creating a parser from two JavaCC files, one for < 3.0 and one for 3.0, something like Robert suggested. Then we could use the Version to choose the corresponding parser impl.

simon
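
The version-driven selection could look roughly like the sketch below. It is an assumption-laden illustration only: `GrammarVersion` is a stand-in for Lucene's Version argument, and `LegacyGrammarParser` and `SlashAwareGrammarParser` are hypothetical placeholders for the two classes JavaCC would generate from the two grammar files.

```java
/** Stand-in for Lucene's Version argument; declared in ascending release order. */
enum GrammarVersion { V2_9, V3_0 }

/** Common interface both generated parsers would be wrapped behind (hypothetical). */
interface GeneratedParser {
    String parse(String query);
}

/** Placeholder for the parser generated from the pre-3.0 grammar file. */
final class LegacyGrammarParser implements GeneratedParser {
    @Override
    public String parse(String query) {
        return "legacy:" + query; // '/' has no special meaning here
    }
}

/** Placeholder for the parser generated from the 3.0 grammar file. */
final class SlashAwareGrammarParser implements GeneratedParser {
    @Override
    public String parse(String query) {
        return "slash-aware:" + query; // '/' would delimit extension tokens here
    }
}

final class VersionedParsers {
    /** Picks the parser implementation that matches the requested version. */
    static GeneratedParser forVersion(GrammarVersion version) {
        if (version.compareTo(GrammarVersion.V3_0) >= 0) {
            return new SlashAwareGrammarParser();
        }
        return new LegacyGrammarParser();
    }

    public static void main(String[] args) {
        System.out.println(forVersion(GrammarVersion.V2_9).parse("path/to/file"));
        System.out.println(forVersion(GrammarVersion.V3_0).parse("/[a-z]+/"));
    }
}
```

The cost of this design is two grammar files to maintain; the benefit is that the choice is made once, outside the generated code, so no semantic check has to be added to either grammar.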



jira at apache

Nov 10, 2009, 12:15 PM

Post #8 of 37 (1205 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776058#action_12776058 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

Simon, personally I would prefer the Version argument used for such things.

I know this isn't popular, but I'd actually be for having, say, a 3.0 JavaCC grammar file that differs from the 2.9 one, with Version driving it.

Yeah, it would be duplicated code, but it's mostly auto-generated code anyway, and I think it would be simple to understand what is going on.



jira at apache

Nov 10, 2009, 2:31 PM

Post #9 of 37 (1200 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776123#action_12776123 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Simon,

I think one problem Lucene has today is that the queryparser code is very tightly integrated with the JavaCC code. If we continue to do that, it will always be very difficult to create a standard way of making small changes to the current queryparser.

I like the implementation proposed by Simon; it is very similar to the opaque term idea, but I would prefer not to overload the field names.
{quote}
The alternative idea is to utilize the fact that queries enclosed in double quotes are passed to getFieldQuery() and are not interpreted by the grammar. Extension queries could be embedded in quotes while the content needs to be escaped (that is already the case, though). To identify which extension should be used we could utilize the field name and a pattern so that users could plug in extensions mapped to some field name pattern. Something like: re_field:"^.*$" -> (re_field, RegexExtension)
{quote}

We should decouple the user extensions from the JavaCC-generated code. Just like in the new queryparser framework, the queryparser should allow the user to register these extensions at run time, and there should be an interface that extensions implement.

For example, something like this:
{code}
QueryParser qp = QueryParserFactory.getInstance("3.0");
qp.registerOpaqueTerm("regexp", new QueryParserRegExpParser());
qp.registerOpaqueTerm("complex_phrase", new QueryParserComplexPhraseParser());
...
qp.parse(" regexp:\"/blah*/\" complex_phrase:\"(sun OR sunny) sky\" ",...);
{code}
Of course this is not possible with the lucene queryparser code today :(,
but this is the idea I think we should try to implement.
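
A minimal sketch of what such run-time registration could look like. Everything here is hypothetical: OpaqueTermExtension and OpaqueTermRegistry are illustrative names, not Lucene classes, and a plain String stands in for org.apache.lucene.search.Query so the sketch stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract for an extension: it receives the raw, uninterpreted
// text and builds a query from it. String stands in for a Lucene Query here.
interface OpaqueTermExtension {
    String parse(String text);
}

// Hypothetical registry that a query parser could consult at run time.
class OpaqueTermRegistry {
    private final Map<String, OpaqueTermExtension> extensions = new HashMap<>();

    // Register an extension under the name used in the query string.
    void registerOpaqueTerm(String name, OpaqueTermExtension extension) {
        extensions.put(name, extension);
    }

    // Hand the opaque text to the extension registered under the given name.
    String dispatch(String name, String text) {
        OpaqueTermExtension extension = extensions.get(name);
        if (extension == null) {
            throw new IllegalArgumentException("No extension registered for: " + name);
        }
        return extension.parse(text);
    }
}
```

With this shape the generated parser would only need to recognise the extension name and call dispatch(); the grammar itself never has to know which extensions exist.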

The problem with overloading the field is that we lose the field name information for the extensions, so we need another solution that makes the field name available to them.

Here is another idea, that would allow for fieldnames not to be overloaded,
and allow regular term or phrase syntax for extensions.
{code}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/" <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp::"/blah[a-z]+[0-9]+/" <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query

{code}
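
As a rough illustration of how a clause in this syntax could be taken apart, here is a sketch using java.util.regex. The class name ExtensionClause and the convention that an empty field means the default field are assumptions made for the example; the body is returned still escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for one extension:fieldname:"syntax" clause.
class ExtensionClause {
    // extension ':' optional field ':' '"' body '"'
    // The body may contain backslash escapes such as \" and \\.
    private static final Pattern CLAUSE =
        Pattern.compile("(\\w+):(\\w*):\"((?:[^\"\\\\]|\\\\.)*)\"");

    final String extension;
    final String field; // null means "use the default field"
    final String body;  // raw body, escapes left untouched

    private ExtensionClause(String extension, String field, String body) {
        this.extension = extension;
        this.field = field;
        this.body = body;
    }

    // Returns null when the clause is not in extension syntax,
    // for example a regular field query such as title:"blah".
    static ExtensionClause parse(String clause) {
        Matcher m = CLAUSE.matcher(clause);
        if (!m.matches()) {
            return null;
        }
        String field = m.group(2).isEmpty() ? null : m.group(2);
        return new ExtensionClause(m.group(1), field, m.group(3));
    }
}
```

A regular field query has only one colon before the quoted part, so it does not match and falls through to normal handling.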





jira at apache

Nov 10, 2009, 3:01 PM

Post #10 of 37 (1199 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776139#action_12776139 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

{code}
Grant, JavaCC only generates parsers, a flag is a semantic check. You need to do a lot more work to do those checks. First step would be to build a tree using jjtree. Then you need to build the symbol table and then you can traverse the tree to do your checks.
{code}

In the new queryparser we don't use jjtree, but the same concept is implemented there:
the output of the SyntaxParser interface is a syntax tree, and, just like a jjtree tree, it is not tied to any Lucene objects.
But I think this is an ugly solution.

I think if we use the new queryparser, it allows multiple SyntaxParsers to use the same Processors and Builders.
And with a small implementation of a SyntaxParser (JavaCC, JFlex, ANTLR, a Java tokenizer, etc.), you can use the same Processors and Builders to create a Lucene query.
This will avoid duplicate code and allow for multiple syntaxes.
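
The idea of several syntaxes feeding the same processing stages can be sketched roughly as follows. This is a simplified stand-in and not the contrib queryparser API: FieldNode, QueryPipeline and PipelineDemo are hypothetical names, the two toy syntaxes are invented for the example, and a String stands in for the Lucene query that a builder would produce.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Simplified stand-in for a syntax tree node; it holds no Lucene objects.
class FieldNode {
    final String field;
    final String text;

    FieldNode(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Hypothetical pipeline: syntax parser -> processors -> builder.
class QueryPipeline {
    private final Function<String, FieldNode> syntaxParser;
    private final List<UnaryOperator<FieldNode>> processors;
    private final Function<FieldNode, String> builder;

    QueryPipeline(Function<String, FieldNode> syntaxParser,
                  List<UnaryOperator<FieldNode>> processors,
                  Function<FieldNode, String> builder) {
        this.syntaxParser = syntaxParser;
        this.processors = processors;
        this.builder = builder;
    }

    String parse(String query) {
        FieldNode node = syntaxParser.apply(query);      // syntax only
        for (UnaryOperator<FieldNode> processor : processors) {
            node = processor.apply(node);                // shared logic
        }
        return builder.apply(node);                      // shared query building
    }
}

class PipelineDemo {
    // Toy syntax 1: field:text
    static FieldNode colonSyntax(String q) {
        int i = q.indexOf(':');
        return new FieldNode(q.substring(0, i), q.substring(i + 1));
    }

    // Toy syntax 2: text@field
    static FieldNode atSyntax(String q) {
        int i = q.lastIndexOf('@');
        return new FieldNode(q.substring(i + 1), q.substring(0, i));
    }

    // Only the syntax parser varies; processors and builder are reused.
    static QueryPipeline withSharedStages(Function<String, FieldNode> syntaxParser) {
        UnaryOperator<FieldNode> lowercase =
            n -> new FieldNode(n.field, n.text.toLowerCase(Locale.ROOT));
        List<UnaryOperator<FieldNode>> processors = Collections.singletonList(lowercase);
        return new QueryPipeline(syntaxParser, processors,
            n -> "TermQuery(" + n.field + ":" + n.text + ")");
    }
}
```

Both PipelineDemo.withSharedStages(PipelineDemo::colonSyntax) and PipelineDemo.withSharedStages(PipelineDemo::atSyntax) run the same lowercase processor and the same builder; only the front end differs.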

I don't want to be a preacher here, but some of these problems are already solved in the new queryparser framework; we just need to keep improving it by adding more syntaxes, extensions and features to it.

I know the new queryparser is not in main, but that can be fixed in 3.1: if the community thinks it is stable, we should move it there.





jira at apache

Nov 11, 2009, 8:33 AM

Post #11 of 37 (1172 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776502#action_12776502 ]

Yonik Seeley commented on LUCENE-2039:
--------------------------------------

bq. I think one problem Lucene has today is that the queryparser code is very tightly integrated with the JavaCC code.

This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.

bq. We should decouple the user extensions from the JAVACC generated code.

It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.

bq. Just like the new queryparser framework does, the queryparser should allow the user to register these extensions at run time, and have an interface that extensions should implement.

I don't understand the motivation for this - it's complex and harder for a user to understand. Java's own extension mechanism (overriding) has worked perfectly fine in the past.
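
To make the two-file split and the overriding approach concrete, here is a rough sketch. The class names are hypothetical, the "grammar" is a trivial field:text split standing in for generated JavaCC code, and a String stands in for a Lucene Query. The re_ prefix follows the re_field example quoted earlier in this thread.

```java
// Stand-in for the class JavaCC would generate from the grammar file.
// It knows the syntax but leaves query construction unimplemented.
abstract class GrammarBase {
    // The real grammar is generated code; here it is a trivial field:text split.
    String parse(String clause) {
        int colon = clause.indexOf(':');
        String field = colon < 0 ? "default" : clause.substring(0, colon);
        String text = clause.substring(colon + 1);
        return getFieldQuery(field, text);
    }

    protected abstract String getFieldQuery(String field, String text);
}

// Lives in its own source file, so tweaking it needs no grammar regeneration.
class StandardParser extends GrammarBase {
    @Override
    protected String getFieldQuery(String field, String text) {
        return "TermQuery(" + field + ":" + text + ")";
    }
}

// A user extension via plain subclassing: fields prefixed with re_ become regex queries.
class RegexAwareParser extends StandardParser {
    @Override
    protected String getFieldQuery(String field, String text) {
        if (field.startsWith("re_")) {
            return "RegexQuery(" + field.substring(3) + ":" + text + ")";
        }
        return super.getFieldQuery(field, text);
    }
}
```

Note that this extends what happens to a recognised clause; it does not change which characters the grammar accepts.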




jira at apache

Nov 11, 2009, 1:28 PM

Post #12 of 37 (1166 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776668#action_12776668 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Yonik,

{quote}
This almost seems more of an issue for core lucene developers - it's an annoyance that one needs to recompile the javacc grammar when just tweaking what one of the methods does. Seems like this could easily be solved by just separating into two files... the javacc grammar would have a base class that left things like getFieldQuery() unimplemented, and then the standard QueryParser (in a different java file) would override and implement those methods.
{quote}

This solution does not fix the problem of having multiple syntaxes share the same Lucene processing code. For example, if you have one grammar in JavaCC and one in ANTLR, you can't use the Lucene QueryParser to process the output of both. You would need to re-implement the QueryParser's recursive logic in a different class to be able to use ANTLR.

{quote}
It already is today via subclassing QueryParser and overriding methods like getFieldQuery... that's very simple for users to understand and to leverage.
{quote}

True. This is simple, but is not customizable.
- You can't change the syntax.
- You can't reuse the QueryParser logic with other parsers
- If you do have to change syntax, you can't reuse QueryParser class anymore, you need to maintain your own copy of the class.

You can read LUCENE-1567 to understand the reasons for the new queryparser.
But the focus of the new queryparser is extensibility and customization,
without changing lucene code, but reusing lucene logic as much as possible.

Take a look at TestSpanQueryParserSimpleSample in the queryparser contrib, or at the precedence query parser in LUCENE-1938.
They illustrate two cases that would be very difficult to handle in the current Lucene QueryParser by overriding methods.

Actually, an implementation of PrecedenceQueryParser exists today in contrib/misc. It contains a separate JavaCC grammar and does not share any code with the main Lucene QueryParser, and it illustrates the problems I described above (code duplication, impossible to reuse if the grammar is different, easily gets outdated when the core queryparser changes).

I'm not trying to say the QueryParser in main is worse than the one in contrib.

What I'm trying to describe is that the one in contrib is more modular, and if we build the modules
for Lucene users, they will be able to build smarter and more sophisticated solutions using Lucene in less time.
Users can decide what modules to use in the queryparser and build their query pipelines with less work.

Users can also use the pre-built ones like StandardQueryParser or PrecedenceQueryParser; these should be as easy to use as the old queryparser in main.





jira at apache

Nov 11, 2009, 11:33 PM

Post #13 of 37 (1160 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776884#action_12776884 ]

Adriano Crestani commented on LUCENE-2039:
------------------------------------------

This is a new feature already suggested by Luis and Shai (maybe others too) before: the ability to delegate the syntax processing of a certain piece of the query string to another parser. It would be new to both the core QP and the contrib QP.

So, I think we should focus more on how/when a query substring will be delegated to another parser and not discuss how/when any logic will be applied to it. I think in both QPs, this part is already defined.

First, to identify this substring we would need an open and a close token. It could be either double-quote, slash or whatever. The ideal solution would allow the user to specify these two tokens. Unfortunately, I think JavaCC is not flexible enough to allow defining these tokens programmatically (after parser generation by JavaCC). So we need to stick with some specific open/close token; that's one decision we need to take. Maybe we could provide a property file, where the user could specify the open/close token and regenerate the Lucene QP using 'ant javacc' (which is pretty easy today). Anyway, by default, we could use any new token. I don't agree with double-quotes (as I think someone suggested), since they are already used by phrases, so slash is fine for me, as already defined in Simon's patch.
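
To illustrate what identifying the delegated substring involves, here is a small hand-written scanner with configurable open and close characters. It is only a sketch of the idea and is independent of JavaCC; the class name DelimitedSpanScanner and the backslash-escape convention are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-scan that finds the substrings to hand to an extension parser.
class DelimitedSpanScanner {
    // Returns the raw text of every span enclosed by open/close.
    // A backslash escapes the following character, so an escaped delimiter
    // neither opens nor closes a span. An unterminated span is ignored.
    static List<String> extract(String query, char open, char close) {
        List<String> spans = new ArrayList<>();
        int start = -1; // index of the first character inside the current span
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
                continue;
            }
            if (start < 0) {
                if (c == open) {
                    start = i + 1;
                }
            } else if (c == close) {
                spans.add(query.substring(start, i));
                start = -1;
            }
        }
        return spans;
    }
}
```

For example, extract("title:/bl[a-z]+ah/ AND body:foo", '/', '/') yields the single span bl[a-z]+ah, which would then be passed on untouched.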

Now, about any semantic(logic) processing performed on any query substring, it will be up to the QP implementation. In the core QP, its own extension would be responsible to do this processing. In the contrib QP, the extension parser would only parse the substring and return a QueryNode, which will be later processed, after the syntax parsing is complete, by the query node processors. As I said before, this part is defined and I don't think we should discuss it on this topic.

I like Simon's patch, and I think the same approach can be applied to the contrib QP. The only part I disagree with is where you pass the fieldname to the extension parser. I wouldn't implement that in the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname, and it's completely related to the way the core QP implements the semantic (logic) processing. So, in future, if the core QP needs to pass new info to its extension parser, the extension parser interface would have to be changed :S...here I go again starting a new discussion about how semantic (logic) processing should be handled :P



jira at apache

Nov 12, 2009, 12:20 PM

Post #14 of 37 (1153 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777159#action_12777159 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Simon and Adriano,
Can you comment on the example below?

{quote}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/" <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp::"/blah[a-z]+[0-9]+/" <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query
{quote}

This would allow the field name and phrases or terms to be passed to an extension, and still be very compatible with the old syntax.
(Only double quotes and backslashes need to be escaped in a phrase, so it should cover a large number of future extensions.)

Something like this would work for base64, but it would be targeted at the programmatic layer, since users will not be able to generate base64 strings by hand; it is supported by the syntax described above.
{quote}

binary:image:"base64:TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz"

{quote}
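
A short sketch of the escaping rule and of building such a base64 clause programmatically. PhraseEscaping and its method names are hypothetical helpers; java.util.Base64 is the standard JDK encoder (Java 8 and later). The standard base64 alphabet contains neither double quotes nor backslashes, so the encoded payload needs no further escaping inside the phrase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helpers for embedding arbitrary text in a double-quoted phrase.
class PhraseEscaping {
    // Escape backslashes first, then double quotes.
    static String escape(String raw) {
        return raw.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Reverse of escape: a backslash makes the next character literal.
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                c = escaped.charAt(++i);
            }
            out.append(c);
        }
        return out.toString();
    }

    // Builds a clause in the binary:field:"base64:..." shape shown above.
    static String binaryClause(String field, byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return "binary:" + field + ":\"base64:" + encoded + "\"";
    }

    public static void main(String[] args) {
        String xpath = "//title[@lang=\"c:\\windowspath\\folder\"]";
        System.out.println("xpath:xmlfield:\"" + escape(xpath) + "\"");
        System.out.println(binaryClause("image", "Man".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The XPath case shows why escaping gets unpleasant for hand-written queries: every quote and every backslash in the Windows path has to be doubled up.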

For extensions that won't work well with escaping double quotes and backslashes, we probably need some other delimiter, probably more than a single character.
Some suggestions below:
{quote}
xml style:
1) xpath:xmlfield:<[[ //title[@lang="c:\windowspath\folder" ]]>
2) xpath:xmlfield:<![CDATA[ //title[@lang="c:\windowspath\folder" ]]>

another one
3) xpath:xmlfield:\![CDATA[ //title[@lang="c:\windowspath\folder" ]]!

{quote}

Any of the sequences above is OK with me.
This should not affect old queries very much since the new syntax tokens would be
":<[[ " and "]]>" and these shouldn't be common on any lucene queries.
Still not very user friendly, but better than the base64 approach.







jira at apache

Nov 13, 2009, 3:36 AM

Post #15 of 37 (1136 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777465#action_12777465 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

Luis,
{quote}
syntax:
extension:fieldname:"syntax"

examples:
regexp:title:"/blah[a-z]+[0-9]+/" <- regexp extension, title index field
complex_phrase:title:"(sun OR sunny) sky" <- complex_phrase extension, title index field

regexp_phrase::"/blah[a-z]+[0-9]+/" <- regexp extension, default field
complex_phrase::"(sun OR sunny) sky" <- complex_phrase extension, default field

title:"blah" <- regular field query
{quote}

This is pretty much what I suggested above. We can extend the query parser without breaking backwards compatibility just by adding some code that is aware of the field name scheme. Even this could be extensible. Field names are terms and therefore cannot contain unescaped special chars like : { ] ... I would not even hard-code the separator into the query parser but have the field name processed by something pluggable. So if somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:....
Escaping a field is easy, just like you would do it with a term.
More interesting is that we do not change any syntax or add any special character, but we can still add a default implementation for extensions. This could be a whole API which takes care of creating and escaping the field name, building the query once it is passed to the extension, etc.
In a first step we resolve the extension; the second step calls the extension and builds the query. If no extension is registered, the query parser works as in previous versions, so it is all up to the user.
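
A minimal sketch of the two-step idea above (resolve the extension from the field name, then let it build the query). Everything here is an assumption made for illustration: the names FieldExtension and FieldExtensionRegistry are hypothetical and are not taken from any patch, and a plain String stands in for a Lucene Query so the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical extension contract. A String stands in for a Lucene Query
// so that the sketch compiles without any Lucene classes.
interface FieldExtension {
    String parse(String field, String rawText);
}

// Hypothetical registry; none of these names come from the patch.
final class FieldExtensionRegistry {
    private final Map<String, FieldExtension> extensions = new HashMap<>();
    private final char delimiter;      // pluggable separator, e.g. ':' or '_'
    private final String defaultField;

    FieldExtensionRegistry(char delimiter, String defaultField) {
        this.delimiter = delimiter;
        this.defaultField = defaultField;
    }

    void register(String key, FieldExtension extension) {
        extensions.put(key, extension);
    }

    // fieldName is the already unescaped field token, e.g. "re:title".
    String buildQuery(String fieldName, String text) {
        int split = fieldName.indexOf(delimiter);
        if (split >= 0) {
            // Step 1: resolve the extension from the prefix before the delimiter.
            FieldExtension extension = extensions.get(fieldName.substring(0, split));
            if (extension != null) {
                // Step 2: hand the real field (or the default field) to the extension.
                String field = fieldName.substring(split + 1);
                return extension.parse(field.isEmpty() ? defaultField : field, text);
            }
        }
        // No extension registered: behave like a regular field query.
        return "TermQuery(" + fieldName + ":" + text + ")";
    }
}

class FieldExtensionDemo {
    public static void main(String[] args) {
        FieldExtensionRegistry registry = new FieldExtensionRegistry(':', "body");
        registry.register("re", (field, text) -> "RegexQuery(" + field + ":" + text + ")");
        System.out.println(registry.buildQuery("re:title", "blah[a-z]+"));
        System.out.println(registry.buildQuery("re:", "blah[a-z]+"));
        System.out.println(registry.buildQuery("title", "blah"));
    }
}
```

Because the delimiter is a constructor argument, switching from re:field to re_field only means constructing the registry with '_'. A field such as my_field still falls through to the regular field query as long as no extension is registered under "my".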

@Adriano:
{quote}
The only part I disagree is when you pass the fieldname to the extension parser, I wouldn't implement that on the contrib parser, because it assumes the syntax always has field names. Anyway, for the core QP, I see the reason why you pass the fieldname
{quote}

You need the field to create your query in the extension; the field will always be set to either the default field or the field explicitly defined in the query. There is no reason why we should not pass it.
I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though.
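
To illustrate the wrapper idea, here is a small sketch under stated assumptions: ExtensionInput and WrappedExtension are hypothetical names invented for this example, and a String again stands in for a Lucene Query. The point is only that the extension method takes a single object, so new information can be added as a member later without touching the method signature.

```java
// Hypothetical wrapper around everything the parser hands to an extension.
final class ExtensionInput {
    private final String field;
    private final String rawQuery;

    ExtensionInput(String field, String rawQuery) {
        this.field = field;
        this.rawQuery = rawQuery;
    }

    String getField() { return field; }

    String getRawQuery() { return rawQuery; }

    // A future member would be added here with its own getter,
    // leaving the parse(ExtensionInput) signature untouched.
}

// Hypothetical extension contract; a String stands in for a Lucene Query.
interface WrappedExtension {
    String parse(ExtensionInput input);
}

class ExtensionInputDemo {
    public static void main(String[] args) {
        WrappedExtension regex =
            input -> "RegexQuery(" + input.getField() + ":" + input.getRawQuery() + ")";
        System.out.println(regex.parse(new ExtensionInput("title", "blah[a-z]+[0-9]+")));
    }
}
```

A typed wrapper keeps compile-time checking of what an extension may read; a Map keyed by name would be the looser alternative, trading that checking for flexibility.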


> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch

jira at apache

Nov 13, 2009, 1:24 PM

Post #16 of 37 (1120 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777667#action_12777667 ]

Adriano Crestani commented on LUCENE-2039:
------------------------------------------

{quote}
This is pretty much what I suggested above. We can extend the query parser without breaking backwards compatibility just by adding some code that is aware of the field name scheme. Even this could be extensible. Field names are terms and therefore cannot contain unescaped special chars like : { ] ... I would not even hard-code the separator into the query parser but have the field name processed by something pluggable. So if somebody wants to have a regex extension they could use re\:field: or re\:: or re_field:....
Escaping a field is easy, just like you would do it with a term.
More interesting is that we do not change any syntax or add any special character, but we can still add a default implementation for extensions. This could be a whole API which takes care of creating and escaping the field name, building the query once it is passed to the extension, etc.
In a first step we resolve the extension; the second step calls the extension and builds the query. If no extension is registered, the query parser works as in previous versions, so it is all up to the user.
{quote}

+1 :)

{quote}
I agree with you that we should wrap the information in a class so that we do not need to change the method signature if something has to be changed in the future. Instead we just add a new member to the wrapper though.
{quote}

A Map should solve this problem

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch

jira at apache

Nov 16, 2009, 5:47 PM

Post #17 of 37 (1044 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778685#action_12778685 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

+1

I'll work on changing the queryparser in contrib to implement that syntax for the opaque terms.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch

jira at apache

Nov 18, 2009, 2:24 PM

Post #18 of 37 (1012 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779694#action_12779694 ]

Luis Alves commented on LUCENE-2039:
------------------------------------

Hi Simon,

I also posted a patch in LUCENE-1823 that implements the ext:field approach and adds a JUnit test that implements a new QParser for regex.

If you have time, can you take a look at the classes in the standart2 test folder, RegexQueryParser and TestOpaqueExtensionQuery, and review the test case?



> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch

jira at apache

Nov 19, 2009, 10:39 AM

Post #19 of 37 (981 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780193#action_12780193 ]

David Kaelbling commented on LUCENE-2039:
-----------------------------------------

I apologize if I haven't read the comments carefully enough, but in LUCENE-2039_field_ext.patch why is ExtendableQueryParser final? That means (for example) that ComplexPhraseQueryParser cannot subclass it. In the earlier LUCENE-2039.patch the complex phrase parser picked up the changes for free.


> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch

jira at apache

Nov 19, 2009, 12:27 PM

Post #20 of 37 (977 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780254#action_12780254 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities.
This is only one possible implementation of regex support and has some scalability problems.


> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch

jira at apache

Nov 19, 2009, 12:39 PM

Post #21 of 37 (981 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780258#action_12780258 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

bq. That means (for example) that ComplexPhraseQueryParser cannot subclass it
This patch was not meant to include ComplexPhraseQueryParser; it is rather a proposal for the concept of field "overloading". But you are right, the parser should not be final at all; especially if you want to override a get*query method, it should be extendable.

bq. Hi, in my opinion RegexParserExtension should not be tied to RegexQuery/RegexCapabilities.
This is only one possible implementation of regex support and has some scalability problems.

Also true, but again this is just a POC to show what it would look like. Comments on the concept would be more useful for now.
I wrote that up during a train ride and aimed to get some comments. I have already worked on it and will upload a new patch soon which includes RegexCapabilities + tests.
Thanks again for the pointer with the final class.
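
As a rough illustration of why the parser should stay non-final, the sketch below shows a subclass overriding a protected query factory hook. The class names OpenParser and PhraseAwareParser are hypothetical, the hook is only modelled on the get*query methods mentioned above, and a String stands in for a Lucene Query.

```java
// Hypothetical non-final parser; a String stands in for a Lucene Query.
class OpenParser {
    String parse(String field, String text) {
        return getFieldQuery(field, text);
    }

    // Protected hook that subclasses may override.
    protected String getFieldQuery(String field, String text) {
        return "TermQuery(" + field + ":" + text + ")";
    }
}

// A subclass can change how some inputs are handled and reuse the rest.
class PhraseAwareParser extends OpenParser {
    @Override
    protected String getFieldQuery(String field, String text) {
        if (text.indexOf(' ') >= 0) {
            return "PhraseQuery(" + field + ":\"" + text + "\")";
        }
        return super.getFieldQuery(field, text);
    }
}

class OpenParserDemo {
    public static void main(String[] args) {
        OpenParser parser = new PhraseAwareParser();
        System.out.println(parser.parse("title", "sunny sky"));
        System.out.println(parser.parse("title", "sun"));
    }
}
```

Declaring OpenParser final would make PhraseAwareParser fail to compile, which is the situation described for ComplexPhraseQueryParser.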

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch

jira at apache

Nov 19, 2009, 2:43 PM

Post #22 of 37 (981 views)
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780316#action_12780316 ]

Mark Miller commented on LUCENE-2039:
-------------------------------------

It looks like the patch puts this in core? Any compelling reason? Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser.

> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch
>
>
> Since the early days, the standard query parser has been limited to the queries living in core; adding other queries or extending the parser in any way has always forced people to change the grammar file and regenerate. Even if you change the grammar, you have to be extremely careful how you modify the parser so that other parts of the standard parser are not affected by customisation changes. In the end you have to live with all the limitations of the current parser, such as tokenizing on whitespace before a tokenizer / analyzer has the chance to look at the tokens.
> I was thinking about how to overcome this limitation and add regex support to the query parser without introducing any dependency to core. I added a new special character that prevents the parser from interpreting any of the characters enclosed by it. I chose the forward slash '/' as the delimiter, so everything between two forward slashes is escaped and ignored by the parser. All chars embedded within forward slashes are treated as one token, even if they include other special chars like * []?{} or whitespace. This token is subsequently passed to a pluggable "parser extension" which builds a query from the embedded string. I do not interpret the embedded string in any way but leave all the subsequent work to the parser extension. Such an extension could be another full-featured query parser itself or simply a ctor call for a regex query. The interface remains quite simple but makes the parser easy to extend compared to modifying the JavaCC sources.
> The downside of this patch is clearly that I introduce a new special char into the syntax, but I guess that is not much of a deal as it is reflected in the escape method. It would truly be nice to have more than one extension and make this even more flexible, so treat this patch as a kickoff.
> Another way of solving the problem with RegexQuery would be to move the JDK version of regex into the core and simply have another method like:
> {code}
> protected Query newRegexQuery(Term t) {
> ...
> }
> {code}
> which I would like better, as it would be more consistent with the idea of the query parser being a very strict and defined parser.
> I will upload a patch in a second which implements the extension-based approach. I guess I will add a second patch with regex in core soon too.
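
The delimiter idea in the description above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patches and does not use any Lucene API: the class name `SlashExtensionSketch`, the methods `tokenize` and `parse`, and the use of plain strings in place of query objects are all assumptions made for illustration. It only shows the two steps described: text between two forward slashes is kept as one uninterpreted token, and that token is handed to a pluggable extension.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the patch or from Lucene.
final class SlashExtensionSketch {

    /** Splits on whitespace, but keeps everything between two '/' as a single token. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inExtension = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '/' && !inExtension) {
                // Opening slash: flush any pending plain token, then start the escaped span.
                flush(current, tokens);
                current.append(c);
                inExtension = true;
            } else if (c == '/') {
                // Closing slash: the escaped span is complete and becomes one token.
                current.append(c);
                flush(current, tokens);
                inExtension = false;
            } else if (Character.isWhitespace(c) && !inExtension) {
                flush(current, tokens);
            } else {
                // Inside the slashes nothing is interpreted, not even whitespace.
                current.append(c);
            }
        }
        // An unterminated span is emitted as-is and later treated as a plain term.
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    /** Hands each slash-delimited token to the extension; strings stand in for queries. */
    static List<String> parse(String input, Function<String, String> extension) {
        List<String> clauses = new ArrayList<>();
        for (String token : tokenize(input)) {
            boolean delimited = token.length() >= 2 && token.startsWith("/") && token.endsWith("/");
            if (delimited) {
                clauses.add(extension.apply(token.substring(1, token.length() - 1)));
            } else {
                clauses.add("term(" + token + ")");
            }
        }
        return clauses;
    }
}
```

With an extension such as `s -> "regex(" + s + ")"`, the input `foo /a b[0-9]*/ bar` yields three clauses, and the middle one receives the raw string `a b[0-9]*` including its whitespace and special characters.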

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 21, 2009, 12:16 PM

Post #23 of 37 (911 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781042#action_12781042 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

bq. Offhand I'd think it would go in the misc contrib with the other queryparsers that extend the core queryparser.

For sure. I will attach another patch - did not think about that too much when I moved from the first proposal, which modified the core one.





jira at apache

Nov 25, 2009, 10:14 AM

Post #24 of 37 (816 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782540#action_12782540 ]

Robert Muir commented on LUCENE-2039:
-------------------------------------

bq. JavaUtil seems to be reasonable anyway after the latest Jakarta Regexp drama

Yeah, we shouldn't mislead anyone into believing the constant prefix with Jakarta actually works, even now.
For example, the constant prefix it reports for (ab|ac) is not "a" but the empty string.
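
To make the expected answer concrete, here is a minimal sketch of what a constant prefix is for an alternation of plain literals. It is not how Jakarta Regexp or Lucene computes it; the class `ConstantPrefixSketch` and the method `commonPrefix` are hypothetical names, and the sketch assumes the alternatives are already split out and contain no regex operators.

```java
// Hypothetical sketch: handles plain literal alternatives only, no regex syntax.
final class ConstantPrefixSketch {

    /** Returns the longest prefix shared by every alternative. */
    static String commonPrefix(String... alternatives) {
        if (alternatives.length == 0) {
            return "";
        }
        String prefix = alternatives[0];
        for (String alt : alternatives) {
            int i = 0;
            // Advance while both strings agree, then cut the prefix at the first mismatch.
            while (i < prefix.length() && i < alt.length() && prefix.charAt(i) == alt.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }
}
```

For the alternatives `ab` and `ac` this gives `a`, which is the prefix every match of (ab|ac) must start with.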


> Regex support and beyond in JavaCC QueryParser
> ----------------------------------------------
>
> Key: LUCENE-2039
> URL: https://issues.apache.org/jira/browse/LUCENE-2039
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2039.patch, LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch, LUCENE-2039_field_ext.patch



jira at apache

Nov 30, 2009, 1:53 AM

Post #25 of 37 (738 views)
Permalink
[jira] Commented: (LUCENE-2039) Regex support and beyond in JavaCC QueryParser [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783588#action_12783588 ]

Simon Willnauer commented on LUCENE-2039:
-----------------------------------------

The contrib/regex dependency on contrib/misc bugs me a bit, though. I have the impression that this default regex extension should not be part of this patch. The extension seems so trivial that users could implement it on their own. This would save us the dependency and, IMO, would not be a problem for users.

Any thoughts?
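
As a rough illustration of how small such a user-written extension could be, here is a sketch built only on `java.util.regex` from the JDK. It does not use the extension interface from the patch or any Lucene class: `UserRegexExtensionSketch`, `RegexMatcher` and `fromExtensionString` are hypothetical names, and a list of strings stands in for the indexed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: a stand-in for a user-supplied regex extension, JDK classes only.
final class UserRegexExtensionSketch {

    /** Stand-in for a query object: holds the compiled pattern. */
    static final class RegexMatcher {
        private final Pattern pattern;

        RegexMatcher(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        /** Returns the terms that the pattern matches in full. */
        List<String> matching(List<String> terms) {
            List<String> hits = new ArrayList<>();
            for (String term : terms) {
                if (pattern.matcher(term).matches()) {
                    hits.add(term);
                }
            }
            return hits;
        }
    }

    /** The whole "extension": turn the raw embedded string into a matcher. */
    static RegexMatcher fromExtensionString(String raw) {
        return new RegexMatcher(raw);
    }
}
```

The extension itself is a single constructor call, which is the sense in which leaving it to users would cost them little.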


