
Mailing List Archive: Lucene: Java-Dev

DO NOT REPLY [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search?

 

 



bugzilla at apache

Aug 28, 2002, 12:52 PM

Post #1 of 21 (360 views)
Permalink
DO NOT REPLY [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search?

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12137>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=12137

Can '*' or '?' symbol be used as the first character of a search?

Summary: Can '*' or '?' symbol be used as the first character of
a search?
Product: Lucene
Version: 1.2
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: Other
Component: QueryParser
AssignedTo: lucene-dev [at] jakarta
ReportedBy: tlai [at] leversoft


Don't get me wrong, I did read the Parser Syntax, and understand that:
"Note: You cannot use a * or ? symbol as the first character of a search."
However, it would be nice to have this feature. I made the following
changes to QueryParser.jj, and it seems to work fine. I am not sure if there
are any side effects though. Can someone verify this?

Change from:

| <WILDTERM: <_TERM_START_CHAR>
(<_TERM_CHAR> | ( [ "*", "?" ] ))* >


To:

| <WILDTERM: (<_TERM_CHAR> | ( [ "*", "?" ] ))* >

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe [at] jakarta>
For additional commands, e-mail: <mailto:lucene-dev-help [at] jakarta>


brian at quiotix

Aug 28, 2002, 1:03 PM

Post #2 of 21 (354 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

On Wed, Aug 28, 2002 at 07:52:01PM -0000, bugzilla [at] apache wrote:
> Don't get me wrong, I did read the Parser Syntax, and understand that:
> "Note: You cannot use a * or ? symbol as the first character of a search."
> However, it would be nice to have this feature. I made the following
> changes to QueryParser.jj, and it seems to work fine. I am not sure if there
> are any side effects though. Can someone verify this?

I think this is a bad idea.

First of all, the query parser is a CONVENIENCE, not the only way to
build query objects. If the query parser language is too restrictive,
then build the query objects programmatically. It's not that hard.

There were reasons why the query language was designed this way. If
you think that's an error, first you need to lobby for your position
to change the design, THEN we can think about changing the parser.

Parsers are tricky. Small changes can have big, unexpected effects.
Let's make sure we want to do this first (which I think we don't), and
then we can look at the implementation.



Armbrust.Daniel at mayo

Aug 28, 2002, 2:06 PM

Post #3 of 21 (355 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I don't know much about the implementation details of the Query Parser.

From previous conversations on the list, I get the idea that allowing wildcards as the first letter of a search introduces a large performance hit.

But I also think that this feature should be implemented by a search engine, so that it is easily accessible. Even if it is not programmatically difficult to manually build the query, most beginners are going to use the parser, and then ask the question why doesn't this work. The prospect of building the query manually will sound difficult, and may discourage them from using Lucene.

So, if it can be implemented in such a way that you only take the performance hit when you put the wildcard as the first letter, I would like to see that implemented.

If it causes a hit on all searches, then I think there should be more than one query parser available - and then users could choose themselves if they want to pay the performance price of more powerful parser.

Just my thoughts as a user,

Dan




brian at quiotix

Aug 28, 2002, 3:03 PM

Post #4 of 21 (351 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

>But I also think that this feature should be implemented by a search
>engine, so that it is easily accessible. Even if it is not
>programmatically difficult to manually build the query, most beginners are
>going to use the parser, and then ask the question why doesn't this
>work. The prospect of building the query manually will sound difficult,
>and may discourage them from using Lucene.
>
>So, if it can be implemented in such a way that you only take the
>performance hit when you put the wildcard as the first letter, I would
>like to see that implemented.

This is a sensible-sounding argument, but contains some hidden assumptions
about your user base which can lead to very bad results in general.

Let's call a LUCENE DEVELOPER someone who understands the internals of
Lucene. (Such as me.)
We'll call an APP DEVELOPER someone who uses Lucene to build an
application. He understands the general issues involved in search and
retrieval. (Such as you.)
We'll call an APP USER someone who doesn't know anything about Java,
Lucene, programming, or anything, but knows how to use Google.

You are saying "I'm a savvy APP DEVELOPER, I know that certain search
patterns are expensive, but why should I be precluded from using
them? I'll be careful, and if I screw up, it's my problem."

That statement might be true if the universe of users of your app included
only APP DEVELOPERS. But the query parser is explicitly designed for APP
USERS. They don't know that certain classes of queries are much more
expensive. They don't even know what "expensive" means in this
context. So a user enters "*" into a search box, and it takes a really
long time to run. Maybe they assume something got hung up somewhere, and
they open _another_ browser window, and enter the same search. Now some
random user has innocently created a DoS attack on your system.

The Query Parser is a convenience for making common search options
available to APP USERS. As such, it _must_ be designed with the assumption
that the end user is an APP USER, not an APP DEVELOPER.

Your users are smarter than this? Great! Build your queries with the
query constructors. They're not hard to use. But just because your users
are smart enough to use a chainsaw without cutting off their legs, doesn't
mean that we should hand out chainsaws to APP USERS all over the world.



--
Brian Goetz
Quiotix Corporation
brian [at] quiotix Tel: 650-843-1300 Fax: 650-324-8032

http://www.quiotix.com




carlson at bookandhammer

Aug 29, 2002, 8:54 AM

Post #5 of 21 (356 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

Hi,

Is the rationale for why this is a "bad idea" mostly a performance
argument? So if you don't have to search through every term in the
index, then the results will return much faster, right?
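
[Editorial sketch] To make the cost argument concrete, here is a toy model, assuming the index keeps its terms in sorted order. The class `TermScanSketch` and its methods are hypothetical names for illustration, not Lucene classes. A trailing wildcard can seek to the literal prefix and stop at the first non-match, while a leading wildcard gets no help from the sort order and has to visit every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Toy model of a sorted term dictionary; an assumption, not Lucene's actual structure. */
final class TermScanSketch {
    /** Number of terms examined by the most recent lookup. */
    static int examined;

    /** Terms starting with prefix: seek to the prefix, stop at the first non-match. */
    static List<String> withPrefix(NavigableSet<String> terms, String prefix) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            examined++;
            if (!t.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            hits.add(t);
        }
        return hits;
    }

    /** Terms ending with suffix: sort order does not help, so every term is visited. */
    static List<String> withSuffix(NavigableSet<String> terms, String suffix) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        for (String t : terms) {
            examined++;
            if (t.endsWith(suffix)) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
            "apache", "index", "lucene", "lucid", "parser", "query", "search", "term"));
        System.out.println(withPrefix(terms, "luc") + " examined=" + examined);
        System.out.println(withSuffix(terms, "ene") + " examined=" + examined);
    }
}
```

In this model the prefix lookup touches only the matching terms plus one, whereas the suffix lookup grows linearly with the number of distinct terms in the index, which is the quantity a benchmark would need to vary.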

I understand the concern, but if the desired result is beneficial to
users, then we might want to explore it further with some benchmarks.
Or should we just say that it's a bad idea based on the inherent
issues with the design?

I would like to have benchmarks for a few reasons:
1) To be able to help resolve these kinds of questions
2) To provide people with performance benchmarks when evaluating Lucene.

What would be a reasonable performance benchmark to test this against?

1) CPU speed - Pentium III/800 MHz+? Pentium 4/1.5 GHz+? UltraSPARC
IIi/440 MHz+?
2) Index size (# terms) - 100K, 500K, 1M, 2M
   What does the index store - the terms, or the terms and data?
3) Query - single term, 5 terms (AND), 5 terms (OR), wildcard (end),
wildcard (start), wildcard (start and end)

Kelvin put out something a while ago on this.

Thoughts?

--Peter



Armbrust.Daniel at mayo

Aug 29, 2002, 9:19 AM

Post #6 of 21 (352 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I think that the only difference we have is that you see the Query Parser as a convenience for the app users, while I see it as a convenience for both the app users and the app developers. It's probably a good thing to ship Lucene with a query parser that doesn't allow searches that incur an expensive hit. However, it seems to lead to a bug report every few weeks....

For those of us app developers who need to support wildcard queries with wildcards anywhere in the term, I (as a lazy app developer) would like to be able to plug in a different query parser, one that has at least been checked for validity by those who know the Lucene internals and is hopefully supported as part of Lucene. I would be aware that this parser has worse performance on queries with leading wildcards because I was warned when I downloaded it, or something along those lines. Then I, as a developer, would take appropriate measures to make sure the users don't create a DoS attack on my system if this performance hit is significant on my index.

I understand, however, that as a Lucene developer this probably makes you cringe, since you would now have two parsers to support.

Given the number of times the question is asked, something probably should be changed... but I don't know which of the solutions proposed so far should be used (if any), since they all have significant downsides. As an app developer, I would have to lean toward "make it easy for me - aka make the query parser do the work". But these are all just my thoughts, and I don't actually have to write the query parser, so it's easy for me to say.

Dan



*****************************
Daniel C. Armbrust
Medical Informatics Research
Information Services
Mayo Clinic Rochester
*****************************





carlson at bookandhammer

Aug 29, 2002, 9:35 AM

Post #7 of 21 (354 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I think you're right that people want different options and we always
accept contributions, although they may not be officially supported.

Please create a QueryParser.jj which works for you and we can add it to
the contributions page. If we can get a set of Parsers which work for
people, maybe that will better steer what should be in the official
version.

Thanks

--Peter



cutting at lucene

Aug 29, 2002, 9:48 AM

Post #8 of 21 (351 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

Armbrust, Daniel C. wrote:
> For those of us app developers who need to support wildcard queries with wildcards anywhere in the term, I (as a lazy app developer) would like to be able to plug in a different query parser, one that has at least been checked for validity by those who know the Lucene internals and is hopefully supported as part of Lucene. I would be aware that this parser has worse performance on queries with leading wildcards because I was warned when I downloaded it, or something along those lines. Then I, as a developer, would take appropriate measures to make sure the users don't create a DoS attack on my system if this performance hit is significant on my index.
>
> I understand, however, that as a Lucene developer this probably makes you cringe, since you would now have two parsers to support.

Maybe the current query parser could be modified to accept a parameter
that determines the minimum number of characters that must occur in a
term before a wildcard. Its default could be one or two, but developers
could set it to zero if they want.

Doug




brian at quiotix

Aug 29, 2002, 10:25 AM

Post #9 of 21 (352 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

> However, it seems to lead to a bug report every few weeks....

Actually, there are very few bug reports for the query parser. But
there are a LOT of "I want it to work this way, because that would be
more convenient for MY application." I guess that means a lot of
people are using it :) And a lot of feature enhancement requests, some
of which are sensible requests.

> Given the number of times the question is asked, something probably
> should be changed...

Yes, probably the FAQ!

> but I don't know which solution that has been given so far should be
> used (if any) since they all have significant downsides.

There's a tremendous downside to having more than one query parser,
even those in "contributions", unless they are radically different
from each other. (I would welcome a query parser that took a
different approach to query specification -- but I am pretty resistant
to those that look a lot like the one we have already.) To name a few
risks, there is the confusion risk (new users pick the wrong one with
bad results, both for the user, and for the reputation of the project
as a whole), the support cost of maintaining more than one, the
obligation to document each and their differences, etc.

> As an app developer, I would have to lean toward "make it easy for
> me - aka make the query parser do the work".

Understood. It's always attractive to push your work onto someone
else, no doubt. And as a project maintainer, I don't even mind this,
when I think that work is beneficial to the rest of the community as well.
And in this particular case, it's not even "work", as the modification
is trivial. However, here's an example where we, as a group, thought
that this particular "benefit" is harmful to the community. And, I
think if you hang out here long enough, you'll think that too.

What you're saying is "I'm comfortable taking the risks of using the
chainsaw." And that's fine -- please be careful. But as maintainers
(and stewards), we have to be a little more careful with tools that
are designed for use by users who know nothing about search engine
internals, which is what the query parser is.

So feel free to take the query parser and modify it for your needs,
and call it DansModifiedQueryParser. But that doesn't make it
appropriate for calling it the default query parser.




Armbrust.Daniel at mayo

Aug 29, 2002, 2:59 PM

Post #10 of 21 (353 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I think that was what started the discussion... panhenryk posted it to the mailing list on Monday (subject - wildcard preceding term solution), and then (I'm assuming this is the same person but it may not be) tlai [at] leversoft submitted this bug report about it with his changes attached. Both posts asked someone else to verify that his changes didn't break something else not obvious to him.


*****************************
Daniel C. Armbrust
Medical Informatics Research
Information Services
Mayo Clinic Rochester
*****************************


-----Original Message-----
From: Peter Carlson [mailto:carlson [at] bookandhammer]
Sent: Thursday, August 29, 2002 11:35 AM
To: Lucene Developers List
Subject: Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the
first character of a search?


I think you're right that people want different options and we always
accept contributions, although they may not be officially supported.

Please create a QueryParser.jj which works for you and we can add it to
the contributions page. If we can get a set of Parsers which work for
people, maybe that will better steer what should be in the official
version.

Thanks

--Peter




brian at quiotix

Aug 29, 2002, 3:09 PM

Post #11 of 21 (355 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

>I think that was what started the discussion... panhenryk posted it to the
>mailing list on Monday (subject - wildcard preceding term solution), and
>then (I'm assuming this is the same person but it may not be)
>tlai [at] leversoft submitted this bug report about it with his changes
>attached. Both posts asked someone else to verify that his changes didn't
>break something else not obvious to him.

Moving up a level...

This is one of the big challenges facing open-source projects. The code is
out there, so people change it to suit their needs -- which is great. Then
they want to submit those changes back to the project. Sometimes this is
great -- but sometimes it's not. People very rarely stop to think "Is this
change good for the majority of users?" People are very quick to assume
that all the other users are a lot like them. And they don't even realize
they're making this assumption.

In other words, just because a change CAN be made easily, doesn't mean that
it's a good idea. The discussion of whether it's a good idea should ALWAYS
precede any attempt to submit the change.





cutting at lucene

Aug 29, 2002, 3:40 PM

Post #12 of 21 (355 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

Did my suggestion not make sense?

I think we can make everyone happy here. By adding a parameter to the
existing query parser we can:
1. Keep things so that the default behaviour is not to permit initial
wildcards.
2. Make it so that developers who want to permit initial wildcards
can easily do so.
3. Keep a single version of the query parser.

Brian, do you have a problem with this approach? Does anyone else?

If not, then it's just a SMOP.

Doug




carlson at bookandhammer

Aug 29, 2002, 4:23 PM

Post #13 of 21 (351 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I think this is a great idea.

--Peter



brian at quiotix

Aug 29, 2002, 4:25 PM

Post #14 of 21 (352 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

>I think we can make everyone happy here. By adding a parameter to the
>existing query parser we can:
> 1. Keep things so that the default behaviour is not to permit initial
> wildcards.
> 2. Make it so that developers who want to permit initial wildcards can
> easily do so.
> 3. Keep a single version of the query parser.

Well, not exactly. You can't conditionally define parsing rules.





dmitrys at earthlink

Aug 29, 2002, 4:26 PM

Post #15 of 21 (352 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

A SMOP? :)
(+1 on the idea)




cutting at lucene

Aug 29, 2002, 6:16 PM

Post #16 of 21 (353 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

Brian Goetz wrote:
> Well, not exactly. You can't conditionally define parsing rules.

That's why I said it's a SMOP (small matter of programming). The rule
that parses wildcards needs to check where the first wildcard in a term
is, and if it's too near the front, throw an exception. Or something
like that.
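
[Editorial sketch] The check described above could look roughly like the following. This is a minimal illustration under stated assumptions, not the parser's actual code: `WildcardPrefixCheck`, `firstWildcard`, `check`, and the `minPrefix` parameter are hypothetical names, the exception type is an arbitrary choice, and escaped wildcard characters are ignored for brevity.

```java
/** Hypothetical helper: rejects terms whose first wildcard is too near the front. */
final class WildcardPrefixCheck {

    /** Returns the index of the first '*' or '?' in term, or -1 if there is none. */
    static int firstWildcard(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns term unchanged if at least minPrefix literal characters precede
     * its first wildcard (or if it has no wildcard); otherwise throws.
     */
    static String check(String term, int minPrefix) {
        int pos = firstWildcard(term);
        if (pos >= 0 && pos < minPrefix) {
            throw new IllegalArgumentException(
                "Wildcard at position " + pos + " in '" + term
                + "' needs at least " + minPrefix + " leading characters");
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(check("luc*", 1));  // accepted: three literal characters first
        System.out.println(check("*ene", 0));  // accepted only because the minimum is zero
        try {
            check("*ene", 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a default of one or two, the shipped behaviour stays as restrictive as the grammar rule is today, and a developer who wants leading wildcards sets the value to zero explicitly.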

Doug





brian at quiotix

Aug 29, 2002, 7:07 PM

Post #17 of 21 (351 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

>That's why I said it's a SMOP (small matter of programming). The rule
>that parses wildcards needs to check where the first wildcard in a term
>is, and if it's too near the front, throw an exception. Or something like
>that.

Right. I wasn't suggesting it was hard, but I still think it's not a
terribly good idea. It's just way too easy for some cowboy app developer to
say "why wouldn't I want this flexibility" (developers _love_ flexibility)
without fully understanding the implications, and loose a
DoS-waiting-to-happen on their user base. Some naive user comes along, a
year after this app developer who set the knob to zero has left, and then
management is left with the perception that "Lucene is unstable."

Now, in the past we talked about implementing a policy in the wildcard
query classes about "how wild" a query term could be, as an element of
database-wide policy. I think that questions of this sort belong in the
query classes, and not in the query parser anyway. But that discussion
petered out without any resolution.

I'd be interested in reopening the "database-wide policy choices"
discussion, for things like this, default choice of tokenizer (so that it's
harder to make the mistake of tokenizing with one analyzer and searching
with another), etc.





carlson at bookandhammer

Aug 29, 2002, 7:29 PM

Post #18 of 21 (356 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

From a philosophical point of view, I don't believe in making something
really hard just because it may be bad. It may also be a requirement
and may work out great for certain applications. Also, I believe in
teaching people through documentation and support. I understand that a
developer may try to solve a user's problem and set a parameter that
may make things slow, but I also believe that most developers will read
about it (at least to find out how to do it and what the different
parameters mean). In this documentation can be a BIG DISCLAIMER
teaching people about the potential for a really big hit. This may mean
that the developer recommends a faster machine, or gives justification
for why it may be slow.

What do other people think about this?


This also brings up the idea of a Lucene properties file.

I like the idea of having WildcardQuery have a definable limit of terms
that it returns.
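
[Editorial sketch] One way to picture such a limit, assuming a wildcard query is evaluated by expanding the pattern against the index's terms: stop as soon as the pattern has matched more than a configured number of terms. Everything here (`CappedWildcardExpansion`, `toPattern`, `expand`, `maxTerms`, the exception type) is hypothetical and does not describe WildcardQuery's real implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of capping how many terms a wildcard may expand to. */
final class CappedWildcardExpansion {

    /** Translates '*' and '?' wildcards into an equivalent regular expression. */
    static Pattern toPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Returns the matching terms, or throws once more than maxTerms terms match. */
    static List<String> expand(Iterable<String> terms, String wildcard, int maxTerms) {
        Pattern pattern = toPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                if (hits.size() == maxTerms) {
                    throw new IllegalStateException(
                        "'" + wildcard + "' matches more than " + maxTerms + " terms");
                }
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "apache", "index", "lucene", "lucid", "parser", "query", "search", "term");
        System.out.println(expand(terms, "luc*", 2));
        try {
            expand(terms, "*", 3);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that this caps the size of the result, not the cost of the scan: a leading wildcard that matches few terms still has to look at every term before it finishes.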

--Peter





brian at quiotix

Aug 29, 2002, 8:18 PM

Post #19 of 21 (353 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

> From a philosophical point of view, I don't believe in making something
> really hard just because it may be bad.

Nor do I, in the general case.

>It may also be a requirement and may work out great for certain applications.

And the standard query construction classes, which are designed for use by
App Developers, not App Users, are there for exactly that case.

>Also, I believe in teaching people through documentation and support. I
>understand that a developer may try to solve a user's problem and set a
>parameter that may make things slow, but I also believe that most
>developers will read about it (at least to find out how to do it and what
>the different parameters mean).

But the problem is that there are two levels of indirection here -- Lucene
Developers to App Developers, App Developers to App Users. We can teach
our customers -- the App Developers -- through documentation and support,
but can we teach them to teach? Can they even do so, given that search
interfaces are generally exposed on web pages, and web pages don't come
with documentation? And those users might not even speak the same language
as the developer? Actually, there's also a third level of indirection --
app developer -> webmaster -> app user. What are the chances that any sort
of disclaimers or documentation will filter to the end of that chain?

>In this documentation can be a BIG DISCLAIMER teaching people about the
>potential for a really big hit. This may mean that the developer
>recommends a faster machine, or gives justification for why it may be slow.

Right, but this documentation, if read at all, is extremely unlikely to
make it to the end user. The end user might be halfway across the world,
checking out your website for the first time.

Let's not lose sight of the whole point of the query parser -- a safe tool
for making it easy for uneducated users to execute common forms of queries,
just like they can at Yahoo or other search engines.

If you want to build a chainsaw-without-a-safety query parser, and call it
VeryDangerousDontTryThisAtHomeQueryParser, we can discuss that. But
putting a "safe/not-quite-so-safe" switch on the parser strikes me as a bad
idea.
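
For readers following the thread, the kind of switch being debated here (reject a term if its first wildcard sits too near the front, with a configurable threshold) could be sketched as below. This is an illustration only, not a proposal from this message: WildcardPrefixPolicy and its methods are hypothetical names, nothing here is part of QueryParser, and escaped wildcard characters are deliberately not handled.

```java
// Hypothetical sketch: not part of Lucene's QueryParser.
final class WildcardPrefixPolicy {

    private final int minPrefixLength;

    /** A minimum of 0 allows a leading wildcard; 1 requires one literal character first. */
    WildcardPrefixPolicy(int minPrefixLength) {
        if (minPrefixLength < 0) {
            throw new IllegalArgumentException("minPrefixLength must be >= 0");
        }
        this.minPrefixLength = minPrefixLength;
    }

    /** Returns the index of the first '*' or '?', or -1 if the term has none. */
    static int firstWildcard(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return -1;
    }

    /** Returns the term unchanged if acceptable; throws if the wildcard is too near the front. */
    String check(String term) {
        int position = firstWildcard(term);
        if (position >= 0 && position < minPrefixLength) {
            throw new IllegalArgumentException(
                    "Wildcard in '" + term + "' needs at least " + minPrefixLength
                            + " leading literal character(s)");
        }
        return term;
    }

    public static void main(String[] args) {
        WildcardPrefixPolicy policy = new WildcardPrefixPolicy(1);
        System.out.println(policy.check("te*t"));
    }
}
```

Setting the threshold to zero corresponds to the "knob set to zero" case raised earlier in the thread: every term, including "*test", is accepted.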

We went through this with range queries, too. My philosophy on the query
parser is "anything that requires documentation must not belong in it",
since 99.5% of the users will not only not read the documentation, but not
realize any exists.



--
Brian Goetz
Quiotix Corporation
brian [at] quiotix Tel: 650-843-1300 Fax: 650-324-8032

http://www.quiotix.com




carlson at bookandhammer

Aug 29, 2002, 9:48 PM

Post #20 of 21 (354 views)
Permalink
Re: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I agree that end users (or app users) will not worry about whether a query
is expensive; they are just trying to get done what they need to get
done.

But, I would also say that we should not hinder app developers from
providing the functionality their app users need. If a query takes 5
min, but it is the most efficient way to get the information they need,
don't you think that this is the app developer's choice, since they are
supporting the app user?

I think it is the responsibility of the app developer, not the Lucene
developer, to provide the infrastructure required to meet a given set
of requirements. I would agree that it is the Lucene Developer's job to
inform the app developer about the tool they are using, but I would
also say that since we don't know what problem an app developer is
trying to solve, and what resources they have to solve it, Lucene
developers should try to provide tools to solve it, with the
appropriate information about using those tools.

If many app developers want functionality, why wouldn't we offer a
configuration to solve their problem? Their response time may meet
their requirements. If they have the world as users and must have a
quick response time, then the app developers should make the decision
to not use an expensive query.

This happens all the time with other development environments. For
example, the default for Tomcat is to check whether a JSP page has
changed and recompile it on the fly. This feature is
great for development, but slows down a production system. They did,
however, include the feature as a default.

Also, you are making an argument that we shouldn't include
functionality, but the app developer will implement the functionality
(as has been shown) and potentially not know the ramifications of what
they have done (as has also potentially been shown). If we provide a
configuration for an expensive query and provide details about why it's
expensive, then this will put more information in the hands of the app
developer to either include it or not in their query syntax. It won't
just be an unimplemented feature, but a feature that should be used
with caution.

The question is not whether app developers will implement expensive query
features, but, if they do, how well they will understand what they have
developed and the limitations of Lucene.

--Peter



On Thursday, August 29, 2002, at 08:18 PM, Brian Goetz wrote:

> We went through this with range queries, too. My philosophy on the
> query parser is "anything that requires documentation must not belong
> in it", since 99.5% of the users will not only not read the
> documentation, but not realize any exists.




Armbrust.Daniel at mayo

Sep 4, 2002, 6:01 AM

Post #21 of 21 (354 views)
Permalink
RE: [Bug 12137] New: - Can '*' or '?' symbol be used as the first character of a search? [In reply to]

I would have to agree with Peter on this issue. Also, Doug's solution makes a lot of sense to me. If I had the right to vote, that would be the solution I would choose.


*****************************
Daniel C. Armbrust
Medical Informatics Research
Information Services
Mayo Clinic Rochester
*****************************


