
Mailing List Archive: Lucene: Java-User

Boolean expression for no terms OR matching a wildcard

 

 



ronchalant at gmail

Jul 10, 2008, 2:52 PM

Post #1 of 9 (594 views)
Boolean expression for no terms OR matching a wildcard

I need to perform a query for a term that may or may not have values,
and I need to check for the conditions where either no terms are
indexed OR any and ALL indexed terms match a wildcard.

For example, say the following values were indexed as terms in the
field "myfield" in the three documents:

1) terms "abc123" and "abcdef123"
2) terms "abc123", "def123" and "abcdef123"
3) no terms

I want my query with a wildcard search of "+myfield:abc*123" to match
on both 1 and 3 but NOT 2.

Is this possible?

Thanks,
- Ron

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


ronchalant at gmail

Jul 14, 2008, 6:04 AM

Post #2 of 9 (553 views)
Re: Boolean expression for no terms OR matching a wildcard

Can I assume that since nobody replied to this that there's no way to
perform this kind of search? What I think I need is two different
types of conditions:

1) a wildcard conditional that is forced to match against all indexed
values for a field
2) a conditional that matches when NO values at all are indexed for a field

-Ron




hossman_lucene at fucit

Jul 15, 2008, 1:47 PM

Post #3 of 9 (543 views)
Re: Boolean expression for no terms OR matching a wildcard

Assuming I understand your question: the fact that your first clause is a
wildcard query is irrelevant. To generalize your request, you want a way to
query for all docs which either match some sub query, or have no terms in
the field at all. To find all docs with no terms for a given field, you
need to search for all docs (MatchAllDocsQuery), then exclude the docs that
have a value for that field, which is easy to do using a
ConstantScoreRangeQuery where the lower and upper bounds are both null.

essentially you want something like...

myfield:abc*123 (+MatchAllDocsQuery -myfield:[* TO *])

(except you'll have to construct the MatchAllDocsQuery and the two
BooleanQueries yourself)


-Hoss
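
The boolean shape of the expression above can be checked on the three example documents from the first post. This is a minimal plain-Java sketch of the logic only, not Lucene code: the class and method names are hypothetical, and the regex `abc.*123` stands in for the wildcard `abc*123`.

```java
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java model of the boolean logic only; no Lucene classes are involved.
final class NoTermsOrWildcard {
    // Stands in for myfield:abc*123 (true if ANY term of the document matches).
    static boolean anyTermMatches(List<String> terms, Pattern p) {
        for (String t : terms) {
            if (p.matcher(t).matches()) return true;
        }
        return false;
    }

    // Stands in for (+MatchAllDocsQuery -myfield:[* TO *]):
    // every document, minus those that have any term in the field.
    static boolean hasNoTerms(List<String> terms) {
        return terms.isEmpty();
    }

    // The two optional clauses are OR'ed together.
    static boolean matches(List<String> terms, Pattern p) {
        return anyTermMatches(terms, p) || hasNoTerms(terms);
    }
}
```

With this model, documents 1 and 3 match, but so does document 2, because one matching term is enough to satisfy the wildcard clause. The expression covers the "no terms" half of the question, not the "ALL terms match" half.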




ronchalant at gmail

Jul 15, 2008, 3:14 PM

Post #4 of 9 (544 views)
Re: Boolean expression for no terms OR matching a wildcard

Thanks Chris (or if you prefer, Hoss) - I will definitely try that for
matching docs with no terms, but one of the problems I'm having is that
I'm indexing multiple terms for one field and I need ALL the terms to
match the wildcard.

Maybe this is easier ... suppose what I'm indexing is a phone number,
and there are multiple phone numbers for what I'm indexing under the
same field (phone) and I want the wildcard query to match only records
that have either no phone numbers at all OR where ALL phone numbers
are in a specific area code (e.g. 800* would match all in the 800 area
code).

I would want something like

+phone:800*

but I want this query to ALSO exclude any hits that have any numbers
that DON'T match the wildcard ... so if a record has an 800* number
AND a 900* number I don't want it to be in the results.

-Ron




hossman_lucene at fucit

Jul 18, 2008, 3:00 PM

Post #5 of 9 (502 views)
Re: Boolean expression for no terms OR matching a wildcard

: Maybe this is easier ... suppose what I'm indexing is a phone number, and
: there are multiple phone numbers for what I'm indexing under the same field
: (phone) and I want the wildcard query to match only records that have either
: no phone numbers at all OR where ALL phone numbers are in a specific area code
: (e.g. 800* would match all in the 800 area code).

I can't think of any way to accomplish the second part of your query.
Specifically, given the following records...

Doc1: field1:AAA, field1:Aaa, field1:Bb, field1:C, field2:X, field3:Y
Doc2: field1:AAA, field1:Aaa, field1:Aa, field2:Z

...i can't think of any type of query like field1:A* which would match
Doc2 but not Doc1 (because there are other field1 values that do not start
with 'A')



-Hoss




eksdev at yahoo

Jul 18, 2008, 3:24 PM

Post #6 of 9 (502 views)
Re: Boolean expression for no terms OR matching a wildcard

Use an analyzer that detects your "ALL match something" condition at index time, if that is possible at all...
e.g. "800123456 80034543534 80023423423" -> 800

Then you put it in an ALL_MATCH field and match this condition against it... If this prefix needs to be variable, you could extract all matching prefixes to this field and make your query work like "ALL_MATCH:800" and care not for the rest :) Then you would not need field1 at all for these queries.

Were you looking for something like this, or do you need a "query solution"?
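
One way to read the ALL_MATCH suggestion is to compute, at index time, the longest prefix shared by all of a document's values and index every prefix of it as a token. The sketch below shows only that computation in plain Java. The class and method names are hypothetical, wiring the tokens into an analyzer or a document field is left out, and the approach only covers prefix-style wildcards such as 800*, not patterns like abc*123.

```java
import java.util.ArrayList;
import java.util.List;

// Computes the tokens that would go into the ALL_MATCH field for one document.
final class AllMatchPrefixes {
    // Longest prefix shared by every value; "" if the list is empty or nothing is shared.
    static String commonPrefix(List<String> values) {
        if (values.isEmpty()) return "";
        String prefix = values.get(0);
        for (String v : values) {
            int i = 0;
            int max = Math.min(prefix.length(), v.length());
            while (i < max && prefix.charAt(i) == v.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    // Every non-empty prefix of the common prefix, shortest first.
    static List<String> allMatchTokens(List<String> values) {
        String common = commonPrefix(values);
        List<String> tokens = new ArrayList<String>();
        for (int len = 1; len <= common.length(); len++) {
            tokens.add(common.substring(0, len));
        }
        return tokens;
    }
}
```

For the three numbers in the example, the common prefix is "800", so the tokens are "8", "80" and "800", and an exact term query such as ALL_MATCH:800 would hit the document. A document holding both an 800 and a 900 number shares no prefix, gets no tokens, and so never matches.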





ronchalant at gmail

Jul 20, 2008, 5:29 AM

Post #7 of 9 (476 views)
Re: Boolean expression for no terms OR matching a wildcard

A query solution is preferable.. but I can programmatically filter my
results after the fact, it just seems like something that the Lucene
team should consider adding.. I think it would only have value for
wildcard queries, but nonetheless it would have some value I think..

-Ron
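
Filtering after the fact could look roughly like the following. It is a sketch under assumptions: the field values are assumed to be stored so they can be read back for each hit, the map of document id to values is a hypothetical stand-in for however the hits are collected, and the candidate set is assumed to already include documents with no values (for example via the "no terms" clause suggested earlier in the thread).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Keeps only the hits whose field values ALL match the pattern.
final class AllValuesPostFilter {
    // True when every value matches; vacuously true when there are no values.
    static boolean allMatch(List<String> values, Pattern p) {
        for (String v : values) {
            if (!p.matcher(v).matches()) return false;
        }
        return true;
    }

    // hits maps a (hypothetical) document id to the values read back for the field.
    static List<Integer> filter(Map<Integer, List<String>> hits, Pattern p) {
        List<Integer> kept = new ArrayList<Integer>();
        for (Map.Entry<Integer, List<String>> e : hits.entrySet()) {
            if (allMatch(e.getValue(), p)) kept.add(e.getKey());
        }
        return kept;
    }
}
```

Applied to the three example documents with the regex `abc.*123`, this keeps documents 1 and 3 and drops document 2, which is the result asked for in the first post. The cost is that every candidate hit has to be loaded and inspected.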




sarowe at syr

Jul 21, 2008, 12:00 PM

Post #8 of 9 (460 views)
RE: Boolean expression for no terms OR matching a wildcard

Hi Ronald,

Caveat - I haven't tested this, but:

With a RegexQuery <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/regex/RegexQuery.html>, I think you can do something like (using your example):

+abc*123 -{Regex}(?!abc.*123$)

This query would include all documents that have terms that match the wildcard "abc*123", and exclude all documents containing terms that don't match regex "^abc.*123$".

Note that the Lucene QueryParser doesn't handle regex queries (and if it did, the syntax would probably be different than "{Regex}" - this was intended solely for purposes of exposition). As a result, you would have to manually construct the RegexQuery and combine it using BooleanQuery clauses with your wildcard query.

The "(?!...)" syntax is a negative lookahead assertion - this is a Java 1.4+ java.util.regex.Pattern feature. Note that wildcard expressions are easily converted to regular expressions programmatically by substituting "*"->".*" and "?"->".", and then adding the "$" anchor. The "^" anchor is not required with RegexQuery, because when using the java.util.regex engine (the default engine), j.u.r.Matcher.lookingAt() is used; from <http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Matcher.html#lookingAt()>:

Attempts to match the input sequence, starting at the
beginning, against the pattern.

Like the matches method, this method always starts at the
beginning of the input sequence; unlike that method, it
does not require that the entire input sequence be matched.

Caveat #2: RegexQuery is relatively slow, since *all* index terms have to be tested against the regular expression, so you may have to use some other method if query response time turns out to be a problem.

Steve
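
The wildcard-to-regex conversion and the negative lookahead can be tried with java.util.regex alone. The following is a sketch, not part of Lucene: the class and method names are hypothetical, and quoting the literal characters with Pattern.quote is an extra precaution that the post does not mention. It tests terms with Matcher.lookingAt(), the call the post says the default engine uses.

```java
import java.util.regex.Pattern;

// Builds the "term does NOT match the wildcard" regex described in the post.
final class WildcardToRegex {
    // "*" becomes ".*", "?" becomes ".", literal runs are quoted, "$" is appended.
    static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) regex.append(Pattern.quote(literal.toString()));
        return regex.append('$').toString();
    }

    // Negative lookahead: succeeds at the start of a term only if the wildcard does not match it.
    static Pattern nonMatching(String wildcard) {
        return Pattern.compile("(?!" + toRegex(wildcard) + ")");
    }

    // lookingAt() anchors at the start of the input, so no "^" is needed.
    static boolean isNonMatchingTerm(String term, String wildcard) {
        return nonMatching(wildcard).matcher(term).lookingAt();
    }
}
```

For the wildcard abc*123, the terms "abc123" and "abcdef123" are not flagged, while "def123" is, so a prohibited clause built on this pattern would exclude document 2 from the original example.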



ronchalant at gmail

Jul 21, 2008, 12:05 PM

Post #9 of 9 (461 views)
Re: Boolean expression for no terms OR matching a wildcard

Thanks Steve, this looks promising even if it doesn't perform the
best. I'll run some tests on what produces the best results.

-Ron


