
Mailing List Archive: Lucene: Java-User

Match span of capitalized words

 

 



ihasmax at gmail

Feb 3, 2010, 5:57 PM

Post #1 of 4

Hi,
I would like to do a search for "Microsoft Windows" as a span, but not match
if words before or after "Microsoft Windows" are upper cased.

For example, I want this to match: another crash for Microsoft Windows today
But not this: another crash for Microsoft Windows Server today

Is this possible? My first attempt started with the SpanRegexQuery from the
regex contrib package, but I can't figure out how to put in a term I do want
to match but don't want to include in the final highlighting match. Does
that make sense?

My example (using WhitespaceAnalyzer since I care about case):

SpanRegexQuery srq1 = new SpanRegexQuery(new Term("contents", "Chase"));
SpanRegexQuery srq2 = new SpanRegexQuery(new Term("contents", "Bank[\\.]*"));
SpanRegexQuery srq3 = new SpanRegexQuery(new Term("contents", "[^A-Z]*"));


Thanks,
Max


gsingers at apache

Feb 5, 2010, 7:18 AM

Post #2 of 4
Re: Match span of capitalized words

On Feb 3, 2010, at 8:57 PM, Max Lynch wrote:

> Hi,
> I would like to do a search for "Microsoft Windows" as a span, but not match
> if words before or after "Microsoft Windows" are upper cased.
>
> For example, I want this to match: another crash for Microsoft Windows today
> But not this: another crash for Microsoft Windows Server today
>
> Is this possible? My first attempt started with the SpanRegexQuery from the
> regex contrib package, but I can't figure out how to put in a term I do want
> to match but don't want to include in the final highlighting match. Does
> that make sense?
>
> My example (using WhitespaceAnalyzer since I care about case):
>
> SpanRegexQuery srq1 = new SpanRegexQuery(new Term("contents", "Chase"));
> SpanRegexQuery srq2 = new SpanRegexQuery(new Term("contents", "Bank[\\.]*"));
> SpanRegexQuery srq3 = new SpanRegexQuery(new Term("contents", "[^A-Z]*"));

I'm not sure it supports it, but I wonder if you could use a negative lookahead assertion? Most regex languages support it.

-Grant


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


sarowe at syr

Feb 5, 2010, 2:23 PM

Post #3 of 4
RE: Match span of capitalized words

Hi Max,

On 02/05/2010 at 10:18 AM, Grant Ingersoll wrote:
> On Feb 3, 2010, at 8:57 PM, Max Lynch wrote:
> > Hi, I would like to do a search for "Microsoft Windows" as a span, but
> > not match if words before or after "Microsoft Windows" are upper cased.
> >
> > For example, I want this to match: another crash for Microsoft Windows
> > today But not this: another crash for Microsoft Windows Server today
> >
> > Is this possible? My first attempt started with the SpanRegexQuery
> > from the regex contrib package, but I can't figure out how to put in a
> > term I do want to match but don't want to include in the final
> > highlighting match. Does that make sense?
> >
> > My example (using WhitespaceAnalyzer since I care about case):
> >
> > SpanRegexQuery srq1 = new SpanRegexQuery(new Term("contents", "Chase"));
> > SpanRegexQuery srq2 = new SpanRegexQuery(new Term("contents", "Bank[\\.]*"));
> > SpanRegexQuery srq3 = new SpanRegexQuery(new Term("contents", "[^A-Z]*"));
>
> I'm not sure it supports it, but I wonder if you could use a negative
> lookahead assertion? Most regex languages support it.

I don't think this would work, since the input to a SpanRegexQuery regex is a single Term; following Terms are not included in the input.
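
To illustrate the point above, here is a minimal plain-Java sketch. It does not use Lucene; the class `PerTermRegexDemo` and the helper `anyTermMatches` are hypothetical names, and the sketch assumes the regex must match a whole term. It applies the same negative-lookahead pattern once to the full text and once to each whitespace-separated token in isolation, which is roughly how a term-level regex sees the index.

```java
import java.util.regex.Pattern;

class PerTermRegexDemo {
    // Hypothetical helper: test the regex against each whitespace-separated
    // token on its own, so the pattern never sees neighbouring tokens.
    static boolean anyTermMatches(String text, String regex) {
        Pattern p = Pattern.compile(regex);
        for (String term : text.trim().split("\\s+")) {
            if (p.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "another crash for Microsoft Windows Server today";
        // "Windows" not followed by a space and an uppercase letter.
        String lookahead = "Windows(?! \\p{Lu})";

        // Whole string: the lookahead sees " Server" and rejects the match.
        System.out.println(Pattern.compile(lookahead).matcher(doc).find());

        // Per term: nothing follows "Windows" inside its own token,
        // so the negative lookahead passes trivially.
        System.out.println(anyTermMatches(doc, lookahead));
    }
}
```

The first call prints `false` and the second prints `true`: once the text is split into terms, the lookahead has nothing to look at, so it cannot filter on the following word.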

I *think* you can get what you want using SpanNotQuery - something like the following, using your "Microsoft Windows" example:

SpanNot:
    include:
        SpanNear(in-order=true, slop=0):
            SpanTerm: "Microsoft"
            SpanTerm: "Windows"
    exclude:
        SpanNear(in-order=true, slop=0):
            SpanTerm: "Microsoft"
            SpanTerm: "Windows"
            SpanRegex: "^\\p{Lu}.*"

Steve
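
The include/exclude structure in the post above can be modelled in plain Java to see why it gives the desired result. This is only a sketch of the logic, not Lucene's API: the class `SpanNotSketch` and its methods are hypothetical, tokens are split on whitespace (mirroring the WhitespaceAnalyzer mentioned in the original question), and an include span is dropped when it overlaps any exclude span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SpanNotSketch {
    // Same regex as in the post: a token that starts with an uppercase letter.
    static final Pattern UPPER_START = Pattern.compile("^\\p{Lu}.*");

    // Spans [start, end) where the phrase occurs in order with slop 0.
    // If requireUpperNext is true, the span also covers one following token,
    // which must start with an uppercase letter (the "exclude" clause).
    static List<int[]> spans(String[] tokens, String[] phrase, boolean requireUpperNext) {
        List<int[]> out = new ArrayList<>();
        int len = phrase.length + (requireUpperNext ? 1 : 0);
        for (int i = 0; i + len <= tokens.length; i++) {
            boolean ok = true;
            for (int j = 0; j < phrase.length && ok; j++) {
                ok = tokens[i + j].equals(phrase[j]);
            }
            if (ok && requireUpperNext) {
                ok = UPPER_START.matcher(tokens[i + phrase.length]).matches();
            }
            if (ok) {
                out.add(new int[] {i, i + len});
            }
        }
        return out;
    }

    // Start positions of include spans that overlap no exclude span.
    static List<Integer> matchStarts(String text, String... phrase) {
        String[] tokens = text.trim().split("\\s+");
        List<int[]> exclude = spans(tokens, phrase, true);
        List<Integer> kept = new ArrayList<>();
        for (int[] inc : spans(tokens, phrase, false)) {
            boolean overlaps = false;
            for (int[] exc : exclude) {
                if (inc[0] < exc[1] && exc[0] < inc[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(inc[0]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts(
                "another crash for Microsoft Windows today", "Microsoft", "Windows"));
        System.out.println(matchStarts(
                "another crash for Microsoft Windows Server today", "Microsoft", "Windows"));
    }
}
```

The first call prints `[3]` (the phrase starts at token 3) and the second prints `[]`. Note that the structure as posted only tests the token after the phrase; the "before" case from the original question would presumably need a mirrored exclude clause with the regex term placed first.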





ihasmax at gmail

Feb 5, 2010, 3:06 PM

Post #4 of 4
Re: Match span of capitalized words

> I *think* you can get what you want using SpanNotQuery - something like the
> following, using your "Microsoft Windows" example:
>
> SpanNot:
>     include:
>         SpanNear(in-order=true, slop=0):
>             SpanTerm: "Microsoft"
>             SpanTerm: "Windows"
>     exclude:
>         SpanNear(in-order=true, slop=0):
>             SpanTerm: "Microsoft"
>             SpanTerm: "Windows"
>             SpanRegex: "^\\p{Lu}.*"
>
> Steve

This worked great, thank you guys!

-Max
