Mailing List Archive: Lucene: Java-User

Pattern Analyzer

 

 



dseltzer at tveyes

Jul 12, 2012, 11:20 AM

Post #1 of 3 (168 views)
Pattern Analyzer

Hello,

I have a search project which uses the Lucene PatternAnalyzer for its
text/query analysis.

At the moment it's configured like so:
analyzer = new PatternAnalyzer(Version.LUCENE_35, Pattern.compile("\\s+"),
true, null);

My goal here was to split words based on spaces and make things case
insensitive.

Thinking about this, however, I probably want to be a little more
sophisticated: I'd like to ignore punctuation that occurs at the beginning
or end of a word.

Is this simply a matter of writing a regex which treats those cases the
same as a space?

Would I use something like this:
analyzer = new PatternAnalyzer(Version.LUCENE_35,
Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}"), true, null);

Thanks so much!

Dave
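
A quick way to check a candidate delimiter pattern is to run it through
plain java.util.regex before handing it to PatternAnalyzer. The sketch
below is only an illustration, not Lucene code: it assumes the analyzer
treats each pattern match as a delimiter, roughly the way String.split
does. The class name, the tokens() helper, and the EDGE_PUNCT alternative
are hypothetical names introduced here. It suggests that the proposed
pattern eats a word character on each side, because \w is part of the
delimiter match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class DelimiterPatternDemo {
    // Pattern proposed in the post: \w is part of each delimiter match,
    // so the word character next to the punctuation is consumed too.
    static final Pattern PROPOSED =
            Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}");

    // Hypothetical alternative (an assumption, not from the thread):
    // punctuation counts as a delimiter only next to whitespace or at
    // the very start or end of the input.
    static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}*\\s+\\p{Punct}*|\\p{Punct}+$");

    // Lowercase the text, split on the delimiter, and drop empty pieces.
    static List<String> tokens(Pattern delimiter, String text) {
        List<String> out = new ArrayList<>();
        for (String piece : delimiter.split(text.toLowerCase(Locale.ROOT))) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(PROPOSED, "Hello, world!"));   // [hell, worl]
        System.out.println(tokens(EDGE_PUNCT, "Hello, world!")); // [hello, world]
        System.out.println(tokens(EDGE_PUNCT, "(it's state-of-the-art.)"));
        // [it's, state-of-the-art]
    }
}
```

Punctuation inside a word (the apostrophe and hyphens above) is left
alone by the alternative, which matches the stated goal of ignoring only
leading and trailing punctuation.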

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Jul 13, 2012, 5:53 AM

Post #2 of 3 (161 views)
Re: Pattern Analyzer

Sure, you can do it that way. But first I'd look over the zillion
tokenizers and filters that are available and string together the ones
that best suit your need. For instance, WhitespaceTokenizer and
PatternReplaceFilter might make your regex much easier, since the
PatternReplaceFilter gets just the whitespace-delimited tokens to operate
on. You can hook an arbitrary number of filters into your chain, so you
could add LowerCaseFilter and so on.

But unless your case is pretty unusual, I'd claim just using the pre-built
Tokenizers and Filters will probably work for you, or at least I'd check
that out first.

Best
Erick
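
The chain described above can be approximated in plain Java to see why
the per-token regex gets simpler. This is a minimal sketch that uses only
java.util.regex and no Lucene classes. The mapping of each step to
WhitespaceTokenizer, PatternReplaceFilter, and LowerCaseFilter is an
assumption made for illustration, and FilterChainSketch and analyze() are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class FilterChainSketch {
    // Step 1 stand-in for a whitespace tokenizer.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Step 2 stand-in for a pattern-replace filter: because each token is
    // already whitespace-free, the regex only needs to anchor on the
    // token's own start and end.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : WHITESPACE.split(text.trim())) {
            String stripped = EDGE_PUNCT.matcher(raw).replaceAll("");
            // Step 3 stand-in for a lowercase filter.
            String lowered = stripped.toLowerCase(Locale.ROOT);
            // Tokens that were punctuation only become empty; drop them.
            if (!lowered.isEmpty()) {
                out.add(lowered);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hello, World! (It's state-of-the-art.) ..."));
        // [hello, world, it's, state-of-the-art]
    }
}
```

In a real Lucene chain, a token that is reduced to an empty string may
still need to be dropped by a later filter; that depends on the version
in use and is worth checking.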



dseltzer at tveyes

Jul 13, 2012, 6:55 AM

Post #3 of 3 (161 views)
RE: Pattern Analyzer

I think you're absolutely right, Erick.

Thanks for the insight - that's the direction I'll be heading.

Cheers,

-D
