
Mailing List Archive: Lucene: Java-User
RE: Analyzer on query question
 

Bill.Chesky at learninga-z

Aug 3, 2012, 2:35 PM



Thanks for the help everybody. We're using 3.0.1 so I couldn't do exactly what Simon and Jack suggested. But after some searching around I came up with this method:

private String analyze(String token) throws Exception {
    StringBuffer result = new StringBuffer();

    Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
    TokenStream tokenStream = analyzer.tokenStream("title", new StringReader(token));
    tokenStream.reset();
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        if (result.length() > 0) {
            result.append(" ");
        }

        result.append(termAttribute.term());
    }

    return result.toString();
}

Now I just run my search term strings thru this method first like so:

searchTerms = analyze(searchTerms);

// now do what I was doing before to build queries...

It's still not totally clear what this buys me since ultimately the query looks the same as what was being generated with my original method (perhaps this is Ian's point in his last reply). But I will defer to the gurus. It works.

Thanks for all the help.

Bill
-----Original Message-----
From: Jack Krupansky [mailto:jack [at] basetechnology]
Sent: Friday, August 03, 2012 4:03 PM
To: java-user [at] lucene
Subject: Re: Analyzer on query question

Simon gave sample code for analyzing a multi-term string.

Here's some pseudo-code (hasn't been compiled to check it) to analyze a
single term with Lucene 3.6:

public Term analyzeTerm(Analyzer analyzer, String field, String termString) throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
    stream.reset();
    if (stream.incrementToken())
        // Term takes the field name first, then the analyzed term text
        return new Term(field, stream.getAttribute(CharTermAttribute.class).toString());
    else
        return null;
    // TODO: Close the StringReader
    // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

And here's the corresponding code for Lucene 4.0:

public Term analyzeTerm(Analyzer analyzer, String field, String termString) throws IOException {
    TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
    stream.reset();
    if (stream.incrementToken()) {
        TermToBytesRefAttribute termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
        termAtt.fillBytesRef(); // fill the shared BytesRef with the current token's bytes
        BytesRef bytes = termAtt.getBytesRef();
        return new Term(field, BytesRef.deepCopyOf(bytes));
    } else
        return null;
    // TODO: Close the StringReader
    // TODO: Handle terms that analyze into multiple terms (e.g., embedded punctuation)
}

-- Jack Krupansky

-----Original Message-----
From: Bill Chesky
Sent: Friday, August 03, 2012 2:55 PM
To: java-user [at] lucene
Subject: RE: Analyzer on query question

Ian/Jack,

Ok, thanks for the help. I certainly don't want to take a cheap way out,
hence my original question about whether this is the right way to do this.
Jack, you say the right way is to do Term analysis before creating the Term.
If anybody has any information on how to accomplish this I'd greatly
appreciate it.

regards,

Bill

-----Original Message-----
From: Jack Krupansky [mailto:jack [at] basetechnology]
Sent: Friday, August 03, 2012 1:22 PM
To: java-user [at] lucene
Subject: Re: Analyzer on query question

Bill, the re-parse of Query.toString will work provided that your query
terms are either un-analyzed or their analyzer is "idempotent" (can be
applied repeatedly without changing the output terms.) In your case, you are
doing the former.
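
To make "idempotent" concrete, here is a minimal sketch in plain Java. It does not use Lucene: LOWERCASE and STRIP_S are toy stand-ins invented for this illustration, and STRIP_S is not the Snowball stemmer. The check simply analyzes the text once, analyzes the result again, and compares the two.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

final class IdempotenceCheck {
    // Toy "analyzer": lower-cases the input. A second pass changes nothing.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);

    // Toy "stemmer" (hypothetical, not Snowball): strips one trailing 's'.
    static final UnaryOperator<String> STRIP_S =
        s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

    // True if analyzing already-analyzed text leaves it unchanged for this input.
    static boolean isIdempotentFor(UnaryOperator<String> analyzer, String input) {
        String once = analyzer.apply(input);
        String twice = analyzer.apply(once);
        return once.equals(twice);
    }

    public static void main(String[] args) {
        System.out.println(isIdempotentFor(LOWERCASE, "Cells")); // true
        System.out.println(isIdempotentFor(STRIP_S, "cells"));   // true: cells -> cell -> cell
        System.out.println(isIdempotentFor(STRIP_S, "class"));   // false: class -> clas -> cla
    }
}
```

With a filter like the toy STRIP_S, re-parsing an already-analyzed query would analyze "clas" a second time and search for "cla", which was never indexed.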

The bottom line: 1) if it works for you, great, 2) for other readers, please
do not depend on this approach if your input data is filtered in any way -
if your index analyzer "filters" terms (e.g., stemming, case changes,
term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
in which case the extra parse (to cause term analysis such as stemming)
becomes unnecessary and risky if you are not very careful or very lucky.

-- Jack Krupansky

-----Original Message-----
From: Ian Lea
Sent: Friday, August 03, 2012 1:12 PM
To: java-user [at] lucene
Subject: Re: Analyzer on query question

Bill


You're getting the snowball stemming either way which I guess is good,
and if you get the same results either way maybe it doesn't matter which
technique you use. I'd be a bit worried about parsing the result of
query.toString() because you aren't guaranteed to get back, in text,
what you put in.
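
A toy illustration of that round-trip concern, in plain Java rather than Lucene. The render and parse helpers are hypothetical simplifications: render prints a term as field:text with no quoting or escaping, and parse splits on whitespace and assigns tokens without a field prefix to a default field. The sketch assumes a term whose text contains a space, as could come from an untokenized field.

```java
import java.util.ArrayList;
import java.util.List;

final class ToStringRoundTrip {
    // Hypothetical rendering of a single-term query: "field:text", no escaping.
    static String render(String field, String termText) {
        return field + ":" + termText;
    }

    // Toy parser: whitespace-separated clauses; a clause without "field:"
    // is assigned to the default field.
    static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            clauses.add(token.contains(":") ? token : defaultField + ":" + token);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // A single-word term survives the round trip.
        System.out.println(parse(render("title", "cell"), "text"));     // [title:cell]
        // A term containing a space comes back as two clauses, one in the wrong field.
        System.out.println(parse(render("title", "new york"), "text")); // [title:new, text:york]
    }
}
```

In the second case one term on "title" went in and two clauses came back, so the re-parsed query no longer means what the original did.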

My way seems better to me, but then it would. If you prefer your way
I won't argue with you.


--
Ian.


On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky <Bill.Chesky [at] learninga-z>
wrote:
> Ian,
>
> I gave this method a try, at least the way I understood your suggestion.
> E.g. to search for the phrase "cells combine" I built up a string like:
>
> title:"cells combine" description:"cells combine" text:"cells combine"
>
> then I passed that to the queryParser.parse() method (where queryParser is
> an instance of QueryParser constructed using SnowballAnalyzer) and added
> the result as a MUST clause in my final BooleanQuery.
>
> When I print the resulting query out as a string I get:
>
> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>
> So it looks like the SnowballAnalyzer is doing some stemming for me. But
> this is the exact same result I'd get doing it the way I described in my
> original email. I just built the unanalyzed string on my own rather than
> using the various query classes like PhraseQuery, etc.
>
> So I don't see the advantage to doing it this way over the original
> method. I just don't know if the original way I described is wrong or
> will give me bad results.
>
> thanks for the help,
>
> Bill
>
> -----Original Message-----
> From: Ian Lea [mailto:ian.lea [at] gmail]
> Sent: Friday, August 03, 2012 9:32 AM
> To: java-user [at] lucene
> Subject: Re: Analyzer on query question
>
> You can add parsed queries to a BooleanQuery. Would that help in this
> case?
>
> SnowballAnalyzer sba = whatever();
> QueryParser qp = new QueryParser(..., sba);
> Query q1 = qp.parse("some snowball string");
> Query q2 = qp.parse("some other snowball string");
>
> BooleanQuery bq = new BooleanQuery();
> bq.add(q1, ...);
> bq.add(q2, ...);
> bq.add(loads of other stuff);
>
>
> --
> ian.
>
>
> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky <Bill.Chesky [at] learninga-z>
> wrote:
>> Thanks Simon,
>>
>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem
>> to have been introduced until 3.1.0. Similarly my version of Lucene does
>> not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant
>> BooleanQuery.add(BooleanClause).
>>
>> In any case, most of what you're doing there, I'm just not familiar with.
>> Seems very low level. I've never had to use TokenStreams to build a
>> query before and I'm not really sure what is going on there. Also, I
>> don't know what PositionIncrementAttribute is or how it would be used to
>> create a PhraseQuery. The way I'm currently creating PhraseQuerys is
>> very straightforward and intuitive. E.g. to search for the term "foo
>> bar" I'd build the query like this:
>>
>> PhraseQuery phraseQuery = new PhraseQuery();
>> phraseQuery.add(new Term("title", "foo"));
>> phraseQuery.add(new Term("title", "bar"));
>>
>> Is there really no easier way to associate the correct analyzer with
>> these types of queries?
>>
>> Bill
>>
>> -----Original Message-----
>> From: Simon Willnauer [mailto:simon.willnauer [at] gmail]
>> Sent: Friday, August 03, 2012 3:43 AM
>> To: java-user [at] lucene; Bill Chesky
>> Subject: Re: Analyzer on query question
>>
>> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>> <Bill.Chesky [at] learninga-z> wrote:
>>> Hi,
>>>
>>> I understand that generally speaking you should use the same analyzer on
>>> querying as was used on indexing. In my code I am using the
>>> SnowballAnalyzer on index creation. However, on the query side I am
>>> building up a complex BooleanQuery from other BooleanQuerys and/or
>>> PhraseQuerys on several fields. None of these require specifying an
>>> analyzer anywhere. This is causing some odd results, I think, because a
>>> different analyzer (or no analyzer?) is being used for the query.
>>>
>>> Question: how do I build my boolean and phrase queries using the
>>> SnowballAnalyzer?
>>>
>>> One thing I did that seemed to kind of work was to build my complex
>>> query normally then build a snowball-analyzed query using a QueryParser
>>> instantiated with a SnowballAnalyzer. To do this, I simply pass the
>>> string value of the complex query to the QueryParser.parse() method to
>>> get the new query. Something like this:
>>>
>>> // build a complex query from other BooleanQuerys and PhraseQuerys
>>> BooleanQuery fullQuery = buildComplexQuery();
>>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title",
>>>     new SnowballAnalyzer(Version.LUCENE_30, "English"));
>>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>>
>>> TopScoreDocCollector collector = TopScoreDocCollector.create(10000, true);
>>> indexSearcher.search(snowballAnalyzedQuery, collector);
>>
>> you can just use the analyzer directly like this:
>> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>>
>> TokenStream stream = analyzer.tokenStream("title",
>>     new StringReader(fullQuery.toString()));
>> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
>> stream.reset();
>> BooleanQuery q = new BooleanQuery();
>> while (stream.incrementToken()) {
>>     q.addClause(new BooleanClause(
>>         new TermQuery(new Term("title", termAttr.toString())), Occur.MUST));
>> }
>>
>> you also have access to the token positions if you want to create
>> phrase queries etc. just add a PositionIncrementAttribute like this:
>> PositionIncrementAttribute posAttr =
>>     stream.addAttribute(PositionIncrementAttribute.class);
>>
>> pls. doublecheck the code it's straight from the top of my head.
>>
>> simon
>>
>>>
>>> Like I said, this seems to kind of work but it doesn't feel right. Does
>>> this make sense? Is there a better way?
>>>
>>> thanks in advance,
>>>
>>> Bill
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene

Subject User Time
Analyzer on query question Bill.Chesky at learninga-z Aug 2, 2012, 2:09 PM
    Re: Analyzer on query question simon.willnauer at gmail Aug 3, 2012, 12:42 AM
    RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 6:19 AM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 6:31 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 6:32 AM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 9:53 AM
    RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 9:57 AM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 10:12 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 10:22 AM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 11:55 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 1:03 PM
        Re: Analyzer on query question rcmuir at gmail Aug 3, 2012, 1:39 PM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 2:35 PM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 2:03 PM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 2:48 PM
