
Mailing List Archive: Lucene: Java-User
Re: Analyzer on query question
 



rcmuir at gmail

Aug 3, 2012, 1:39 PM



You must call reset() before consuming any TokenStream.
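
To make the rule concrete, here is a minimal sketch of the consumer workflow. ToyTokenStream is a hypothetical stand-in written for this note, not Lucene's TokenStream; it models only the ordering rule (reset() first, then loop on incrementToken()) and fails loudly when the order is violated. With a real Lucene stream the failure mode depends on the version, and the full workflow also calls end() and close() once the loop is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a token stream; NOT Lucene's TokenStream class.
// It only models the rule that reset() must precede incrementToken().
final class ToyTokenStream {
    private final String[] words;
    private int next = -1;   // -1 means reset() has not been called yet
    private String term;

    ToyTokenStream(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    void reset() {
        next = 0;
    }

    boolean incrementToken() {
        if (next < 0) {
            throw new IllegalStateException("reset() must be called before incrementToken()");
        }
        if (next >= words.length) {
            return false;
        }
        term = words[next++].toLowerCase(Locale.ROOT);
        return true;
    }

    String term() {
        return term;
    }
}

final class ResetContractSketch {
    // Consumer workflow: reset() first, then loop on incrementToken().
    static List<String> consume(String text) {
        ToyTokenStream stream = new ToyTokenStream(text);
        stream.reset();
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume("Cells Combine")); // prints [cells, combine]
    }
}
```

Calling incrementToken() on a fresh ToyTokenStream without reset() throws IllegalStateException, which is the mistake this message is pointing at.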

On Fri, Aug 3, 2012 at 4:03 PM, Jack Krupansky <jack [at] basetechnology> wrote:
> Simon gave sample code for analyzing a multi-term string.
>
> Here's some pseudo-code (hasn't been compiled to check it) to analyze a
> single term with Lucene 3.6:
>
> public Term analyzeTerm(Analyzer analyzer, String field, String termString)
>     throws IOException {
>   TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
>   stream.reset(); // required before consuming the stream
>   if (stream.incrementToken())
>     return new Term(field,
>         stream.getAttribute(CharTermAttribute.class).toString());
>   else
>     return null;
>   // TODO: Close the StringReader
>   // TODO: Handle terms that analyze into multiple terms (e.g., embedded
>   // punctuation)
> }
>
> And here's the corresponding code for Lucene 4.0:
>
> public Term analyzeTerm(Analyzer analyzer, String field, String termString)
>     throws IOException {
>   TokenStream stream = analyzer.tokenStream(field, new StringReader(termString));
>   stream.reset(); // required before consuming the stream
>   if (stream.incrementToken()) {
>     TermToBytesRefAttribute termAtt =
>         stream.getAttribute(TermToBytesRefAttribute.class);
>     termAtt.fillBytesRef(); // in Lucene 4.0, fills the BytesRef with the current term
>     BytesRef bytes = termAtt.getBytesRef();
>     return new Term(field, BytesRef.deepCopyOf(bytes));
>   } else
>     return null;
>   // TODO: Close the StringReader
>   // TODO: Handle terms that analyze into multiple terms (e.g., embedded
>   // punctuation)
> }
>
> -- Jack Krupansky
>
> -----Original Message----- From: Bill Chesky
> Sent: Friday, August 03, 2012 2:55 PM
> To: java-user [at] lucene
>
> Subject: RE: Analyzer on query question
>
> Ian/Jack,
>
> Ok, thanks for the help. I certainly don't want to take a cheap way out,
> hence my original question about whether this is the right way to do this.
> Jack, you say the right way is to do Term analysis before creating the Term.
> If anybody has any information on how to accomplish this I'd greatly
> appreciate it.
>
> regards,
>
> Bill
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack [at] basetechnology]
> Sent: Friday, August 03, 2012 1:22 PM
> To: java-user [at] lucene
> Subject: Re: Analyzer on query question
>
> Bill, the re-parse of Query.toString will work provided that your query
> terms are either un-analyzed or their analyzer is "idempotent" (can be
> applied repeatedly without changing the output terms.) In your case, you are
> doing the former.
>
> The bottom line: 1) if it works for you, great, 2) for other readers, please
> do not depend on this approach if your input data is filtered in any way -
> if your index analyzer "filters" terms (e.g., stemming, case changes,
> term-splitting), your Term/TermQuery should be analyzed/filtered comparably,
> in which case the extra parse (to cause term analysis such as stemming)
> becomes unnecessary and risky if you are not very careful or very lucky.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Ian Lea
> Sent: Friday, August 03, 2012 1:12 PM
> To: java-user [at] lucene
> Subject: Re: Analyzer on query question
>
> Bill
>
>
> You're getting the snowball stemming either way which I guess is good,
> and if you get same results either way maybe it doesn't matter which
> technique you use. I'd be a bit worried about parsing the result of
> query.toString() because you aren't guaranteed to get back, in text,
> what you put in.
>
> My way seems better to me, but then it would. If you prefer your way
> I won't argue with you.
>
>
> --
> Ian.
>
>
> On Fri, Aug 3, 2012 at 5:57 PM, Bill Chesky <Bill.Chesky [at] learninga-z>
> wrote:
>>
>> Ian,
>>
>> I gave this method a try, at least the way I understood your suggestion.
>> E.g. to search for the phrase "cells combine" I built up a string like:
>>
>> title:"cells combine" description:"cells combine" text:"cells combine"
>>
>> then I passed that to the queryParser.parse() method (where queryParser is
>> an instance of QueryParser constructed using SnowballAnalyzer) and added
>> the result as a MUST clause in my final BooleanQuery.
>>
>> When I print the resulting query out as a string I get:
>>
>> +(title:"cell combin" description:"cell combin" keywords:"cell combin")
>>
>> So it looks like the SnowballAnalyzer is doing some stemming for me. But
>> this is the exact same result I'd get doing it the way I described in my
>> original email. I just built the unanalyzed string on my own rather than
>> using the various query classes like PhraseQuery, etc.
>>
>> So I don't see the advantage to doing it this way over the original
>> method. I just don't know if the original way I described is wrong or
>> will give me bad results.
>>
>> thanks for the help,
>>
>> Bill
>>
>> -----Original Message-----
>> From: Ian Lea [mailto:ian.lea [at] gmail]
>> Sent: Friday, August 03, 2012 9:32 AM
>> To: java-user [at] lucene
>> Subject: Re: Analyzer on query question
>>
>> You can add parsed queries to a BooleanQuery. Would that help in this
>> case?
>>
>> SnowballAnalyzer sba = whatever();
>> QueryParser qp = new QueryParser(..., sba);
>> Query q1 = qp.parse("some snowball string");
>> Query q2 = qp.parse("some other snowball string");
>>
>> BooleanQuery bq = new BooleanQuery();
>> bq.add(q1, ...);
>> bq.add(q2, ...);
>> bq.add(loads of other stuff);
>>
>>
>> --
>> ian.
>>
>>
>> On Fri, Aug 3, 2012 at 2:19 PM, Bill Chesky <Bill.Chesky [at] learninga-z>
>> wrote:
>>>
>>> Thanks Simon,
>>>
>>> Unfortunately, I'm using Lucene 3.0.1 and CharTermAttribute doesn't seem
>>> to have been introduced until 3.1.0. Similarly my version of Lucene does
>>> not have a BooleanQuery.addClause(BooleanClause) method. Maybe you meant
>>> BooleanQuery.add(BooleanClause).
>>
>>
>>>
>>> In any case, most of what you're doing there, I'm just not familiar with.
>>> Seems very low level. I've never had to use TokenStreams to build a
>>> query before and I'm not really sure what is going on there. Also, I
>>> don't know what PositionIncrementAttribute is or how it would be used to
>>> create a PhraseQuery. The way I'm currently creating PhraseQuerys is
>>> very straightforward and intuitive. E.g. to search for the term "foo
>>> bar" I'd build the query like this:
>>>
>>> PhraseQuery phraseQuery = new PhraseQuery();
>>> phraseQuery.add(new Term("title", "foo"));
>>> phraseQuery.add(new Term("title", "bar"));
>>>
>>> Is there really no easier way to associate the correct analyzer with
>>> these types of queries?
>>>
>>> Bill
>>>
>>> -----Original Message-----
>>> From: Simon Willnauer [mailto:simon.willnauer [at] gmail]
>>> Sent: Friday, August 03, 2012 3:43 AM
>>> To: java-user [at] lucene; Bill Chesky
>>> Subject: Re: Analyzer on query question
>>>
>>> On Thu, Aug 2, 2012 at 11:09 PM, Bill Chesky
>>> <Bill.Chesky [at] learninga-z> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I understand that generally speaking you should use the same analyzer on
>>>> querying as was used on indexing. In my code I am using the
>>>> SnowballAnalyzer on index creation. However, on the query side I am
>>>> building up a complex BooleanQuery from other BooleanQuerys and/or
>>>> PhraseQuerys on several fields. None of these require specifying an
>>>> analyzer anywhere. This is causing some odd results, I think, because a
>>>> different analyzer (or no analyzer?) is being used for the query.
>>>>
>>>> Question: how do I build my boolean and phrase queries using the
>>>> SnowballAnalyzer?
>>>>
>>>> One thing I did that seemed to kind of work was to build my complex
>>>> query normally then build a snowball-analyzed query using a QueryParser
>>>> instantiated with a SnowballAnalyzer. To do this, I simply pass the
>>>> string value of the complex query to the QueryParser.parse() method to
>>>> get the new query. Something like this:
>>>>
>>>> // build a complex query from other BooleanQuerys and PhraseQuerys
>>>> BooleanQuery fullQuery = buildComplexQuery();
>>>> QueryParser parser = new QueryParser(Version.LUCENE_30, "title",
>>>>     new SnowballAnalyzer(Version.LUCENE_30, "English"));
>>>> Query snowballAnalyzedQuery = parser.parse(fullQuery.toString());
>>>>
>>>> TopScoreDocCollector collector = TopScoreDocCollector.create(10000, true);
>>>> indexSearcher.search(snowballAnalyzedQuery, collector);
>>>
>>>
>>> you can just use the analyzer directly like this:
>>> Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
>>>
>>> TokenStream stream = analyzer.tokenStream("title",
>>>     new StringReader(fullQuery.toString()));
>>> CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
>>> stream.reset();
>>> BooleanQuery q = new BooleanQuery();
>>> while (stream.incrementToken()) {
>>>   // a BooleanClause wraps a Query, so the Term goes inside a TermQuery
>>>   q.addClause(new BooleanClause(
>>>       new TermQuery(new Term("title", termAttr.toString())), Occur.MUST));
>>> }
>>>
>>> You also have access to the token positions if you want to create
>>> phrase queries etc. Just add a PositionIncrementAttribute like this:
>>> PositionIncrementAttribute posAttr =
>>>     stream.addAttribute(PositionIncrementAttribute.class);
>>>
>>> Please double-check the code; it's straight off the top of my head.
>>>
>>> simon
>>>
>>>>
>>>> Like I said, this seems to kind of work but it doesn't feel right. Does
>>>> this make sense? Is there a better way?
>>>>
>>>> thanks in advance,
>>>>
>>>> Bill
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
>>
>
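
Jack's point in the quoted thread about "idempotent" analyzers can be illustrated with a small sketch. The two toy analyzers below are hypothetical stand-ins, not Lucene or Snowball code; they only show why re-parsing Query.toString(), which sends already-analyzed terms through the analyzer a second time, is safe for some transformations and not for others.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

final class IdempotenceSketch {
    // Toy "analyzers": hypothetical stand-ins, not Lucene or Snowball code.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);

    // Strips one trailing "s" per application, so a second pass can change the term again.
    static final UnaryOperator<String> STRIP_S =
        s -> s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

    // True if analyzing the already-analyzed term leaves it unchanged.
    static boolean isStableFor(UnaryOperator<String> analyzer, String term) {
        String once = analyzer.apply(term);
        return analyzer.apply(once).equals(once);
    }

    public static void main(String[] args) {
        System.out.println(isStableFor(LOWERCASE, "Cells")); // prints true
        System.out.println(isStableFor(STRIP_S, "cells"));   // prints true
        System.out.println(isStableFor(STRIP_S, "class"));   // prints false
    }
}
```

The last case is the risky one: "class" becomes "clas" on the first pass and "cla" on the second, so a query term that has been analyzed twice no longer matches what was indexed.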



--
lucidimagination.com


Subject User Time
Analyzer on query question Bill.Chesky at learninga-z Aug 2, 2012, 2:09 PM
    Re: Analyzer on query question simon.willnauer at gmail Aug 3, 2012, 12:42 AM
    RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 6:19 AM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 6:31 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 6:32 AM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 9:53 AM
    RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 9:57 AM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 10:12 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 10:22 AM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 11:55 AM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 1:03 PM
        Re: Analyzer on query question rcmuir at gmail Aug 3, 2012, 1:39 PM
        RE: Analyzer on query question Bill.Chesky at learninga-z Aug 3, 2012, 2:35 PM
    Re: Analyzer on query question ian.lea at gmail Aug 3, 2012, 2:03 PM
    Re: Analyzer on query question jack at basetechnology Aug 3, 2012, 2:48 PM
