
Mailing List Archive: Lucene: Java-User

Extracting span terms using WeightedSpanTermExtractor


jahan9 at gmail

Jul 6, 2011, 2:34 PM

Post #1 of 7
Extracting span terms using WeightedSpanTermExtractor

I have a CustomHighlighter that extends SolrHighlighter and overrides
the doHighlighting() method. For each document I am trying to extract
the span terms so that I can later use them to get the span positions. I tried
to get the weighted span terms using WeightedSpanTermExtractor, but the result
is always empty. Below is the code that I have. Is there something missing
that needs to be added to get the span terms?

// in CustomHighlighter.java
@Override
public NamedList doHighlighting(DocList docs, Query query,
    SolrQueryRequest req, String[] defaultFields) throws IOException {

  NamedList highlightedSnippets =
      super.doHighlighting(docs, query, req, defaultFields);

  IndexReader reader = req.getSearcher().getIndexReader();

  String[] fieldNames = getHighlightFields(query, req, defaultFields);
  for (String fieldName : fieldNames)
  {
    QueryScorer scorer = new QueryScorer(query, null);
    scorer.setExpandMultiTermQuery(true);
    scorer.setMaxDocCharsToAnalyze(51200);

    DocIterator iterator = docs.iterator();
    for (int i = 0; i < docs.size(); i++)
    {
      int docId = iterator.nextDoc();
      System.out.println("DocId: " + docId);
      TokenStream tokenStream =
          TokenSources.getTokenStream(reader, docId, fieldName);
      WeightedSpanTermExtractor wste = new WeightedSpanTermExtractor(fieldName);
      wste.setExpandMultiTermQuery(true);
      wste.setWrapIfNotCachingTokenFilter(true);

      // this is always empty
      Map<String,WeightedSpanTerm> weightedSpanTerms =
          wste.getWeightedSpanTerms(query, tokenStream, fieldName);
      System.out.println("weightedSpanTerms: " + weightedSpanTerms.values());
    }
  }
  return highlightedSnippets;
}

Thanks,
Jahangir


sokolov at ifactory

Jul 6, 2011, 5:28 PM

Post #2 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

I tried something similar and failed. I think the API is lacking
there. My only advice is to vote for this:
https://issues.apache.org/jira/browse/LUCENE-2878 which should provide
a better alternative API, but it's not near completion.

-Mike



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


markrmiller at gmail

Jul 6, 2011, 6:40 PM

Post #3 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

Sorry - kind of my fault. When I fixed this to use maxDocCharsToAnalyze, I didn't set a default other than 0 because I didn't really count on this being used beyond how it is in the Highlighter - which always sets maxDocCharsToAnalyze with its default.

You've got to explicitly set it higher than 0 for now.

Feel free to create a JIRA issue and we can give it its own default greater than 0.
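
To illustrate why a limit of 0 leaves nothing to extract, here is a minimal, self-contained sketch. It does not use Lucene: OffsetLimitSketch, Token, limit and sample are hypothetical names, and the cutoff rule (stop once the characters consumed reach the limit) is a simplifying assumption about how an offset limit behaves, not the library's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an offset-limited token stream; not Lucene code.
class OffsetLimitSketch {

    // A token with its character offsets in the original text.
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Keeps tokens while the characters consumed so far stay below maxChars.
    // Assumption: with maxChars == 0 nothing passes, so an extractor fed by
    // this stream would see no terms at all.
    static List<String> limit(List<Token> tokens, int maxChars) {
        List<String> kept = new ArrayList<String>();
        int consumed = 0;
        for (Token t : tokens) {
            if (consumed >= maxChars) {
                break;
            }
            kept.add(t.text);
            consumed += t.endOffset - t.startOffset;
        }
        return kept;
    }

    // Three sample tokens with made-up offsets.
    static List<Token> sample() {
        List<Token> tokens = new ArrayList<Token>();
        tokens.add(new Token("everlasting", 0, 11));
        tokens.add(new Token("glory", 12, 17));
        tokens.add(new Token("unity", 18, 23));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(limit(sample(), 0));     // limit of 0: nothing is analyzed
        System.out.println(limit(sample(), 51200)); // explicit limit: all tokens pass
    }
}
```

In the code from the first post, the equivalent step would be to set maxDocCharsToAnalyze on the extractor itself before calling getWeightedSpanTerms(); setting it only on the QueryScorer does not affect a separately constructed extractor.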

- Mark Miller
lucidimagination.com




jahan9 at gmail

Jul 7, 2011, 2:14 PM

Post #4 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

Thanks Mark. After setting maxDocCharsToAnalyze to a value greater than 0, I
can now extract the span terms.

I did notice a strange issue though. When the query is just a
PhraseQuery (e.g. "everlasting glory"), getWeightedSpanTerms() returns all
the span terms along with their span positions. But when the query is a
BooleanQuery containing phrase and non-phrase terms (e.g. "everlasting
glory"+unity), getWeightedSpanTerms() returns all the span terms, but
span positions are returned only for the phrase terms (i.e. "everlasting" and
"glory"). The span positions for the non-phrase term (i.e. "unity") are empty. Any
ideas why this could be happening?

-Jahangir



markrmiller at gmail

Jul 7, 2011, 4:47 PM

Post #5 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

On Jul 7, 2011, at 5:14 PM, Jahangir Anwari wrote:

> I did notice a strange issue though. When the query is just a
> PhraseQuery (e.g. "everlasting glory"), getWeightedSpanTerms() returns all
> the span terms along with their span positions. But when the query is a
> BooleanQuery containing phrase and non-phrase terms (e.g. "everlasting
> glory"+unity), getWeightedSpanTerms() returns all the span terms, but
> span positions are returned only for the phrase terms (i.e. "everlasting" and
> "glory"). The span positions for the non-phrase term (i.e. "unity") are empty. Any
> ideas why this could be happening?


Positions are only collected for "position sensitive" queries. The Highlighter framework that I plugged this into already runs through the TokenStream one token at a time - to highlight a TermQuery, there is no need to consult positions - just highlight every occurrence seen while marching through the TokenStream. Which means there is no need to find those positions either.
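
The distinction can be sketched without Lucene. In the hypothetical TermVsPhraseSketch below, highlighting a single term needs only one pass over the tokens, while a phrase match has to compare token positions. The class and method names are illustrative assumptions, not part of the Highlighter API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; names are hypothetical and no Lucene classes are used.
class TermVsPhraseSketch {

    // Term highlighting: wrap every occurrence while walking the tokens once.
    // No position bookkeeping is required.
    static String highlightTerm(List<String> tokens, String term) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token.equals(term) ? "<b>" + token + "</b>" : token);
        }
        return sb.toString();
    }

    // Phrase matching: a hit only counts when the words sit at consecutive
    // positions, so the start position of each match has to be tracked.
    static List<Integer> phraseStarts(List<String> tokens, List<String> phrase) {
        List<Integer> starts = new ArrayList<Integer>();
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("everlasting", "glory", "and", "unity", "in", "glory");
        System.out.println(highlightTerm(tokens, "unity"));
        System.out.println(phraseStarts(tokens, Arrays.asList("everlasting", "glory")));
    }
}
```

Note that the second "glory" in the sample is not part of a phrase hit; only positions can tell the two occurrences apart.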

If you are looking for those positions, here is a patch to calculate them for TermQuerys as well. If you open a JIRA issue, this seems like a reasonable option to add to the class.

Index: lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
===================================================================
--- lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java (revision 1143407)
+++ lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java (working copy)
@@ -133,7 +133,7 @@
       sp.setBoost(query.getBoost());
       extractWeightedSpanTerms(terms, sp);
     } else if (query instanceof TermQuery) {
-      extractWeightedTerms(terms, query);
+      extractWeightedSpanTerms(terms, new SpanTermQuery(((TermQuery)query).getTerm()));
     } else if (query instanceof SpanQuery) {
       extractWeightedSpanTerms(terms, (SpanQuery) query);
     } else if (query instanceof FilteredQuery) {
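
Conceptually, the patch treats a plain term as a one-token span so that its positions get recorded too. The sketch below shows that idea on a plain token list. SingleTermSpanSketch and termSpans are hypothetical names, and representing each hit as a {start, end} pair with start equal to end is an assumption made for illustration, not a statement about the patched extractor's output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; this is not the patched extractor.
class SingleTermSpanSketch {

    // Returns one {start, end} pair per occurrence of the term.
    // Assumption: a single-token span starts and ends at the same position.
    static List<int[]> termSpans(List<String> tokens, String term) {
        List<int[]> spans = new ArrayList<int[]>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (tokens.get(pos).equals(term)) {
                spans.add(new int[] {pos, pos});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("unity", "brings", "glory", "and", "unity");
        for (int[] span : termSpans(tokens, "unity")) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```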


- Mark Miller
lucidimagination.com


jahan9 at gmail

Jul 8, 2011, 2:43 AM

Post #6 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

After applying the patch I was able to get the span positions for all the
terms in the query. But when I tried to access the positionSpans of each
span term I could not, because they are stored in a package-private PositionSpan
class in WeightedSpanTerm.java, which prevents them from being visible
outside the package. I was able to work around it by making the
PositionSpan utility class public: I moved the PositionSpan class into its
own file in the o.a.l.search.highlight package. I don't think this is the best
solution, and I am open to other alternatives.

This is the content of o.a.l.search.highlight.PositionSpan.java

package org.apache.lucene.search.highlight;

// Utility class to store a Span
public class PositionSpan {
  int start;
  int end;

  public PositionSpan(int start, int end) {
    this.start = start;
    this.end = end;
  }

  public int getEndPositionSpan() {
    return end;
  }

  public int getStartPositionSpan() {
    return start;
  }
}

-Jahangir



markrmiller at gmail

Jul 8, 2011, 8:41 AM

Post #7 of 7
Re: Extracting span terms using WeightedSpanTermExtractor [In reply to]

On Jul 8, 2011, at 5:43 AM, Jahangir Anwari wrote:

> I don't think this is the best
> solution, am open to other alternatives.


You could also make it public static where it is. Either way works.
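
One reading of that alternative, as a minimal sketch: keep PositionSpan nested inside its enclosing class and declare it public static, so callers in other packages can use it without it moving to its own file. WeightedSpanTermSketch, addPositionSpan and the list field are hypothetical stand-ins, not the real WeightedSpanTerm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real class; only the nesting is the point.
// In real code the outer class would be public as well; it is left
// package-private here only so the sketch compiles under any file name.
class WeightedSpanTermSketch {

    // public + static: visible wherever the outer class is visible,
    // and no outer instance is needed to create one.
    public static class PositionSpan {
        private final int start;
        private final int end;

        public PositionSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        public int getStartPositionSpan() {
            return start;
        }

        public int getEndPositionSpan() {
            return end;
        }
    }

    private final List<PositionSpan> positionSpans = new ArrayList<PositionSpan>();

    public void addPositionSpan(int start, int end) {
        positionSpans.add(new PositionSpan(start, end));
    }

    public List<PositionSpan> getPositionSpans() {
        return positionSpans;
    }

    public static void main(String[] args) {
        WeightedSpanTermSketch term = new WeightedSpanTermSketch();
        term.addPositionSpan(3, 4);
        for (WeightedSpanTermSketch.PositionSpan span : term.getPositionSpans()) {
            System.out.println(span.getStartPositionSpan() + "-" + span.getEndPositionSpan());
        }
    }
}
```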


- Mark Miller
lucidimagination.com