
Mailing List Archive: Lucene: Java-User

Query processing with Lucene

 

 



celikik at gmail

Jan 6, 2008, 4:13 AM

Post #1 of 5 (904 views)
Query processing with Lucene

Dear all,

Maybe this topic has already been discussed (if so, could I get a
reference, please?). I would like to know how Lucene actually processes
a query. For example, take the 2-word query "x y". Does Lucene fetch the
lists for "x" and "y" and intersect them, or does it do something
fancier, for example top-k techniques that try to avoid a full scan of
the index lists for "x" and "y"?

Marjan.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


cdoronc at gmail

Jan 8, 2008, 11:29 AM

Post #2 of 5 (817 views)
Re: Query processing with Lucene

Hi Marjan,

Lucene processes the query in what can be called
one-doc-at-a-time fashion.

For the example query - x y - (not the phrase query "x y") - all
documents containing either x or y are considered a match.

When processing the query - x y - the posting lists of these two
index terms are traversed, and for each document met on the way
a score is computed (taking both terms into account) and the
document is "collected". At the end of the traversal, usually the
best N collected docs are returned as the search result. So this is
an exhaustive computation creating a union of the two posting lists.
For the query - +x +y - an intersection rather than a union is
required, and the way Lucene does it is again to traverse the two
posting lists, except that only documents seen in both lists are
scored and collected. This allows the search to be optimized by
skipping large chunks of the posting lists, especially when
one term is rarer than the other.
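
The union and intersection traversals described above can be sketched in plain Java. This is only an illustration under simple assumptions, not Lucene's code: posting lists are modeled as sorted int arrays of doc ids, the class and method names (PostingsDemo, union, intersection) are made up for the example, and scoring is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy doc-at-a-time matcher over sorted doc-id arrays; hypothetical, not Lucene code. */
class PostingsDemo {

    /** Union (query: x y): every doc in either list, visited in doc-id order. */
    static List<Integer> union(int[] x, int[] y) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.length || j < y.length) {
            // An exhausted list is treated as "infinitely far away".
            int dx = i < x.length ? x[i] : Integer.MAX_VALUE;
            int dy = j < y.length ? y[j] : Integer.MAX_VALUE;
            int doc = Math.min(dx, dy);
            hits.add(doc);          // a real scorer would compute a score here
            if (dx == doc) i++;     // advance every list positioned on this doc
            if (dy == doc) j++;
        }
        return hits;
    }

    /** Intersection (query: +x +y): only docs present in both lists. */
    static List<Integer> intersection(int[] x, int[] y) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.length && j < y.length) {
            if (x[i] == y[j]) {
                hits.add(x[i]);     // seen in both lists: score and collect
                i++;
                j++;
            } else if (x[i] < y[j]) {
                i++;                // x is behind, move it forward
            } else {
                j++;                // y is behind, move it forward
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] x = {1, 4, 7, 9};
        int[] y = {2, 4, 9, 12};
        System.out.println(union(x, y));        // [1, 2, 4, 7, 9, 12]
        System.out.println(intersection(x, y)); // [4, 9]
    }
}
```

Both loops touch each document once, in doc-id order, which is what one-doc-at-a-time refers to. The intersection here still steps through the lists one entry at a time; it does not yet skip ahead.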

You can read more on Lucene scoring in Lucene's documentation;
http://lucene.apache.org/java/docs/scoring.html is a good starting
point.

HTH,
Doron



celikik at gmail

Jan 8, 2008, 1:24 PM

Post #3 of 5 (828 views)
Re: Query processing with Lucene

Thank you for your answer.

I am having trouble finding the function that traverses the documents
so that they get scored. Could you please tell me where the posting
lists (for a +x +y query) get intersected after they are read from the
index (by next(), I guess)?

In particular, I am interested in how Lucene gets the new positions
(offsets) for the documents seen in both posting lists, i.e. the
positions (in a document) of the query word x and the positions of the
query word y.

Thank you in advance!

Marjan.



cdoronc at gmail

Jan 8, 2008, 1:49 PM

Post #4 of 5 (809 views)
Re: Query processing with Lucene

This is done by Lucene's scorers. You should, however, start
with http://lucene.apache.org/java/docs/scoring.html; scorers
are described in the "Algorithm" section. "Offsets" are used
by Phrase Scorers and by Span Scorers.
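
To show why positions matter for phrase matching, here is a small plain-Java sketch. It is not how Lucene's phrase scorers are implemented: it assumes each term's positions within one document are available as a sorted int array, and the names PhraseDemo and phraseMatches are made up for the example.

```java
import java.util.Arrays;

/** Toy phrase check using per-document term positions; hypothetical, not Lucene code. */
class PhraseDemo {

    /**
     * Returns true if some occurrence of x is immediately followed by y,
     * given the sorted positions of each term within a single document.
     */
    static boolean phraseMatches(int[] posX, int[] posY) {
        for (int p : posX) {
            // The phrase "x y" needs y exactly one position after x.
            if (Arrays.binarySearch(posY, p + 1) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Document "a x y b x": x at positions 1 and 4, y at position 2.
        System.out.println(phraseMatches(new int[] {1, 4}, new int[] {2})); // true
        // Document "x a y": both terms occur, but they are not adjacent.
        System.out.println(phraseMatches(new int[] {0}, new int[] {2}));    // false
    }
}
```

A plain +x +y query only needs to know that both terms occur in the document, so a check like this is needed only for phrase-style queries.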

Doron



paul.elschot at xs4all

Jan 9, 2008, 6:49 AM

Post #5 of 5 (810 views)
Re: Query processing with Lucene

On Tuesday 08 January 2008 22:49:18 Doron Cohen wrote:
> This is done by Lucene's scorers. You should however start
> in http://lucene.apache.org/java/docs/scoring.html, - scorers
> are described in the "Algorithm" section. "Offsets" are used
> by Phrase Scorers and by Span Scorer.

That applies if "offsets" was meant as positions
within a document.

It is also possible that "offsets" was meant in the sense of using
skipTo(doc) instead of next() on a Scorer. This is done during
query search when at least one term is required.
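
The difference between next() and skipTo(doc) can be sketched with a toy iterator. The method names mirror the ones mentioned in this thread, but the ToyPostings class is hypothetical and is not Lucene's Scorer. As an assumption for the example, the posting list is a sorted int array and skipTo is simulated with a binary search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy postings iterator; hypothetical class, only the method names follow the thread. */
class ToyPostings {
    private final int[] docs;   // sorted doc ids
    private int idx = -1;       // index of the current doc, -1 before the first next()

    ToyPostings(int[] docs) { this.docs = docs; }

    /** Current doc id; only valid after next() or skipTo() returned true. */
    int doc() { return docs[idx]; }

    /** Advance to the next doc; returns false when the list is exhausted. */
    boolean next() { return ++idx < docs.length; }

    /** Jump to the first doc >= target after the current one; false when exhausted. */
    boolean skipTo(int target) {
        int from = Math.min(idx + 1, docs.length);
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        idx = pos >= 0 ? pos : -pos - 1;   // insertion point if target is absent
        return idx < docs.length;
    }

    /** Conjunction (+x +y): whichever iterator is behind skips to the other's doc. */
    static List<Integer> intersect(ToyPostings x, ToyPostings y) {
        List<Integer> hits = new ArrayList<>();
        if (!x.next() || !y.next()) return hits;
        while (true) {
            if (x.doc() == y.doc()) {
                hits.add(x.doc());
                if (!x.next() || !y.next()) return hits;
            } else if (x.doc() < y.doc()) {
                if (!x.skipTo(y.doc())) return hits;
            } else {
                if (!y.skipTo(x.doc())) return hits;
            }
        }
    }

    public static void main(String[] args) {
        ToyPostings rare = new ToyPostings(new int[] {4, 9});
        ToyPostings common = new ToyPostings(new int[] {1, 2, 4, 5, 6, 7, 8, 9, 12});
        System.out.println(intersect(rare, common)); // [4, 9]
    }
}
```

When one term is rare, the iterator over the common term jumps straight to each candidate doc instead of stepping through every entry.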

Regards,
Paul Elschot




