
Mailing List Archive: Lucene: Java-User

Use of AllTermDocs with custom scorer

 

 



peterlkeegan at gmail

Nov 16, 2009, 10:39 AM

Post #1 of 11 (1214 views)
Use of AllTermDocs with custom scorer

I have a custom query object whose scorer uses the 'AllTermDocs' to get all
non-deleted documents. AllTermDocs returns the docId relative to the
segment, but I need the absolute (index-wide) docId to access external data.
What's the best way to get the unique, non-deleted docId?

Thanks,
Peter


peterlkeegan at gmail

Nov 16, 2009, 11:06 AM

Post #2 of 11 (1207 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

I forgot to mention that this is with Lucene 2.9.1.



peterlkeegan at gmail

Nov 16, 2009, 11:50 AM

Post #3 of 11 (1207 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

The same thing is occurring in my custom sort comparator. The ScoreDocs
passed to the 'compare' method have docIds that seem to be relative to the
segment. Is there any way to translate these into index-wide docIds?

Peter



lucene at mikemccandless

Nov 16, 2009, 2:16 PM

Post #4 of 11 (1177 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

Can you remap your external data to be per segment? Presumably that
would make reopens faster for your app.

For your custom sort comparator, are you using FieldComparator? If
so, Lucene calls setNextReader to tell you the reader & docBase.

Failing these, Lucene currently visits the readers in index order.
So, you could accumulate the docBase by adding up the reader.maxDoc()
for each reader you've seen. However, this may change in future
Lucene releases.

You could also, externally, build your own map from SegmentReader ->
docBase, by calling IndexReader.getSequentialSubReaders() and stepping
through adding up the maxDoc. Then, in your search, you can lookup
the SegmentReader you're working on to get the docBase?

Mike
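
A minimal sketch of the SegmentReader -> docBase map suggested above. It does not use Lucene itself: SegmentStub is a hypothetical stand-in for whatever IndexReader.getSequentialSubReaders() returns, and only its maxDoc() matters. The sketch assumes the sub-readers are supplied in index order.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for a segment reader; only maxDoc() is modeled.
final class SegmentStub {
    private final int maxDoc;
    SegmentStub(int maxDoc) { this.maxDoc = maxDoc; }
    int maxDoc() { return maxDoc; }
}

final class DocBaseMap {
    // Walk the sub-readers in index order and give each one the running
    // sum of the maxDoc values of all readers before it.
    static Map<SegmentStub, Integer> build(SegmentStub[] subReaders) {
        // Identity semantics: readers are keyed by instance, not by equals().
        Map<SegmentStub, Integer> docBases = new IdentityHashMap<SegmentStub, Integer>();
        int docBase = 0;
        for (SegmentStub reader : subReaders) {
            docBases.put(reader, docBase);
            docBase += reader.maxDoc(); // maxDoc also counts deleted docs
        }
        return docBases;
    }
}
```

An index-wide docId is then the segment's docBase plus the segment-relative docId. The map would need to be rebuilt whenever the top-level reader is reopened.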


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


peterlkeegan at gmail

Nov 16, 2009, 3:38 PM

Post #5 of 11 (1166 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

>Can you remap your external data to be per segment?

That would provide the tightest integration but would require a major
redesign. Currently, the external data is in a single file created by
reading a stored field after the Lucene index has been committed. Creating
this file is very fast with 2.9 (considering the cost of reading all those
stored fields).

>For your custom sort comparator, are you using FieldComparator?

I'm using the deprecated FieldSortedHitQueue. I started looking into
replacing it with FieldComparator, but it was much more involved than I had
expected, so I postponed. Also, this would only be a partial solution to a
query with a custom scorer and custom sorter.

>Failing these, Lucene currently visits the readers in index order.
>So, you could accumulate the docBase by adding up the reader.maxDoc()
>for each reader you've seen. However, this may change in future
>Lucene releases.

This would work for the Scorer but not the Sorter, right?

>You could also, externally, build your own map from SegmentReader ->
>docBase, by calling IndexReader.getSequentialSubReaders() and stepping
>through adding up the maxDoc. Then, in your search, you can lookup
>the SegmentReader you're working on to get the docBase?

I think this would work for both Scorer and Sorter, right?
This seems like the best solution right now.

Thanks for the good suggestions!

Peter



lucene at mikemccandless

Nov 17, 2009, 2:49 AM

Post #6 of 11 (1156 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

On Mon, Nov 16, 2009 at 6:38 PM, Peter Keegan <peterlkeegan [at] gmail> wrote:

>>Can you remap your external data to be per segment?
>
> That would provide the tightest integration but would require a major
> redesign. Currently, the external data is in a single file created by
> reading a stored field after the Lucene index has been committed. Creating
> this file is very fast with 2.9 (considering the cost of reading all those
> stored fields).

OK. Though if you update a few docs and open a new reader, you have
to fully recreate the file? (Or, your app may simply never need to do
that...).

>>For your custom sort comparator, are you using FieldComparator?
>
> I'm using the deprecated FieldSortedHitQueue. I started looking into
> replacing it with FieldComparator, but it was much more involved than I had
> expected, so I postponed. Also, this would only be a partial solution to a
> query with a custom scorer and custom sorter.

You are using FSHQ directly, yourself? (i.e., not via TopFieldDocCollector)?

FSHQ expects you to init it with the top-level reader, and then insert
using top docIDs.

>>Failing these, Lucene currently visits the readers in index order.
>>So, you could accumulate the docBase by adding up the reader.maxDoc()
>>for each reader you've seen. However, this may change in future
>>Lucene releases.
>
> This would work for the Scorer but not the Sorter, right?

I don't fully understand the question -- the sorter is simply a
Collector impl, and Collector.setNextReader tells you docBase when
the search advances to the next reader.

>>You could also, externally, build your own map from SegmentReader ->
>>docBase, by calling IndexReader.getSequentialSubReaders() and stepping
>>through adding up the maxDoc. Then, in your search, you can lookup
>>the SegmentReader you're working on to get the docBase?
>
> I think this would work for both Scorer and Sorter, right?
> This seems like the best solution right now.

This is a generic solution, but just make sure you don't do the
map lookup for every doc collected, if you can help it, else that'll
slow down your search.

Mike
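
A minimal sketch of the setNextReader idea from this message: the collector remembers the docBase each time the search moves to a new segment and adds it to every segment-relative docId it collects. The classes below are hypothetical stand-ins that only mirror the shape of the callbacks; in Lucene 2.9 the real Collector.setNextReader also receives the IndexReader.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two collector callbacks that matter here.
abstract class SegmentAwareCollector {
    abstract void setNextReader(int docBase);
    abstract void collect(int segmentDocId);
}

final class GlobalIdCollector extends SegmentAwareCollector {
    private final List<Integer> globalIds = new ArrayList<Integer>();
    private int docBase; // docBase of the segment currently being searched

    @Override
    void setNextReader(int docBase) {
        this.docBase = docBase; // called once per segment
    }

    @Override
    void collect(int segmentDocId) {
        // Translate the segment-relative id to an index-wide id.
        globalIds.add(docBase + segmentDocId);
    }

    List<Integer> globalIds() { return globalIds; }
}
```

Because the docBase is stored in a field, the per-document work is a single addition, with no map lookup per hit.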



peterlkeegan at gmail

Nov 17, 2009, 5:58 AM

Post #7 of 11 (1141 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

The external data is just an array of fixed-length records, one for each
Lucene document. Indexes are updated at regular intervals in one jvm. A
searcher jvm opens the index and reads all the fixed-length records into
RAM. Given an index-wide docId, the custom scorer can quickly access the
corresponding fixed-length external data.

Could you explain a bit more about how mapping the external data to be per
segment would work? As I said, rebuilding the whole file isn't a big deal
and the single file keeps the Searcher's use of it simple.

With or without a SegmentReader->docBase map (which does sound like a huge
performance hit), I still don't see how the custom scorer gets the segment
number. Btw, the custom scorer usually becomes part of a ConjunctionScorer
(if that matters)

>FSHQ expects you to init it with the top-level reader, and then insert
>using top docIDs.

For sorting, I'm using FSHQ directly with a custom collector that inserts
docs to the FSHQ. But the custom collector is passed the segment-relative
docId and the custom comparator needs the index-wide docId. The custom
collector extends HitCollector. I'm missing where this type of collector
finds the docBase.

Thanks,
Peter
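
A minimal sketch of the fixed-length record lookup described at the top of this message: with one record per document held in RAM, the record for an index-wide docId starts at docId * recordLength. The class name, the record layout, and the choice to read a leading 4-byte big-endian int are all assumptions made for illustration; the actual record format is not given in the thread.

```java
import java.nio.ByteBuffer;

final class FixedRecordStore {
    private final ByteBuffer data;
    private final int recordLength;

    FixedRecordStore(byte[] bytes, int recordLength) {
        if (recordLength < 4 || bytes.length % recordLength != 0) {
            throw new IllegalArgumentException("bad record length: " + recordLength);
        }
        this.data = ByteBuffer.wrap(bytes); // big-endian by default
        this.recordLength = recordLength;
    }

    // Returns the first 4 bytes of the record for the given index-wide
    // docId, read as an int. Uses an absolute read, so no shared position
    // is mutated. (int arithmetic; very large files would need long offsets.)
    int intAt(int globalDocId) {
        return data.getInt(globalDocId * recordLength);
    }

    int recordCount() {
        return data.capacity() / recordLength;
    }
}
```

The lookup only works if the docId passed in is index-wide, which is why the segment-relative ids seen by the scorer have to be offset by the docBase first.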



peterlkeegan at gmail

Nov 17, 2009, 7:23 AM

Post #8 of 11 (1134 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

>This is a generic solution, but just make sure you don't do the
>map lookup for every doc collected, if you can help it, else that'll
>slow down your search.

What I just learned is that a Scorer is created for each segment (lights
on!).
So, couldn't I just do the subreader->docBase map lookup once when the
custom scorer is created? No need to access the map for every doc this way.

Peter
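
A minimal sketch of the idea in this message, assuming the docBase for the segment has already been looked up (for example from a subreader -> docBase map) before the per-segment scorer is constructed. SegmentScorerSketch and the externalBoost array are hypothetical; a real scorer would extend Lucene's Scorer and implement its iteration methods as well.

```java
final class SegmentScorerSketch {
    private final int docBase;           // resolved once, when the scorer is created
    private final float[] externalBoost; // hypothetical external data, indexed by index-wide docId

    SegmentScorerSketch(int docBase, float[] externalBoost) {
        this.docBase = docBase;
        this.externalBoost = externalBoost;
    }

    // Scores one segment-relative document. The only per-document cost of
    // the translation is a single addition; no map is consulted here.
    float score(int segmentDocId) {
        return externalBoost[docBase + segmentDocId];
    }
}
```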



lucene at mikemccandless

Nov 17, 2009, 8:46 AM

Post #9 of 11 (1144 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

On Tue, Nov 17, 2009 at 10:23 AM, Peter Keegan <peterlkeegan [at] gmail> wrote:
>>This is a generic solution, but just make sure you don't do the
>>map lookup for every doc collected, if you can help it, else that'll
>>slow down your search.
>
> What I just learned is that a Scorer is created for each segment (lights
> on!).
> So, couldn't I just do the subreader->docBase map lookup once when the
> custom scorer is created? No need to access the map for every doc this way.

Right, that should work.

Mike



lucene at mikemccandless

Nov 17, 2009, 8:51 AM

Post #10 of 11 (1127 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

On Tue, Nov 17, 2009 at 8:58 AM, Peter Keegan <peterlkeegan [at] gmail> wrote:
> The external data is just an array of fixed-length records, one for each
> Lucene document. Indexes are updated at regular intervals in one jvm. A
> searcher jvm opens the index and reads all the fixed-length records into
> RAM. Given an index-wide docId, the custom scorer can quickly access the
> corresponding fixed-length external data.
>
> Could you explain a bit more about how mapping the external data to be per
> segment would work? As I said, rebuilding the whole file isn't a big deal
> and the single file keeps the Searcher's use of it simple.

Well, you could use IndexReader.getSequentialSubReaders(), then step
through that array of SegmentReaders, making a separate external file
for each?

This way, when you reopen your readers, you would only need to make a
new external file for those segments that are new.

But if re-creating the entire file on each reopen isn't a problem for
you then there's no need to change this :)

> With or without a SegmentReader->docBase map (which does sound like a huge
> performance hit), I still don't see how the custom scorer gets the segment
> number. Btw, the custom scorer usually becomes part of a ConjunctionScorer
> (if that matters)

Looks like you already answered this (Lucene asks the Query's weight
for a new scorer one segment at a time).

>>FSHQ expects you to init it with the top-level reader, and then insert
>> using top docIDs.
> For sorting, I'm using FSHQ directly with a custom collector that inserts
> docs to the FSHQ. But the custom collector is passed the segment-relative
> docId and the custom comparator needs the index-wide docId. The custom
> collector extends HitCollector. I'm missing where this type of collector
> finds the docBase.

Hmm -- if you are extending HitCollector and passing that to search(),
then the docIDs fed to it should already be top-level docIDs, not
segment relative.

Mike
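
A minimal sketch of the per-segment approach described earlier in this message: after a reopen, external data is loaded only for segments that were not present before, and data for segments that still exist is reused. Keying the cache by segment name and the Loader interface are assumptions made for illustration; the sketch also ignores the case where an existing segment changes, for example through new deletions.

```java
import java.util.HashMap;
import java.util.Map;

final class PerSegmentDataCache {
    // Hypothetical hook that reads one segment's external file.
    interface Loader {
        byte[] load(String segmentName);
    }

    private Map<String, byte[]> cache = new HashMap<String, byte[]>();

    // Call after each reopen with the names of the current segments.
    // Returns how many segments had to be loaded from scratch.
    int refresh(String[] currentSegments, Loader loader) {
        Map<String, byte[]> next = new HashMap<String, byte[]>();
        int loaded = 0;
        for (String name : currentSegments) {
            byte[] data = cache.get(name);
            if (data == null) {      // new segment: build its data
                data = loader.load(name);
                loaded++;
            }
            next.put(name, data);    // existing segment: reuse as-is
        }
        cache = next;                // data for merged-away segments is dropped
        return loaded;
    }

    byte[] get(String segmentName) {
        return cache.get(segmentName);
    }
}
```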



peterlkeegan at gmail

Nov 17, 2009, 9:37 AM

Post #11 of 11 (1157 views)
Re: Use of AllTermDocs with custom scorer [In reply to]

> But if re-creating the entire file on each reopen isn't a problem for
> you then there's no need to change this :)

It's actually created after IndexWriter.commit(), but same idea. If we
needed real-time indexing, or if disk I/O gets excessive, I'd go with
separate files per segment.

>Hmm -- if you are extending HitCollector and passing that to search(),
>then the docIDs fed to it should already be top-level docIDs, not
>segment relative.

I just assumed the same was true for the collector, but you're right. The
incorrect sorting I see must be due to something else.

Thanks,
Peter

