
Mailing List Archive: Lucene: Java-User

Document-Ids and Merges

 

 



lucene_list at iconparc

Mar 27, 2012, 12:29 AM

Post #1 of 8 (475 views)
Document-Ids and Merges

Hi all,

I have a search application with 16 million documents that uses custom
scores per document using a ValueSource. These values are updated a lot
(and sometimes all at once), so I can't really write them into the index
for performance reasons. Instead, I simply have a huge array of float
values in memory and use the document ID as index in the array.
This works great as long as the index is not changed, but as soon as I
have a few new documents and deletions, index segments are merged (I
suppose) and the document IDs of existing documents change. Is there any
way to be informed when document IDs of existing documents change? If
so, is there a way to calculate the new document ID from the old one, so
I can "convert" my array to the new document IDs?

Any help would be greatly appreciated!

Best regards,
Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Mar 27, 2012, 9:15 AM

Post #2 of 8 (476 views)
Re: Document-Ids and Merges [In reply to]

In general, how Lucene assigns docIDs is a volatile implementation
detail: it's free to change from release to release.

E.g., the default merge policy (TieredMergePolicy) merges out-of-order
segments. Another example: at one point, IndexSearcher re-ordered the
segments on init. Another: because ConcurrentMergeScheduler runs
different merges in different threads, they can finish in different
orders and thus alter how subsequent merges are selected.

Really it's best if you assign your own (app-level) ID field and use
that, if you need a stable ID.

Mike McCandless

http://blog.mikemccandless.com
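
A minimal sketch of the app-level ID idea described above, in plain Java with no Lucene dependency. The class name `ScoresByAppId` and its methods are hypothetical, and it assumes the application assigns each document a stable, small, non-negative integer ID, so the scores can be indexed by that ID rather than by the Lucene docID:

```java
import java.util.Arrays;

/** Hypothetical store: scores indexed by an application-assigned ID, not by Lucene docID. */
final class ScoresByAppId {
    private static final float DEFAULT_SCORE = 1.0f;
    private float[] scores;

    ScoresByAppId(int capacity) {
        scores = new float[capacity];
        Arrays.fill(scores, DEFAULT_SCORE);
    }

    /** Sets the score for a stable app-level ID, growing the array if needed. */
    void set(int appId, float score) {
        if (appId >= scores.length) {
            int oldLength = scores.length;
            scores = Arrays.copyOf(scores, Math.max(appId + 1, oldLength * 2));
            // New slots start at the default score, not at 0.0f
            Arrays.fill(scores, oldLength, scores.length, DEFAULT_SCORE);
        }
        scores[appId] = score;
    }

    /** Returns the score for an app-level ID, or the default if the ID is unknown. */
    float get(int appId) {
        return (appId >= 0 && appId < scores.length) ? scores[appId] : DEFAULT_SCORE;
    }
}
```

Because the array is keyed by an ID the application controls, a merge that renumbers docIDs leaves it valid. The cost is the extra docID-to-ID lookup at search time.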



serera at gmail

Mar 27, 2012, 12:16 PM

Post #3 of 8 (473 views)
Re: Document-Ids and Merges [In reply to]

Or ... switch to a per-segment array. Then you no longer depend on doc
IDs staying stable across merges. You will need to build the array from
the documents that are in that segment only.

It's like FieldCache in a way. The array is relevant as long as the segment
exists (i.e. not merged away).

Hope this helps.

Shai
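
A rough sketch of the per-segment array idea, in plain Java with no Lucene dependency. Everything here is hypothetical: `PerSegmentScores` is an invented name, plain `Object` keys stand in for segment readers, and it assumes each document carries an app-level ID so the array for a segment can be filled from that segment's documents only:

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical per-segment score cache; Object keys stand in for segment readers. */
final class PerSegmentScores {
    private final Map<Object, float[]> bySegment = new IdentityHashMap<>();

    /**
     * Builds the array for one segment. appIds[i] is the app-level ID of the
     * document whose segment-local docID is i; scoreByAppId holds the live scores.
     */
    void load(Object segment, int[] appIds, float[] scoreByAppId) {
        float[] local = new float[appIds.length];
        for (int doc = 0; doc < appIds.length; doc++) {
            local[doc] = scoreByAppId[appIds[doc]];
        }
        bySegment.put(segment, local);
    }

    /** Looks up a score by segment-local docID; 1.0f if the segment is not loaded. */
    float score(Object segment, int localDoc) {
        float[] local = bySegment.get(segment);
        return (local != null && localDoc < local.length) ? local[localDoc] : 1.0f;
    }

    /** Drops the array once the segment has been merged away. */
    void evict(Object segment) {
        bySegment.remove(segment);
    }
}
```

Only newly created segments need a fresh array after a merge; arrays for segments that survive the merge stay valid as they are.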


christoph.kaser at iconparc

Mar 28, 2012, 12:34 AM

Post #4 of 8 (464 views)
Re: Document-Ids and Merges [In reply to]

Hi Shai,

That sounds interesting. However, I am unsure how I can do this. Is
there a way to store values "with a segment"? How can I get the segment
from a document ID?
Here is what my ValueSource looks like at the moment:

public class MyScoreValues extends ValueSource {
    float[] values = ...; // float array with reader.maxDoc() entries

    public DocValues getValues(IndexReader reader) throws IOException {
        return new DocValues() {
            public float floatVal(int doc) {
                // Fall back to 1.0f for docIDs beyond the end of the array
                if (doc < values.length)
                    return values[doc];
                return 1.0f;
            }
        };
    }
}

How would I need to change it to make the arrays segment-based?

Best regards,
Christoph





--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München






christoph.kaser at iconparc

Mar 28, 2012, 12:37 AM

Post #5 of 8 (470 views)
Re: Document-Ids and Merges [In reply to]

Thank you for your answer!

That's too bad. I thought of using my own ID-field, but I wanted to save
the additional indirection (from docId to my ID to my value).
Do document IDs remain constant for one IndexReader as long as it isn't
reopened? If so, I could precalculate the indirection.

Best regards,
Christoph



serera at gmail

Mar 28, 2012, 7:06 AM

Post #6 of 8 (466 views)
Re: Document-Ids and Merges [In reply to]

Hi

If you are working with trunk, then I believe that DocValues is what you're
looking for. They allow you to store values at the document level, and then
read them during search either from disk or RAM. They are also segment
based.

I'm not sure how ValueSource is used (I've never used it myself and I'm not
near the code to check), but what I had in mind is something similar to
Collector.setNextReader which allows Collectors to use, e.g. a float[] for
that IndexReader (hint, look at FieldCache).

If ValueSource or ValueSourceQuery can do that, then you could use that
mechanism. If not, you can move to do the scoring at the Collector level.

Sorry for the shallow responses - I'm answering from my mobile and won't be
near the code potentially until next week. Perhaps someone else on the list
can give you some concrete examples. If not, please continue to ask questions
and I'll do my best to answer ;).

Shai
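
To illustrate the setNextReader-style approach mentioned above, here is a small sketch in plain Java. It does not use Lucene's classes: `SegmentAwareScorer`, `setNextSegment`, and `globalDoc` are hypothetical names that only mimic the pattern of being told about each segment (and its docBase) before that segment's documents are visited.

```java
/** Hypothetical scorer that swaps its score array at every segment boundary. */
final class SegmentAwareScorer {
    private float[] current = new float[0];
    private int docBase;

    /** Called once per segment, before any document of that segment is visited. */
    void setNextSegment(float[] segmentScores, int docBase) {
        this.current = segmentScores;
        this.docBase = docBase;
    }

    /** Scores a segment-local docID using the array of the current segment. */
    float score(int localDoc) {
        return localDoc < current.length ? current[localDoc] : 1.0f;
    }

    /** Converts a segment-local docID to the top-level docID of the whole reader. */
    int globalDoc(int localDoc) {
        return docBase + localDoc;
    }
}
```

The point of the pattern is that the scoring code only ever sees segment-local docIDs, so the array it indexes into can be owned by the segment and reused for as long as that segment exists.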


lucene at mikemccandless

Mar 28, 2012, 10:40 AM

Post #7 of 8 (464 views)
Re: Document-Ids and Merges [In reply to]

On Wed, Mar 28, 2012 at 3:37 AM, Christoph Kaser
<christoph.kaser [at] iconparc> wrote:
> Thank you for your answer!
>
> That's too bad. I thought of using my own ID-field, but I wanted to save the
> additional indirection (from docId to my ID to my value).
> Do document IDs remain constant for one IndexReader as long as it isn't
> reopened? If so, I could precalculate the indirection.

Yes, the entire view of the index presented by a single IndexReader is
unchanging (not just docIDs: everything).

On reopen, a new IndexReader is returned, so the old IndexReader is
still unchanged.

So, if you can hold your arrays per-segment, and init them per-segment
(such as FieldCache, or DocValues (only in 4.0) as Shai described)
then you can safely use the docID to index those arrays just within
the context of that segment.

Mike McCandless

http://blog.mikemccandless.com



lucene_list at iconparc

Apr 5, 2012, 7:27 AM

Post #8 of 8 (431 views)
Re: Document-Ids and Merges [In reply to]

Thank you both Mike and Shai for your answers.

If anyone has a similar problem:
I ended up using a column that provides my own "document ids", whose
values I got using the FieldCache.
I then precalculate the indirection per IndexReader and store it in a
WeakHashMap<IndexReader,float[]> to save the extra lookup.

Christoph Kaser
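
For readers who want to try something similar, the following is only a sketch of the precalculated indirection, not the actual code from this application. It is plain Java without Lucene: `ReaderScoreCache` is a hypothetical name, an `Object` key stands in for the IndexReader, and `appIds` is assumed to be the per-document ID array that would come from the FieldCache.

```java
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical cache of precomputed docID -> score arrays, one per reader. */
final class ReaderScoreCache {
    private final Map<Object, float[]> cache = new WeakHashMap<>();
    private int builds; // counts how often an array was actually computed

    /**
     * Returns the score array for a reader, computing it on first use.
     * appIds[doc] is the app-level ID of docID doc; scoreByAppId holds the scores.
     */
    synchronized float[] scoresFor(Object reader, int[] appIds, float[] scoreByAppId) {
        float[] scores = cache.get(reader);
        if (scores == null) {
            scores = new float[appIds.length];
            for (int doc = 0; doc < appIds.length; doc++) {
                int id = appIds[doc];
                // Unknown IDs get the same 1.0f fallback as the original ValueSource
                scores[doc] = id < scoreByAppId.length ? scoreByAppId[id] : 1.0f;
            }
            cache.put(reader, scores);
            builds++;
        }
        return scores;
    }

    synchronized int builds() {
        return builds;
    }
}
```

The weak keys let an entry disappear once its reader is no longer referenced, so closed readers do not pin their arrays in memory. Note that a cached array is a snapshot: in this sketch it would have to be rebuilt (or the entry removed) whenever the underlying scores change.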
