
Mailing List Archive: Lucene: Java-User

addIndexesNoOptimize on shards --> is docid deterministic and calculable?


gbrits at gmail

Nov 4, 2009, 6:23 AM

Post #1 of 8
addIndexesNoOptimize on shards --> is docid deterministic and calculable?

Hi,

say I have:
- IndexReader[] readers = {reader1, reader2, reader3} // all containing
different docs
- I know the internal docids of the documents in reader1, reader2, and reader3
separately

Does calling IndexWriter.addIndexesNoOptimize(IndexReader[] readers) on these
readers give me a deterministic and calculable set of docids for the documents
in the resulting index?

I.e., from http://lucene.apache.org/java/2_4_1/fileformats.html:
"The numbers stored in each segment are unique only within the segment, and
must be converted before they can be used in a larger context. The standard
technique is to allocate each segment a range of values, based on the range
of numbers used in that segment. To convert a document number from a segment
to an external value, the segment's base document number is added."

Does assigning docids in addIndexesNoOptimize work like this?
In other words:
- docids of docs in reader1 stay the same in the IndexWriter
- docids of docs in reader2 are incremented by reader1.docs.size()
- docids of docs in reader3 are incremented by reader1.docs.size() +
reader2.docs.size()
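
The mapping being asked about can be sketched in plain Java. This is only an
illustration of the "segment base" arithmetic from the quoted file-format
text, not Lucene code: the class and method names are made up, and it assumes
the shards contain no deleted documents, so that each shard's size (what the
post calls reader.docs.size(), roughly IndexReader.maxDoc()) is also the
number of docids it occupies after the merge.

```java
// Hypothetical helper, not part of Lucene: computes the docid each shard
// would start at if shards were simply appended in array order.
final class DocIdBases {

    /** Base docid of each shard = sum of the sizes of all shards before it. */
    static int[] bases(int[] shardSizes) {
        int[] bases = new int[shardSizes.length];
        int next = 0;
        for (int i = 0; i < shardSizes.length; i++) {
            bases[i] = next;
            next += shardSizes[i];
        }
        return bases;
    }

    /** Merged docid = shard base + docid inside that shard (assumes no deletions). */
    static int mergedDocId(int[] bases, int shard, int localDocId) {
        return bases[shard] + localDocId;
    }

    public static void main(String[] args) {
        int[] sizes = {3, 5, 2};          // sizes of reader1, reader2, reader3
        int[] bases = bases(sizes);       // {0, 3, 8}
        // Local docid 4 in the second shard maps to 3 + 4.
        System.out.println(mergedDocId(bases, 1, 4)); // prints 7
    }
}
```

Whether addIndexesNoOptimize actually numbers documents this way is exactly
the open question here; the sketch only shows what "calculable" would mean if
it does.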

Thanks,
Geert-Jan
--
View this message in context: http://old.nabble.com/addIndexesNoOptimize-on-shards---%3E-is-docid-deterministic-and-calculable--tp26197146p26197146.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


gbrits at gmail

Nov 4, 2009, 6:34 AM

Post #2 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable? (IF docids of shards separately are known)

Just to clarify the question, I changed the subject to:
addIndexesNoOptimize on shards --> is docid deterministic and calculable?
(IF docids of shards separately are known)




erickerickson at gmail

Nov 4, 2009, 7:10 AM

Post #3 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

Hmmmm, why do you care? That is, what is it you're trying to do
that makes this question necessary? There might be a better
solution than trying to depend on doc IDs.

I ask because I don't think you can assume that: even if it is deterministic
in the version you're using now, it might not be in some other version.
Lucene makes no promises here.

All the advice I've ever seen says that if you want to keep track of
documents, you assign and index your own ID. You can get the
doc ID from your unique term quite efficiently if you need to.
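
A rough sketch of the "assign your own ID" idea, in plain Java rather than
against the Lucene API. The class, method, and key names are invented for
illustration. In Lucene of this era the actual lookup would typically go
through IndexReader.termDocs(new Term(field, value)); here a HashMap stands in
for the index so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for "look up the docid by a unique key term".
final class KeyToDocId {

    /** Builds key -> docid, where keysInDocIdOrder[i] is the unique key of docid i. */
    static Map<String, Integer> index(String[] keysInDocIdOrder) {
        Map<String, Integer> docIdByKey = new HashMap<>();
        for (int docId = 0; docId < keysInDocIdOrder.length; docId++) {
            Integer previous = docIdByKey.put(keysInDocIdOrder[docId], docId);
            if (previous != null) {
                // The scheme only works if every document has its own key.
                throw new IllegalArgumentException(
                        "duplicate key: " + keysInDocIdOrder[docId]);
            }
        }
        return docIdByKey;
    }

    /** Returns the docid for a key, or -1 if no document carries that key. */
    static int docIdFor(Map<String, Integer> docIdByKey, String key) {
        Integer docId = docIdByKey.get(key);
        return docId == null ? -1 : docId;
    }

    public static void main(String[] args) {
        Map<String, Integer> byKey = index(new String[] {"doc-a", "doc-b", "doc-c"});
        System.out.println(docIdFor(byKey, "doc-c")); // prints 2
    }
}
```

The point of the design is that the application-level key survives merges and
version changes, while the docid is looked up fresh whenever it is needed.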

HTH
Erick



gbrits at gmail

Nov 4, 2009, 8:06 AM

Post #4 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

This issue is related to post: "merging Parallel indexes (can
indexWriter.addIndexesNoOptimize be used?)"

Among other things described in the post above, I'm experimenting with a
combination of sharding and vertical partitioning, which I feel will greatly
improve my indexing performance, currently a real problem. More than 99% of
indexing time goes to a bunch of indexed fields (about 20,000 of them, I know
that's a lot), which are all closely related.

For this I'm considering the following setup:
N boxes will create 2 indexes each: index A contains the 20,000 indexed
fields, and index B contains the rest.

Index B is created using the normal route: indexWriter.addDocument().
But index A will be created using a custom (yet to be written) indexer. Since
the indexing client knows a lot about the documents and these particular
fields (basically, it can very efficiently calculate the inverted indexes for
all these fields and thus more or less directly construct the .frq, .tii, and
.tis files), I'm pretty sure a lot of time can be gained. That is, once I
figure out the nitty-gritty low-level details of writing to these files. Any
help here is much appreciated ;-).

At some point all of these indexes across these boxes have to be merged.
There would be 2 routes (hypothetical methods):

1.
TotalA = mergeShards(box1.A,...boxN.A)
TotalB = mergeShards(box1.B,...boxN.B)
Total = mergeVertical(TotalA, TotalB)

2.
Total1 = mergeVertical(box1.A,box1.B)
Total2 = mergeVertical(box2.A,box2.B)
...
TotalN = mergeVertical(boxN.A,boxN.B)
Total = mergeShards(Total1,...TotalN)


My question stems from option 1.

After merging the shards, TotalA and TotalB should have the same docid order,
because that's a prerequisite for doing something like:
docwriter.addIndexesNoOptimize(new ParallelReader(TotalA,TotalB))
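
The prerequisite can be made concrete with a small check, sketched here in
plain Java. It assumes each document carries a unique application key and that
the keys of each index can be listed in docid order. The class and method
names are hypothetical and nothing here calls Lucene.

```java
// Hypothetical alignment check for two vertically partitioned indexes:
// docid n must refer to the same logical document in both.
final class ParallelAlignment {

    /**
     * Returns the first docid at which the two indexes disagree
     * (different key, or one index has run out of documents),
     * or -1 if they are aligned document for document.
     */
    static int firstMismatch(String[] keysA, String[] keysB) {
        int common = Math.min(keysA.length, keysB.length);
        for (int docId = 0; docId < common; docId++) {
            if (!keysA[docId].equals(keysB[docId])) {
                return docId;
            }
        }
        // Same keys so far; a length difference is still a misalignment.
        return keysA.length == keysB.length ? -1 : common;
    }

    public static void main(String[] args) {
        String[] totalA = {"k1", "k2", "k3"};
        String[] totalB = {"k1", "k3", "k2"};   // shards merged in a different order
        System.out.println(firstMismatch(totalA, totalB)); // prints 1
    }
}
```

If such a check returns anything other than -1, combining the two indexes by
docid would pair fields from different documents.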

Sadly your suggestion doesn't work in this situation I think.

However, after having written this, I feel option 2 might be better anyway
performance-wise, because I have N boxes available that could run these in
parallel:
Total1 = mergeVertical(box1.A,box1.B)
Total2 = mergeVertical(box2.A,box2.B)
...
TotalN = mergeVertical(boxN.A,boxN.B)

In this situation I don't have to rely on mergeShards to produce a
calculable order of docids, because I do all vertical merges before merging
the shards. Of course, for each individual vertical merge the docids still
have to be in order, but this could be achieved using your suggestion.

Any advice or thoughts on whether this route would be worth the effort are
much appreciated!

Thanks for clearing my head a bit.

Geert-Jan





Total 1 = mergeVertical(box1.A,box1.B)
TotalB = mergeShards(box1.B,...boxN.B)
Total = MergeVertical(TotalA, TotalB)




At some time I want to merge these parallel indexes but need to ensure that
docids are in order.

I could indeed wait for the first index (which contains all other fields but
the 20.000) to be constructed and optimized and use your suggested method to
go from key --> docid and thus know the order in which I should add the
documents to the second index.
However this requires me to wait for the first





gbrits at gmail

Nov 4, 2009, 8:08 AM

Post #5 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

please ignore the garbage at the end ;-)




erickerickson at gmail

Nov 4, 2009, 8:34 AM

Post #6 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

You're right, my comment was irrelevant. Mostly, I try to make sure
that people aren't asking an "XY problem", that is, asking how
to do X when what they really want is Y. Most of the posts
I've seen wondering about doc IDs were exactly that, but yours
clearly isn't.

And I'm going to have to defer any other comments to people who
know more about it than I do....

Erick



gbrits at gmail

Nov 4, 2009, 9:29 AM

Post #7 of 8
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable?

Yeah I understand. Thanks anyway, it cleared my head a bit,

Geert-Jan


Erick Erickson wrote:
>
> You're right, my comment was irrelevant. Mostly, I try to make sure
> that people aren't asking an "XY problem", That is, asking for how
> to do X when what they really want is Y. And most of the posts
> I've seen wondering about doc IDs were exactly that, but yours
> clearly isn't.
>
> And I'm going to have to defer any other comments to people who
> know more about it than I do....
>
> Erick
>
> On Wed, Nov 4, 2009 at 11:08 AM, Britske <gbrits[at]gmail.com> wrote:
>
>>
>> please ignore the garbage at the end ;-)
>>
>>
>> Britske wrote:
>> >
>> > This issue is related to post: "merging Parallel indexes (can
>> > indexWriter.addIndexesNoOptimize be used?)"
>> >
>> > Among another thing described in the post above, I'm experimenting with
>> a
>> > combination of sharding and vertical partitioning which I feel will
>> > increase my indexing performance a lot, which at the moment is a real
>> > problem. Indexing time is for more than 99% related to a bunch of
>> indexed
>> > fields (+/- 20.000 of them, I know that's a lot) which are all pretty
>> much
>> > related.
>> >
>> > For this I'm considering the following setup:
>> > N boxes will create 2 indexes each: index A containing the 20.000
>> indexed
>> > fields, and index B contains the rest.
>> >
>> > Index B is created using the normal route: indexWriter.addDocument().
>> > But index A will be created using a custom (yet to write) indexer.
>> Since
>> > the indexing client knows a lot of the documents and these particular
>> > fields (basically it can very effciently calculate the inverse indexes
>> for
>> > all these fields and thus more or less directly construct .frq, .tii,
>> > .tis files) I'm pretty sure a lot of time can be gained. That is, once
>> I
>> > figure out the nitty-gritty low level details of writing to these
>> files.
>> > Any help here much appreciated ;-).
>> >
>> > At some point all of these indexes over these boxes have to be merged.
>> > there would be 2 routes: (hypothetical methods)
>> >
>> > 1.
>> > TotalA = mergeShards(box1.A,...boxN.A)
>> > TotalB = mergeShards(box1.B,...boxN.B)
>> > Total = MergeVertical(TotalA, TotalB)
>> >
>> > 2.
>> > Total 1 = mergeVertical(box1.A,box1.B)
>> > Total 2 = mergeVertical(box2.A,box2.B)
>> > ...
>> > Total N = mergeVertical(boxN.A,boxN.B)
>> > Total = mergeShards(Total1,...TotalN)
>> >
>> >
>> > My question stems from option 1.
>> >
>> > After merging shards TotalA and Total2 should have the same
>> docid-order,
>> > because that's a prereq for doing something like:
>> > docwriter.addIndexesNoOptimize(new ParallelReader(TotalA,TotalB))
>> >
>> > Sadly your suggestion doesn't work in this situation I think.
>> >
>> > However, After having written this I feel option 2 might be better
>> anyway
>> > performance wise, because I have N boxes around which could
>> parallelize:
>> > Total 1 = mergeVertical(box1.A,box1.B)
>> > Total 2 = mergeVertical(box2.A,box2.B)
>> > ...
>> > Total N = mergeVertical(boxN.A,boxN.B)
>> >
>> > In this situation I don't have to rely on mergeShards to produce a
>> > calculable order of docids, because I do all vertical merges before
>> > merging the shards. Of course for all individual vertical merges docids
>> > have to still be in order but this could be achieved using your
>> > suggestion.
>> >
>> > And advice or thought on if this route would be worth the effort or not
>> is
>> > much appreciated!
>> >
>> > Thanks for clearing my head a bit.
>> >
>> > Geert-Jan
>> >
>> >
>> >
>> >
>> >
>> > Total 1 = mergeVertical(box1.A,box1.B)
>> > TotalB = mergeShards(box1.B,...boxN.B)
>> > Total = MergeVertical(TotalA, TotalB)
>> >
>> >
>> >
>> >
>> > At some time I want to merge these parallel indexes but need to ensure
>> > that docids are in order.
>> >
>> > I could indeed wait for the first index (which contains all other
>> fields
>> > but the 20.000) to be constructed and optimized and use your suggested
>> > method to go from key --> docid and thus know the order in which I
>> should
>> > add the documents to the second index.
>> > However this requires me to wait for the first
>> >
>> >
>> >
>> > Erick Erickson wrote:
>> >>
>> >> Hmmmm, why do you care? That is, what is it you're trying to do
>> >> that makes this question necessary? There might be a better
>> >> solution than trying to depend on doc IDs.
>> >>
>> >> Because I don't think you can assume that, even if it is deterministic
>> >> with the version you're using now that it would be in some other
>> version,
>> >> Lucene makes no promises here.
>> >>
>> >> All the advice I've ever seen says that if you want to keep track of
>> >> documents, you assign and index your own ID. You can get the
>> >> doc ID from your unique term quite efficiently if you need to.
>> >>
>> >> HTH
>> >> Erick
>> >>



lucene at mikemccandless

Nov 7, 2009, 6:35 AM

Post #8 of 8 (199 views)
Permalink
Re: addIndexesNoOptimize on shards --> is docid deterministic and calculable? [In reply to]

Currently the docIDs are in fact logically appended, during
IndexWriter.addIndexes*.

This really is an implementation detail, though, so conceivably this
may change some day...

Mike
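
For readers who want to see what "logically appended" would mean
arithmetically, here is a minimal sketch in plain Java. It does not call
Lucene at all: the class DocIdBases and its methods are hypothetical names
made up for illustration, and the per-shard document counts are supplied by
hand. It assumes the shards contain no deleted documents and that the
append order matches the order of the readers array, which, as noted above,
is an implementation detail rather than a guarantee.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: models docIDs being appended
// shard after shard, assuming no deletions and a fixed shard order.
class DocIdBases {

    // Base (starting docID) of each shard when shards are appended in order.
    static int[] bases(int[] docCounts) {
        int[] bases = new int[docCounts.length];
        int next = 0;
        for (int i = 0; i < docCounts.length; i++) {
            bases[i] = next;        // shard i starts where the previous one ended
            next += docCounts[i];
        }
        return bases;
    }

    // Merged docID = base of the shard + docID local to that shard.
    static int mergedDocId(int[] bases, int shard, int localDocId) {
        return bases[shard] + localDocId;
    }

    public static void main(String[] args) {
        int[] docCounts = {3, 5, 2};   // made-up sizes for reader1..reader3
        int[] b = bases(docCounts);
        System.out.println(Arrays.toString(b));      // prints [0, 3, 8]
        System.out.println(mergedDocId(b, 2, 1));    // prints 9
    }
}
```

Because none of this is promised by Lucene, indexing your own unique ID
field and looking documents up by that term remains the safer way to track
documents across a merge.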


