
Mailing List Archive: Lucene: Java-User

Question about Lucene 2.3. file formats?

 

 



ivasilev at sirma

Jan 22, 2008, 11:24 AM

Post #1 of 3 (764 views)
Question about Lucene 2.3. file formats?

Hi Lucene Guys,

As I see on the file formats page of the Lucene web site, version 2.3
will bring some changes to the file formats that are very important for us.
First I will describe what we do, and then I will ask my questions.

We distribute the index across several machines. Our implementation
copies some segments to a machine and creates a segments_N metadata file
for them according to the rules described on the Lucene web site. We
choose which segments to move to which machine based on the available
disk space and the size in bytes of the segments. As I understand it,
2.3 will use data sharing, so that some segments will store documents
belonging to other segments. This raises some questions about how to
support our clustering with Lucene 2.3.
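
The placement step described above can be sketched roughly as follows. This is only an illustration of one possible heuristic (largest segment first, onto the machine with the most free space); the class and method names are hypothetical and not part of Lucene, and the thread does not say which rule the original tools actually use.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: assigns segments to machines by size.
final class SegmentPlacement {

    /**
     * Greedy placement: take segments from largest to smallest and put each
     * one on the machine that currently has the most free space.
     * Returns an array mapping segment index to machine index.
     * Throws IllegalStateException if some segment fits on no machine.
     */
    static int[] place(long[] segmentBytes, long[] freeBytes) {
        long[] free = freeBytes.clone(); // do not modify the caller's array

        // Segment indices sorted by descending size.
        Integer[] order = new Integer[segmentBytes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(segmentBytes[b], segmentBytes[a]));

        int[] assignment = new int[segmentBytes.length];
        for (int seg : order) {
            int best = -1;
            for (int m = 0; m < free.length; m++) {
                boolean fits = free[m] >= segmentBytes[seg];
                if (fits && (best < 0 || free[m] > free[best])) {
                    best = m;
                }
            }
            if (best < 0) {
                throw new IllegalStateException("segment " + seg + " does not fit on any machine");
            }
            free[best] -= segmentBytes[seg];
            assignment[seg] = best;
        }
        return assignment;
    }
}
```

With shared doc stores, the byte size of a segment's own files may no longer reflect everything that has to be copied along with it, which is the concern behind the questions below.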

1. Is this sharing temporary or permanent? That is, does sharing happen
only while documents are being added to the index, with the shared
documents later redistributed among the segments that use them (for
example when optimization or some other process runs)? Or can documents
remain shared after optimizing?
2. Is there a way to unshare documents? That is, when transferring a
segment to another machine, can I move its documents out of the other
segment that holds them and into it?
3. In the Lucene 2.3 source code in SVN I see a class,
LogByteSizeMergePolicy, that controls the maximum size of a segment
that may be merged. Here I have three questions:

3.1. Can I control not only the maximum size of the segments that will be
merged, but also the maximum (or approximate maximum) size of the segments
that result from merging?

3.2. Can I somehow limit the size of a segment in general, at least
approximately? That is, can I stop documents from being added to a
segment once it reaches some size?

3.3. Can I somehow limit the combined size of a segment and all the other
segments whose documents are stored in it?

Best Regards,
Ivan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Jan 22, 2008, 12:07 PM

Post #2 of 3 (715 views)
Re: Question about Lucene 2.3. file formats?

Ivan Vasilev wrote:

> Hi Lucene Guys,
>
> As I see on the file formats page of the Lucene web site, version
> 2.3 will bring some changes to the file formats that are very
> important for us. First I will describe what we do, and then I will
> ask my questions.
>
> We distribute the index across several machines. Our implementation
> copies some segments to a machine and creates a segments_N metadata
> file for them according to the rules described on the Lucene web
> site. We choose which segments to move to which machine based on the
> available disk space and the size in bytes of the segments. As I
> understand it, 2.3 will use data sharing, so that some segments will
> store documents belonging to other segments. This raises some
> questions about how to support our clustering with Lucene 2.3.

Are you referring to sharing of the docStore files (term vectors &
stored fields), when autoCommit=false? Assuming so....

> 1. Is this sharing temporary or permanent? That is, does sharing
> happen only while documents are being added to the index, with the
> shared documents later redistributed among the segments that use
> them (for example when optimization or some other process runs)?
> Or can documents remain shared after optimizing?

Only documents added in a single IndexWriter session are shared. If
you run optimize, assuming the index was not already optimized, the
sharing is removed since there's only one segment at the end.

Also, if you open with autoCommit=true then there is no sharing.

> 2. Is there a way to unshare documents? That is, when transferring a
> segment to another machine, can I move its documents out of the
> other segment that holds them and into it?

Use writer.addIndexesNoOptimize (in general, that is the safest way
to merge indices, instead of trying to build your own segments_N file).

> 3. In the Lucene 2.3 source code in SVN I see a class,
> LogByteSizeMergePolicy, that controls the maximum size of a
> segment that may be merged. Here I have three questions:
>
> 3.1. Can I control not only the maximum size of the segments that
> will be merged, but also the maximum (or approximate maximum) size
> of the segments that result from merging?

Not really, since it's hard to predict the size the segment will be
after merging. LogByteSizeMergePolicy only limits the max size of a
segment that may be merged. You could roughly guess the final size of
the segments (say sum of all byte sizes, proportionally reduced based
on pending deletes) and put that into your own MergePolicy?
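
A minimal sketch of that rough guess, assuming each segment's size shrinks in proportion to its fraction of deleted documents. The class and parameter names are hypothetical, and this is plain arithmetic rather than a Lucene MergePolicy; the real merged size will differ somewhat.

```java
// Hypothetical helper, not part of Lucene: rough size estimate for a merge.
final class MergedSizeEstimate {

    /**
     * Estimates the byte size of the segment produced by merging the given
     * segments: the sum of their byte sizes, each scaled by the fraction of
     * its documents that are not deleted.
     *
     * @param sizeBytes   on-disk size of each segment
     * @param docCount    total documents in each segment, including deleted ones
     * @param deletedDocs pending deletes in each segment
     */
    static long estimate(long[] sizeBytes, int[] docCount, int[] deletedDocs) {
        if (sizeBytes.length != docCount.length || sizeBytes.length != deletedDocs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double total = 0.0;
        for (int i = 0; i < sizeBytes.length; i++) {
            if (docCount[i] <= 0) {
                continue; // an empty segment contributes nothing
            }
            double liveFraction = 1.0 - (double) deletedDocs[i] / docCount[i];
            total += sizeBytes[i] * liveFraction;
        }
        return Math.round(total);
    }
}
```

For example, a 1000-byte segment with half of its documents deleted merged with a 2000-byte segment with no deletes is estimated at 500 + 2000 = 2500 bytes.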

> 3.2. Can I somehow limit the size of a segment in general, at least
> approximately? That is, can I stop documents from being added to a
> segment once it reaches some size?

Lucene never adds docs to a segment, except by merging. So by
preventing a segment > size X from being merged (using
LogByteSizeMergePolicy), I think that's the best you can do.
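
The effect of such a cap can be illustrated with a small filter: segments above the limit are simply never offered for merging, so they stop growing. This is a sketch with hypothetical names, not Lucene's actual selection logic; in Lucene itself the limit is, as far as I know, set through LogByteSizeMergePolicy.setMaxMergeMB, so check the javadoc for the exact behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: picks segments that may still be merged.
final class MergeCandidates {

    /**
     * Returns the indices of the segments whose size does not exceed
     * maxMergeBytes. Larger segments are left out, so no merge ever
     * adds documents to them.
     */
    static List<Integer> eligible(long[] segmentBytes, long maxMergeBytes) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < segmentBytes.length; i++) {
            if (segmentBytes[i] <= maxMergeBytes) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Note that this bounds the inputs of a merge, not its output: merging several segments that are each just under the limit can still produce a segment several times larger than the limit.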

> 3.3. Can I somehow limit the combined size of a segment and all the
> other segments whose documents are stored in it?

Not really, but your own merge policy could roughly guess...

Mike


ivasilev at sirma

Jan 23, 2008, 6:25 AM

Post #3 of 3 (711 views)
Re: Question about Lucene 2.3. file formats?

Thanks Michael for your answer :)

Actually, the writer.addIndexesNoOptimize method cannot help us, because
our aim is to split indexes rather than to merge them. But your
information about setting autoCommit=true is very helpful, because that
way we will avoid sharing of stored fields and will be able to keep using
our tools for splitting the index. The only thing we will have to do is
write -1 in the DocStoreOffset position of the segments_N file.

Thanks,
Ivan



