
Mailing List Archive: Lucene: Java-User

Modification of positional information encoding

 

 



renaud.delbru at deri

Oct 13, 2008, 6:53 AM

Post #1 of 5 (2232 views)
Modification of positional information encoding

Hi,

We are trying to modify the positional encoding of term occurrences for
experimentation purposes. One solution we have adopted is to use payloads to
store our own positional information encoding, but with this solution
it becomes difficult to measure the increase or decrease in index size.
This is why we would like to change the positional encoding directly.

I have seen that Michael McCandless recently refactored the
DocumentsWriter into a flexible indexer chain (see LUCENE-1301). By
analysing the code, I have noticed that only two classes should be
modified when writing a document:

When adding a document with IndexWriter.addDocument
- FieldInvertState (increment and store positions)
- FreqProxTermsWriterPerField.writeProx

When flushing documents:
- FreqProxTermsWriterPerField.appendPostings

I have noticed that only one class should be modified when reading a
document:
- SegmentTermPositions.nextPositions, and
- SegmentTermPositions.readDeltaPositions
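
For reference, here is a minimal sketch of the kind of encoding these
methods deal with: positions stored as deltas from the previous position,
each delta written as a variable-length integer (VInt). It ignores
payloads, and the class and method names (PositionDeltaCodec, encode,
decode) are made up for illustration; they are not Lucene classes.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a stand-alone delta + VInt position codec.
final class PositionDeltaCodec {

    // Write a non-negative int using 7 data bits per byte; the high bit is
    // set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode ascending positions as VInt deltas from the previous position.
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int p : positions) {
            writeVInt(out, p - last);
            last = p;
        }
        return out.toByteArray();
    }

    // Decode `count` positions by reading VInt deltas and accumulating them.
    static int[] decode(byte[] data, int count) {
        int[] positions = new int[count];
        int offset = 0;
        int last = 0;
        for (int i = 0; i < count; i++) {
            int b = data[offset++] & 0xFF;
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = data[offset++] & 0xFF;
                delta |= (b & 0x7F) << shift;
            }
            last += delta;
            positions[i] = last;
        }
        return positions;
    }
}
```

Comparing the length of encode(...) for two candidate encodings on the same
position lists would give a direct size measurement, which is what the
payload workaround makes hard.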

Could a member of the Lucene team confirm that this list is correct? Have I
forgotten any classes?

Another question: since the Lucene core classes are fairly closed, what
is the best way to implement these modifications? Should I make a branch of
Lucene and add my new classes to the package
org.apache.lucene.index? Or is a more elegant solution possible?

Thanks in advance,
Regards.
--
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Oct 13, 2008, 7:53 AM

Post #2 of 5 (2119 views)
Re: Modification of positional information encoding [In reply to]

Renaud Delbru wrote:

> Hi,
>
> We are trying to modify the positional encoding of a term occurrence
> for experimentation purposes. One solution we adopt is to use
> payloads to store our own positional information encoding, but with
> this solution, it becomes difficult to measure the increase or
> decrease of index size. It is why we would like to directly change
> the positional encoding.
>
> I have seen that Michael McCandless recently refactored the
> DocumentsWriter into a flexible indexer chain (see LUCENE-1301). By
> analysing the code, I have noticed that only two classes should be
> modified when writing a document:
>
> When adding a document with IndexWriter.addDocument
> - FieldInvertState (increment and store positions)
> - FreqProxTermsWriterPerField.writeProx
>
> When flushing documents:
> - FreqProxTermsWriterPerField.appendPostings
>
> I have noticed that only one class should be modified when reading a
> document:
> - SegmentTermPositions.nextPositions, and
> - SegmentTermPositions.readDeltaPositions
>
> Could a member of the Lucene team approve my modifications ? Do I
> forget to modify some classes ?

This looks right, though you would also need to modify SegmentMerger
to read & write your new format when merging segments.

Another thing you could do is grep for "omitTf" which should touch
exactly the same places you need to touch.

It'd be awesome to get to the point where this read & write logic is
captured in a single "codec" that's cleanly shared in all these places
("flexible indexing") but we are not quite there yet...

Out of curiosity, what change are you planning on trying?

> Another question, since the lucene core classes are kind of close,
> what is the best way to implement these modifications ? Make a
> branch of lucene, and add my new classes to the lucene package
> org.apache.lucene.index ? Or do a more elegant solution is possible ?

For starters (to try things out) I would just make local modifications
with a lucene source checkout (via svn).

Also, this issue was just opened:

https://issues.apache.org/jira/browse/LUCENE-1419

which would make it possible for classes in the same package
(oal.index) to use their own indexing chain. With that fix, if you
make your own classes in oal.index package, and perhaps subclass the
above classes, you could then create your own indexing chain for
indexing? If you take that approach, please report back so we can
learn how to improve Lucene for these very advanced customizations!

Mike



renaud.delbru at deri

Oct 13, 2008, 8:38 AM

Post #3 of 5 (2122 views)
Re: Modification of positional information encoding [In reply to]

Hi,

Michael McCandless wrote:
> This looks right, though you would also need to modify SegmentMerger
> to read & write your new format when merging segments.
>
> Another thing you could do is grep for "omitTf" which should touch
> exactly the same places you need to touch.
Ok, thanks for the pointers. I will examine this part of the Lucene code.
>
> It'd be awesome to get to the point where this read & write logic is
> captured in a single "codec" that's cleanly shared in all these places
> ("flexible indexing") but we are not quite there yet...
Yes, that would be really handy for experimenting with alternative inverted
index structures. At the moment, modifying the index structure requires
quite some work and reverse engineering.
>> Another question, since the lucene core classes are kind of close,
>> what is the best way to implement these modifications ? Make a branch
>> of lucene, and add my new classes to the lucene package
>> org.apache.lucene.index ? Or do a more elegant solution is possible ?
>
> For starters (to try things out) I would just make local modifications
> with a lucene source checkout (via svn).
>
> Also, this issue was just opened:
>
> https://issues.apache.org/jira/browse/LUCENE-1419
>
> which would make it possible for classes in the same package
> (oal.index) to use their own indexing chain. With that fix, if you
> make your own classes in oal.index package, and perhaps subclass the
> above classes, you could then create your own indexing chain for
> indexing? If you take that approach, please report back so we can
> learn how to improve Lucene for these very advanced customizations!
Ok, thanks for the reference. I will try this solution and will report
any problems I encounter.

Regards.
--
Renaud Delbru



renaud.delbru at deri

Oct 14, 2008, 6:35 AM

Post #4 of 5 (2123 views)
Re: Modification of positional information encoding [In reply to]

Hi Michael,

Michael McCandless wrote:
> Also, this issue was just opened:
>
>
> https://issues.apache.org/jira/browse/LUCENE-1419
>
> which would make it possible for classes in the same package
> (oal.index) to use their own indexing chain. With that fix, if you
> make your own classes in oal.index package, and perhaps subclass the
> above classes, you could then create your own indexing chain for
> indexing? If you take that approach, please report back so we can
> learn how to improve Lucene for these very advanced customizations!
>
As a first impression, what would be handy for customizing postings
lists is to make FreqProxTermsWriter an abstract class that
separates segment creation from term information serialisation. This
class would implement the generic logic for flushing and appending
postings, but would delegate to subclasses the way doc + freq
and prox + payload info are written.

A first idea would be to have the following abstract methods:
- writeMinState: called by appendPostings; defines how to serialise
one FreqProxFieldMergeState
- writeDocFreq: called by writeMinState; defines how to serialise
docs and freqs
- writeProx: called by writeMinState; defines how to serialise
positions and payloads

I think the other parts of FreqProxTermsWriter can stay generic. What do
you think?
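
To make the idea more concrete, here is a rough, self-contained sketch of
that split as a template-method class. Everything in it is hypothetical:
SketchTermsWriter, DocPostings and DeltaVIntTermsWriter are stand-ins I made
up, not Lucene classes, and DocPostings replaces FreqProxFieldMergeState.
For brevity writeMinState has a default body here instead of being abstract.
The concrete subclass uses a doc/freq layout where the doc delta is shifted
left by one bit and the low bit flags freq == 1, which is my understanding
of how the current format stores it.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical stand-in for one document's buffered postings of a term.
final class DocPostings {
    final int docId;
    final int[] positions;

    DocPostings(int docId, int... positions) {
        this.docId = docId;
        this.positions = positions;
    }
}

// Generic flushing logic lives here; serialisation is left to subclasses.
abstract class SketchTermsWriter {
    protected final ByteArrayOutputStream freqOut = new ByteArrayOutputStream();
    protected final ByteArrayOutputStream proxOut = new ByteArrayOutputStream();

    // Generic part: walk the postings of one term in doc order.
    final void appendPostings(List<DocPostings> postings) {
        int lastDocId = 0;
        for (DocPostings p : postings) {
            writeMinState(p, lastDocId);
            lastDocId = p.docId;
        }
    }

    // Serialise one document's state by calling the two hooks below.
    protected void writeMinState(DocPostings p, int lastDocId) {
        writeDocFreq(p.docId - lastDocId, p.positions.length);
        writeProx(p.positions);
    }

    protected abstract void writeDocFreq(int docDelta, int freq);

    protected abstract void writeProx(int[] positions);

    // Variable-length int: 7 data bits per byte, high bit means "more bytes".
    protected static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    byte[] freqBytes() { return freqOut.toByteArray(); }

    byte[] proxBytes() { return proxOut.toByteArray(); }
}

// One possible encoding: delta-coded docs and positions written as VInts.
final class DeltaVIntTermsWriter extends SketchTermsWriter {
    @Override
    protected void writeDocFreq(int docDelta, int freq) {
        if (freq == 1) {
            writeVInt(freqOut, (docDelta << 1) | 1); // low bit set: freq is 1
        } else {
            writeVInt(freqOut, docDelta << 1);
            writeVInt(freqOut, freq);
        }
    }

    @Override
    protected void writeProx(int[] positions) {
        int last = 0; // positions restart from 0 for each document
        for (int p : positions) {
            writeVInt(proxOut, p - last);
            last = p;
        }
    }
}
```

An alternative positional encoding would then only need another subclass
overriding writeProx, leaving appendPostings untouched.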

Regards.
--
Renaud Delbru



lucene at mikemccandless

Oct 15, 2008, 2:33 AM

Post #5 of 5 (2093 views)
Re: Modification of positional information encoding [In reply to]

Renaud Delbru wrote:

> Hi Michael,
>
> Michael McCandless wrote:
>> Also, this issue was just opened:
>>
>>
>> https://issues.apache.org/jira/browse/LUCENE-1419
>>
>> which would make it possible for classes in the same package
>> (oal.index) to use their own indexing chain. With that fix, if you
>> make your own classes in oal.index package, and perhaps subclass
>> the above classes, you could then create your own indexing chain
>> for indexing? If you take that approach, please report back so we
>> can learn how to improve Lucene for these very advanced
>> customizations!
>>
> As a first impression, what will be handy in order to customize
> postings list will be to make an abstract class FreqProxTermsWriter,
> that separates segment creation and term information serialisation.
> This class will implement the generic logic for flushing and
> appending postings, but will delegate to subclasses the way you
> write doc + freq and prox + payload info.
>
> A first idea will be to have the following abstract methods:
> - writeMinState : called by appendPostings, and define how to
> serialise one FreqProxFieldMergeState
> - writeDocFreq : called by writeMinState, and define how to
> serialise docs and freq
> - writeProx: called by writeMinState and define how to serialise
> positions and payloads
>
> I think other parts of the FreqProxTermsWriter can stay generic.
> What do you think ?

I agree: let's decouple the "codec" (how to write terms/freq/prox)
from the other mechanics in FreqProxTermsWriter.

I don't think FreqProxFieldMergeState should be visible to that codec,
though. That class is used, internally to FreqProxTermsWriter, to
manage the multiple threads that had accumulated postings data.

I think the codec API could look something like this:

newField(...)
startTerm(...)
startDocument(...)
addPosition(...)
endDocument(...)
endTerm(...)

We would then make a codec that matches today's index file format, but
allow for others (you) to swap in a new codec. All of this would be
experimental & private to oal.index for starters.
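
As a very rough illustration of the call sequence, the sketch below turns
the six method names above into a Java interface and drives it from an
in-memory postings map. Only the method names come from the list above; the
parameter lists, the PostingsCodec, TraceCodec and PostingsDriver names, and
the map layout are all assumptions made for the example.

```java
import java.util.Map;
import java.util.SortedMap;

// Hypothetical signatures: only the method names are given in this thread.
interface PostingsCodec {
    void newField(String field);
    void startTerm(String term);
    void startDocument(int docId, int freq);
    void addPosition(int position);
    void endDocument();
    void endTerm();
}

// Toy codec that records the calls it receives instead of writing files.
final class TraceCodec implements PostingsCodec {
    final StringBuilder trace = new StringBuilder();

    public void newField(String field) { trace.append("F:").append(field).append(' '); }
    public void startTerm(String term) { trace.append("T:").append(term).append(' '); }
    public void startDocument(int docId, int freq) {
        trace.append("D:").append(docId).append('x').append(freq).append(' ');
    }
    public void addPosition(int position) { trace.append("P:").append(position).append(' '); }
    public void endDocument() { trace.append("/D "); }
    public void endTerm() { trace.append("/T "); }
}

// Generic mechanics: iterate terms and docs in sorted order, feed the codec.
final class PostingsDriver {
    // terms maps term -> (docId -> positions), both in sorted order.
    static void writeField(String field,
                           SortedMap<String, SortedMap<Integer, int[]>> terms,
                           PostingsCodec codec) {
        codec.newField(field);
        for (Map.Entry<String, SortedMap<Integer, int[]>> t : terms.entrySet()) {
            codec.startTerm(t.getKey());
            for (Map.Entry<Integer, int[]> d : t.getValue().entrySet()) {
                int[] positions = d.getValue();
                codec.startDocument(d.getKey(), positions.length);
                for (int p : positions) {
                    codec.addPosition(p);
                }
                codec.endDocument();
            }
            codec.endTerm();
        }
    }
}
```

The driver never sees how anything is serialised, so swapping in a
different codec implementation would not touch the iteration logic.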

Mike

