
Mailing List Archive: Lucene: Java-User

IndexingChain and TermHash

 

 



renaud.delbru at deri

Nov 6, 2009, 9:29 AM

Post #1 of 10
IndexingChain and TermHash

Hi,

I am trying to modify Lucene's indexing chain. As a first step, I have
extracted and modified the default indexing chain: I simply removed
the TermVectorsTermsWriter from the chain, i.e., I instantiate a
TermsHash with a null 'nextTermsHash'. In the chain, my inverted doc
consumer looks like this:

final InvertedDocConsumer termsHash = new TermsHash(documentsWriter,
true, freqProxWriter, null);

From looking at the code of TermsHash, it looks like you can pass a
null value as the 'nextTermsHash' parameter. However, I got an NPE during
the TermsHashPerThread initialisation (line 48, primaryPerThread is null).

Is this the expected behavior? Does TermsHash always expect a non-null
'nextTermsHash' by design?

Thanks
--
Renaud Delbru

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


lucene at mikemccandless

Nov 6, 2009, 10:09 AM

Post #2 of 10
Re: IndexingChain and TermHash [In reply to]

To be honest, you are sort of forging new territory here :)

The intention of TermsHash was to allow a null nextTermsHash.
Hopefully fixing a few places will in fact make it work that way.

E.g., for this, maybe change (in TermsHashPerThread.java) this:

  if (nextTermsHash != null) {
    // We are primary
    charPool = new CharBlockPool(termsHash.docWriter);
    primary = true;
  } else {
    charPool = primaryPerThread.charPool;
    primary = false;
  }

to:

  if (nextTermsHash != null || primaryPerThread == null) {
    // We are primary
    charPool = new CharBlockPool(termsHash.docWriter);
    primary = true;
  } else {
    charPool = primaryPerThread.charPool;
    primary = false;
  }

?

Mike
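
To illustrate the proposed condition, here is a minimal, self-contained
sketch. The classes CharPoolSketch and PerThreadSketch are hypothetical
stand-ins invented for this example; they are not Lucene's real
CharBlockPool or TermsHashPerThread, and the boolean 'hasNext' stands in
for 'nextTermsHash != null'. It only shows why the extra
'primaryPerThread == null' test avoids the NPE when a TermsHash has no
next TermsHash and no primary.

```java
// Hypothetical stand-in for a shared character pool (not Lucene's CharBlockPool).
final class CharPoolSketch {}

// Hypothetical stand-in modelling only the primary/secondary decision.
final class PerThreadSketch {
    final CharPoolSketch charPool;
    final boolean primary;

    // hasNext: true if another TermsHash follows this one in the chain.
    // primaryPerThread: per-thread state of the primary, or null if none exists.
    PerThreadSketch(boolean hasNext, PerThreadSketch primaryPerThread) {
        if (hasNext || primaryPerThread == null) {
            // We are primary: allocate and own the pool.
            charPool = new CharPoolSketch();
            primary = true;
        } else {
            // We are secondary: share the primary's pool.
            charPool = primaryPerThread.charPool;
            primary = false;
        }
    }

    public static void main(String[] args) {
        // A lone TermsHash (null next, null primary) no longer dereferences null.
        PerThreadSketch solo = new PerThreadSketch(false, null);
        System.out.println("solo primary: " + solo.primary);

        // The usual two-level chain still behaves as before.
        PerThreadSketch head = new PerThreadSketch(true, null);
        PerThreadSketch tail = new PerThreadSketch(false, head);
        System.out.println("tail shares pool: " + (tail.charPool == head.charPool));
    }
}
```

With the original condition, the lone case would fall into the else branch
and read 'primaryPerThread.charPool' on a null reference.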



renaud.delbru at deri

Nov 6, 2009, 10:34 AM

Post #3 of 10
Re: IndexingChain and TermHash [In reply to]

Hi Michael,

Thanks for the quick fix. I have tested it (indexing multiple documents
+ searching), and it seems to work.

On 06/11/09 18:09, Michael McCandless wrote:
> To be honest, you are sort of forging new territory here :)
>
I think so too, not an easy task ;o). I have seen that you have tried to
make the indexing chain of Lucene (DocumentsWriter) modular. I am still
trying to get a good understanding of the default indexing chain, but I
would like to see how easy (or difficult) it is to modify the format of
the postings. From my current understanding, it seems that only the
consumers at the end of this chain (FreqProxTermsWriter and its consumer
FormatPostingsFieldsWriter) have to be changed to a certain extent.

Thanks for the help.
--
Renaud Delbru


lucene at mikemccandless

Nov 14, 2009, 5:22 AM

Post #4 of 10
Re: IndexingChain and TermHash [In reply to]

On Fri, Nov 6, 2009 at 1:34 PM, Renaud Delbru <renaud.delbru [at] deri> wrote:
> I think so too, not an easy task ;o). I have seen that you have tried to
> make modular the indexing chain of Lucene (DocumentsWriter). I still try to
> have a good understanding of the default indexing, but I would like to see
> how it is easy (or difficult) to modify the format of the postings. From my
> current understanding, it seems that only the consumer at the end of this
> chain (FreqProxTermsWriter and its consumer FormatPostingsFieldsWriter) has
> to be changed to a certain extent.

Right, those two classes do the writing of the postings, currently.

But with flexible indexing (LUCENE-1458), still in progress, we hope
to make the codec that actually reads and writes the postings more
easily pluggable.

Mike



renaud.delbru at deri

Nov 16, 2009, 4:28 AM

Post #5 of 10
Re: IndexingChain and TermHash [In reply to]

Hi Michael,

I see that a huge amount of work has already been done in
LUCENE-1458. Is there a way to check out the corresponding branch and
start using it? At least to see whether I can extend it and create my
own Codec.
I have started on my side to abstract the indexing chain of Lucene 2.9,
in order to be able to plug in my own chain, but I have the impression
that you've done something similar already (with the codec abstraction).
It would be a pity to waste my time doing something less convenient than
your approach.

Thanks.
--
Renaud Delbru



lucene at mikemccandless

Nov 16, 2009, 5:01 AM

Post #6 of 10
Re: IndexingChain and TermHash [In reply to]

Yes, the branch is here:

https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

Mark (Miller) periodically re-syncs it to trunk.

All tests should pass, and if you create a new Codec, please share the
experience!

There are not yet many Codecs in existence... the branch has the
"standard" codec (closest to Lucene's current index format, but makes
some compelling improvements to the terms dict), a "pulsing" codec
(which inlines low-freq terms into the terms dict), and an intblock codec
(an abstract base for building int-block codecs). There's also the
PForDelta codec, attached to LUCENE-1410, which subclasses the
intblock codec and uses PForDelta encoding. It's probably best to
peek at these example codecs for inspiration on how to build yours.

Mike
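
For readers unfamiliar with int-block encodings, the sketch below shows
the general idea on a single block: doc ids are turned into gaps (deltas)
and every gap is stored with the same number of bits. This is plain
delta plus fixed-width bit packing under simplifying assumptions; it
leaves out the exception ("patching") mechanism that distinguishes
PForDelta, and it is not the code attached to LUCENE-1410 or in the
flex_1458 branch. The class DeltaBitPackSketch and all its method names
are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of delta + fixed-width bit packing for one block.
// Assumes doc ids are non-negative and strictly increasing.
final class DeltaBitPackSketch {

    // Replace each doc id by its gap from the previous one (first gap is from 0).
    static int[] toDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return deltas;
    }

    // Inverse of toDeltas: running sum of the gaps.
    static int[] fromDeltas(int[] deltas) {
        int[] ids = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            ids[i] = sum;
        }
        return ids;
    }

    // Smallest bit width (at least 1) able to hold every non-negative value.
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    // Pack each value into 'bits' bits, back to back, inside a long array.
    static long[] pack(int[] values, int bits) {
        long[] packed = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & 0xFFFFFFFFL;
            packed[word] |= v << shift;
            if (shift + bits > 64) {
                // Value straddles two words: spill the high bits into the next one.
                packed[word + 1] |= v >>> (64 - shift);
            }
        }
        return packed;
    }

    // Read back 'count' values of 'bits' bits each.
    static int[] unpack(long[] packed, int count, int bits) {
        int[] values = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 20, 1000, 1001, 1500, 1501};
        int[] deltas = toDeltas(docIds);
        int bits = bitsRequired(deltas);
        long[] packed = pack(deltas, bits);
        int[] decoded = fromDeltas(unpack(packed, docIds.length, bits));
        System.out.println("bits per gap: " + bits);
        System.out.println("round trip ok: " + Arrays.equals(docIds, decoded));
    }
}
```

A single large gap forces every value in the block to use the wide bit
width; handling such outliers separately is the motivation for the
exception mechanism this sketch omits.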



renaud.delbru at deri

Nov 16, 2009, 5:38 AM

Post #7 of 10
Re: IndexingChain and TermHash [In reply to]

Hi,

On 16/11/09 13:01, Michael McCandless wrote:
> Yes, the branch is here:
>
> https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458
>
> Mark (Miller) periodically re-sync's it to trunk.
>
Good, thanks !
> All tests should pass, and if you create a new Codec, please share the
> experience!
>
I will.
--
Renaud Delbru



renaud.delbru at deri

Dec 11, 2009, 9:19 AM

Post #8 of 10
Re: IndexingChain and TermHash [In reply to]

Hi Michael,

I am reporting my experience with the codec interface. I have
successfully implemented my own encoding, which is a kind of simplified
tree-based encoding (similar to what you can find in XML IR). You can
find more information about my project (siren) at [1]. The basic idea is
to encode a term with three different identifiers (doc id, tuple id, and
cell id) instead of only the doc id. Each term therefore belongs to a
tree leaf and is tagged with the leaf path (doc id, tuple id, cell id).

I have converted the siren project to use my new encoding, and all the
unit tests pass, which is good news (it means there is no problem with
the skip lists, term enumeration, or posting list reading).

For my use case, I had to "hijack" the normal use of the payload
interface. The codec receives only the following information: doc id,
position, and payload. In order to pass the tuple id and cell id to my
codec, I had to encode them into a payload in my analyzer, then decode
them in my codec (in the StandardPositionsConsumer) and write them into
the index directly (rather than storing them as a payload, as the
standard codec does). Then, in the StandardPositionProducer, I had to
decode them from the index and re-encode them into the payload interface
in order to make the segment merger work properly.

So, my remark here is about a potential improvement to the codec
interface. I don't know whether it can be done easily or whether it is
worth it, but an interface (an optional parameter) that allows passing
additional information from the analyzers (e.g., certain attributes)
directly into the codec could be handy (without having to pass it
through the payload feature as I have done).

Another minor problem is that in the current 1458 branch, the
IndexReader.open method that accepts the Codecs object is private. So,
for the moment, I am obliged to first open an IndexWriter with my
codecs, and then use the IndexWriter.getReader to get an IndexReader.

Otherwise, congratulations on this very nice feature and piece of work.
This is something I have wanted for a long time (I am doing research in
the domain of inverted index data structures), and this feature opens a
wide range of new possibilities.

I am planning to implement variants of my current codecs in the short
term, and more complex ones (with other skip list methods) in the medium
term. I will continue to follow the progress of 1458, test it, and
report my feedback and experience with it.

Thanks,
Best Regards

[1] http://siren.sindice.com
--
Renaud Delbru
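
The payload round trip described in this message (tuple id and cell id
packed into a payload by the analyzer, then unpacked by the codec) could
look roughly like the sketch below. The byte layout (two big-endian
32-bit ints) and the class name TupleCellPayload are assumptions made for
illustration; the message does not say how siren actually lays out the
bytes, and no Lucene payload classes are used here.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: packs (tuple id, cell id) into an 8-byte payload and back.
final class TupleCellPayload {

    // Analyzer side: serialize both ids as two big-endian 32-bit integers.
    static byte[] encode(int tupleId, int cellId) {
        return ByteBuffer.allocate(8).putInt(tupleId).putInt(cellId).array();
    }

    // Codec side: recover {tupleId, cellId} from the payload bytes.
    static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            throw new IllegalArgumentException("expected an 8-byte payload");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int tupleId = buf.getInt();
        int cellId = buf.getInt();
        return new int[] {tupleId, cellId};
    }

    public static void main(String[] args) {
        byte[] payload = encode(5, 2);
        int[] ids = decode(payload);
        System.out.println("tuple id: " + ids[0] + ", cell id: " + ids[1]);
    }
}
```

The same pair of functions would be used in both directions: once when
writing postings, and again when re-exposing the ids as a payload so that
segment merging sees the data it expects.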



renaud.delbru at deri

Jan 7, 2010, 4:46 AM

Post #9 of 10
Re: IndexingChain and TermHash [In reply to]

Hi Michael,

I have started to look at the PFOR codec. However, when I include the
codec files in the flex_1458 branch, the
org.apache.lucene.util.pfor.PFor class, which is the core of the codec,
is missing. Where can I find this class?

Thanks,
Regards
--
Renaud Delbru



lucene at mikemccandless

Jan 7, 2010, 5:43 AM

Post #10 of 10
Re: IndexingChain and TermHash [In reply to]

LUCENE-1410 has the PFor implementation that the PFor codec needs.

Mike

