
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1301) Refactor DocumentsWriter

 

 



jira at apache

Jun 15, 2008, 7:35 PM

Post #1 of 2 (281 views)
[jira] Commented: (LUCENE-1301) Refactor DocumentsWriter

[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605183#action_12605183 ]

Michael Busch commented on LUCENE-1301:
---------------------------------------

Mike, I think the ArrayUtil class is missing in your patch?

> Refactor DocumentsWriter
> ------------------------
>
> Key: LUCENE-1301
> URL: https://issues.apache.org/jira/browse/LUCENE-1301
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1301.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
> And it's very much still a work in progress -- there are intermittent
> thread safety issues, I need to add test cases and test/iterate on
> performance, many "nocommits", etc. This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. E.g., DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing. Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes
> to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
> In this first step, everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
> * Improved concurrency with mixed large/small docs: previously the
> thread state would be tied up when docs finished indexing
> out-of-order. Now, it's not: instead I use a separate class to
> hold any pending state to flush to the doc stores, and immediately
> free up the thread state to index other docs.
> * Buffered norms in memory now remain sparse, until flushed to the
> _X.nrm file. Previously we would "fill holes" in norms in memory,
> as we go, which could easily use way too much memory. Really this
> isn't a solution to the problem of sparse norms (LUCENE-830); it
> just delays that issue from causing memory blowup during indexing;
> memory use will still blow up during searching.
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool, vs letting GC
> recycle them.
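
For readers without the patch at hand, the "consumer" layering described in the quoted issue can be pictured roughly as below. This is only a minimal sketch: DocConsumer and DocFieldConsumer are names taken from the description, but every method signature here, and the classes BufferingFieldConsumer, FanOutDocConsumer and SketchWriter, are hypothetical stand-ins and not the code in LUCENE-1301.patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified signatures; the real abstract classes in the patch differ.
abstract class DocConsumer {
    abstract void processDocument(Map<String, String> doc);
    abstract List<String> flush();
}

abstract class DocFieldConsumer {
    abstract void processField(String name, String value);
    abstract String flush();
}

// Illustrative consumer that buffers in memory until flush,
// in the spirit of the writers that hold data and flush it later.
class BufferingFieldConsumer extends DocFieldConsumer {
    private final String label;
    private int buffered = 0;

    BufferingFieldConsumer(String label) {
        this.label = label;
    }

    @Override
    void processField(String name, String value) {
        buffered++; // a real consumer would buffer norms, postings, etc.
    }

    @Override
    String flush() {
        String summary = label + ":" + buffered;
        buffered = 0;
        return summary;
    }
}

// Illustrative DocConsumer that hands each field to every plugin in the chain.
class FanOutDocConsumer extends DocConsumer {
    private final List<DocFieldConsumer> chain;

    FanOutDocConsumer(List<DocFieldConsumer> chain) {
        this.chain = chain;
    }

    @Override
    void processDocument(Map<String, String> doc) {
        for (Map.Entry<String, String> field : doc.entrySet()) {
            for (DocFieldConsumer consumer : chain) {
                consumer.processField(field.getKey(), field.getValue());
            }
        }
    }

    @Override
    List<String> flush() {
        List<String> flushed = new ArrayList<String>();
        for (DocFieldConsumer consumer : chain) {
            flushed.add(consumer.flush());
        }
        return flushed;
    }
}

// Stand-in for the writer: it only talks to a DocConsumer and does not
// know what the chain underneath is doing.
class SketchWriter {
    private final DocConsumer consumer;

    SketchWriter(DocConsumer consumer) {
        this.consumer = consumer;
    }

    void addDocument(Map<String, String> doc) {
        consumer.processDocument(doc);
    }

    List<String> flush() {
        return consumer.flush();
    }
}
```

Under these assumptions, adding new indexing functionality amounts to passing one more DocFieldConsumer into the list; SketchWriter itself does not change.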

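The second behavioural change in the quoted description (norms stay sparse in memory and holes are filled only at flush) might look something like the following. It is a sketch under stated assumptions, not the NormsWriter from the patch: the class SparseNormsBuffer, its methods, and the defaultNorm parameter are all hypothetical, and doc IDs are assumed to be smaller than maxDoc.

```java
import java.util.Arrays;

// Hypothetical illustration: keep only (docID, norm) pairs for documents
// that actually have the field, and build the dense array at flush time.
class SparseNormsBuffer {
    private int[] docIDs = new int[4];
    private byte[] norms = new byte[4];
    private int upto = 0;

    // Record a norm for one document; memory grows with the number of
    // documents that have the field, not with the highest docID seen.
    void add(int docID, byte norm) {
        if (upto == docIDs.length) {
            docIDs = Arrays.copyOf(docIDs, upto * 2);
            norms = Arrays.copyOf(norms, upto * 2);
        }
        docIDs[upto] = docID;
        norms[upto] = norm;
        upto++;
    }

    int bufferedCount() {
        return upto;
    }

    // "Fill holes" only now: every document without a buffered norm gets
    // defaultNorm (an assumed placeholder value). Assumes docID < maxDoc.
    byte[] flush(int maxDoc, byte defaultNorm) {
        byte[] dense = new byte[maxDoc];
        Arrays.fill(dense, defaultNorm);
        for (int i = 0; i < upto; i++) {
            dense[docIDs[i]] = norms[i];
        }
        upto = 0;
        return dense;
    }
}
```

As the description notes, this only postpones the cost: the flushed array is dense again, so a reader that loads it still pays for every document.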
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Jun 17, 2008, 6:36 PM

Post #2 of 2 (245 views)
[jira] Commented: (LUCENE-1301) Refactor DocumentsWriter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605798#action_12605798 ]

Michael Busch commented on LUCENE-1301:
---------------------------------------

Just a quick update, Mike:
With your latest patch it's compiling fine now. Thanks!
I'm seeing NullPointerExceptions in TestStressIndexing2 though,
but I guess this patch is not final yet.

I haven't read the patch yet, hope I'll find some time soon.

> Refactor DocumentsWriter
> ------------------------
>
> Key: LUCENE-1301
> URL: https://issues.apache.org/jira/browse/LUCENE-1301
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch

