
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1879) Parallel incremental indexing

 

 



jira at apache

Nov 4, 2009, 4:35 AM

Post #1 of 6 (107 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773466#action_12773466 ]

Michael McCandless commented on LUCENE-1879:
--------------------------------------------

I wonder if we could change Lucene's index format to make this feature
simpler to implement...

Ie, you're having to go to great lengths (since this is built
"outside" of Lucene's core) to force multiple separate indexes to
share everything but the postings files (merge choices, flush,
deletions files, segments files, turning off the stores, etc.).

What if we could invert this approach, so that we use only a single
index/IndexWriter, but we allow "partitioned postings", where sets of
fields are mapped to different postings files in the segment?

Whenever a doc is indexed, postings from the fields are then written
according to this partition. Eg if I map "body" to partition 1, and
"title" to partition 2, then I'd have two sets of postings files for
each segment.

Could something like this work?
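
A minimal sketch of the field-to-partition mapping described above, in plain Java. The class `PostingsPartitioner`, its method names, and the fallback to partition 0 for unmapped fields are all hypothetical, chosen only for illustration; none of this is an existing Lucene API, and the nested maps merely stand in for per-partition postings files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Lucene.
final class PostingsPartitioner {
    // Field name -> partition number. Unmapped fields go to partition 0 (an assumption).
    private final Map<String, Integer> fieldToPartition = new HashMap<>();
    // Partition -> ("field:term" -> docIDs); stands in for one set of postings files per partition.
    private final Map<Integer, Map<String, List<Integer>>> postings = new TreeMap<>();

    void map(String field, int partition) {
        fieldToPartition.put(field, partition);
    }

    int partitionFor(String field) {
        return fieldToPartition.getOrDefault(field, 0);
    }

    // Record that docId contains term in field, writing into the field's partition.
    void addPosting(int docId, String field, String term) {
        postings.computeIfAbsent(partitionFor(field), p -> new TreeMap<>())
                .computeIfAbsent(field + ":" + term, k -> new ArrayList<>())
                .add(docId);
    }

    // All postings written to the given partition (empty if nothing was written).
    Map<String, List<Integer>> partition(int partition) {
        return postings.getOrDefault(partition, new TreeMap<>());
    }

    public static void main(String[] args) {
        PostingsPartitioner p = new PostingsPartitioner();
        p.map("body", 1);
        p.map("title", 2);
        p.addPosting(0, "title", "lucene");
        p.addPosting(0, "body", "index");
        p.addPosting(1, "body", "index");
        System.out.println("partition 1: " + p.partition(1)); // holds only "body" postings
        System.out.println("partition 2: " + p.partition(2)); // holds only "title" postings
    }
}
```

The point of the sketch is only that a single writer sees every document once, while each field's postings end up in the partition it was mapped to.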

> Parallel incremental indexing
> -----------------------------
>
> Key: LUCENE-1879
> URL: https://issues.apache.org/jira/browse/LUCENE-1879
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Fix For: 3.1
>
> Attachments: parallel_incremental_indexing.tar
>
>
> A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
> Find details on the wiki page for this feature:
> http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
> Discussion on java-dev:
> http://markmail.org/thread/ql3oxzkob7aqf3jd

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 4, 2009, 2:13 PM

Post #2 of 6 (95 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773663#action_12773663 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

I realize the current implementation that's attached here is quite
complicated, because it works on top of Lucene's APIs.

However, I really like its flexibility. Right now you can easily
rewrite certain parallel indexes without touching others. I use it in
quite different ways. E.g. you can easily load one parallel index into a
RAMDirectory or onto an SSD and leave the other ones on the conventional disk.

LUCENE-2025 only optimizes a certain use case of parallel indexing,
where you want to (re)write a parallel index containing *only* posting
lists. This will especially improve scenarios like the one Yonik pointed
out a while ago on java-dev, where you want to update only a few
documents, not e.g. a certain field for all documents.

In other use cases it is certainly desirable to have a parallel index
that contains a store. It really depends on what data you want to
update individually.

The version of parallel indexing that goes into Lucene's core I
envision quite differently from the current patch here. That's why I'd
like to refactor the IndexWriter (LUCENE-2026) into a SegmentWriter and,
let's call it, an IndexManager (the component that controls flushing,
merging, etc.). You can then have a ParallelSegmentWriter, which
partitions the data into parallel segments, and the IndexManager can
behave the same way as before.

You can keep thinking about the whole index as a collection of segments;
it will just be a matrix of segments now instead of a one-dimensional
list.

E.g. the norms could in the future be a parallel segment with a single
column-stride field that you can update by writing a new generation of
the parallel segment.

Things like two-dimensional merge policies will nicely fit into this
model.

Different SegmentWriter implementations will allow you to write single
segments in different ways, e.g. doc-at-a-time (the default one with
addDocument()) or term-at-a-time (like addIndexes*() works).

So I agree we can achieve updating posting lists the way you describe,
but it will be limited to posting lists then. If we allow (re)writing
*segments* in both dimensions I think we will create a more flexible
approach which is independent of what data structures we add to Lucene
- as long as they are not global to the index but per-segment, as most
of Lucene's structures are today.

What do you think? Of course I don't want to over-complicate all this,
but if we can get LUCENE-2026 right, I think we can implement parallel
indexing in this segment-oriented way nicely.
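
One way to picture the "matrix of segments" is the following plain-Java sketch. Everything in it (`SegmentMatrix`, `Cell`, `flush`, `rewrite`, the partition names) is a hypothetical illustration and does not correspond to any existing Lucene class or to the attached patch. Rows are segments (the document dimension), columns are parallel partitions, and rewriting one cell produces a new generation of only that cell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these types exist in Lucene.
final class SegmentMatrix {
    // One cell = the files of one partition for one segment, at some generation.
    static final class Cell {
        final String segment;
        final String partition;
        final int generation;

        Cell(String segment, String partition, int generation) {
            this.segment = segment;
            this.partition = partition;
            this.generation = generation;
        }

        @Override
        public String toString() {
            return segment + "/" + partition + "@gen" + generation;
        }
    }

    private final List<String> partitions;                    // dimension 2 (columns)
    private final List<List<Cell>> rows = new ArrayList<>();  // dimension 1 (segments)

    SegmentMatrix(List<String> partitions) {
        this.partitions = new ArrayList<>(partitions);
    }

    // Flushing a segment writes one cell per partition, all covering the same docIDs.
    void flush(String segmentName) {
        List<Cell> row = new ArrayList<>();
        for (String p : partitions) {
            row.add(new Cell(segmentName, p, 0));
        }
        rows.add(row);
    }

    // Rewriting one partition of one segment replaces only that cell with a new generation.
    void rewrite(int row, String partition) {
        int col = partitions.indexOf(partition);
        if (col < 0) {
            throw new IllegalArgumentException("unknown partition: " + partition);
        }
        Cell old = rows.get(row).get(col);
        rows.get(row).set(col, new Cell(old.segment, old.partition, old.generation + 1));
    }

    Cell cell(int row, String partition) {
        return rows.get(row).get(partitions.indexOf(partition));
    }

    public static void main(String[] args) {
        SegmentMatrix m = new SegmentMatrix(Arrays.asList("postings", "norms"));
        m.flush("_0");
        m.flush("_1");
        m.rewrite(0, "norms"); // update the norms of segment _0 only
        System.out.println(m.cell(0, "postings") + " " + m.cell(0, "norms"));
        System.out.println(m.cell(1, "postings") + " " + m.cell(1, "norms"));
    }
}
```

In this picture a component like the IndexManager would still decide when rows are flushed or merged, while a two-dimensional merge policy could additionally decide when individual cells are rewritten.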



jira at apache

Nov 6, 2009, 2:17 AM

Post #3 of 6 (86 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774265#action_12774265 ]

Michael McCandless commented on LUCENE-1879:
--------------------------------------------

This sounds great! In fact your proposal for a ParallelSegmentWriter
is just like what I'm picturing -- making the switching "down low"
instead of "up high" (above Lucene). This'd be more generic than just
the postings files, since all index files can be separately written.

It'd then be a low-level question of whether ParallelSegmentWriter stores
its files in different Directories, or, a single directory with
different file names (or maybe sub-directories within a directory, or,
something else). It could even use FileSwitchDirectory, eg to direct
certain segment files to an SSD (another way to achieve your example).
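
Lucene's FileSwitchDirectory chooses between two Directory instances based on a file's extension. The plain-Java sketch below only imitates that routing idea without depending on Lucene: `ExtensionRouter`, its methods, and the `/mnt/ssd` and `/mnt/disk` locations are hypothetical names for illustration, and the choice of the `frq` and `prx` postings extensions as the "primary" set is just an example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for extension-based routing; it does not use Lucene classes.
final class ExtensionRouter {
    private final Set<String> primaryExtensions;
    private final String primaryLocation;
    private final String secondaryLocation;

    ExtensionRouter(Set<String> primaryExtensions, String primaryLocation, String secondaryLocation) {
        this.primaryExtensions = new HashSet<>(primaryExtensions);
        this.primaryLocation = primaryLocation;
        this.secondaryLocation = secondaryLocation;
    }

    // Returns the text after the last '.', or "" if the name has no extension.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Files with a primary extension go to the primary location, all others to the secondary.
    String locationFor(String fileName) {
        return primaryExtensions.contains(extension(fileName)) ? primaryLocation : secondaryLocation;
    }

    public static void main(String[] args) {
        // Example assumption: keep postings files (.frq, .prx) on the SSD, the rest on disk.
        ExtensionRouter router = new ExtensionRouter(
                new HashSet<>(Arrays.asList("frq", "prx")), "/mnt/ssd", "/mnt/disk");
        for (String name : Arrays.asList("_0.frq", "_0.prx", "_0.fdt", "segments_2")) {
            System.out.println(name + " -> " + router.locationFor(name));
        }
    }
}
```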

This should also fit well into LUCENE-1458 (flexible indexing) -- one
of the added test cases there creates a per-field codec wrapper that
lets you use a different codec per field. Right now, this means
separate file names in the same Directory for that segment, but we
could allow the codecs to use different Directories (or FileSwitchDirectory as well)
if they wanted to.

{quote}
Different SegmentWriter implementations will allow you to write single
segments in different ways, e.g. doc-at-a-time (the default one with
addDocument()) or term-at-a-time (like addIndexes*() works).
{quote}

Can you elaborate on this? How is addIndexes* term-at-a-time?

{quote}
If we allow (re)writing segments in both dimensions I think we will
create a more flexible approach which is independent of what data
structures we add to Lucene
{quote}

Dimension 1 is the docs, and dimension 2 is the assignment of fields
into separate partitions?




jira at apache

Nov 6, 2009, 9:43 AM

Post #4 of 6 (83 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774329#action_12774329 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
This sounds great! In fact your proposal for a ParallelSegmentWriter
is just like what I'm picturing - making the switching "down low"
instead of "up high" (above Lucene). This'd be more generic than just
the postings files, since all index files can be separately written.
{quote}

Right. The goal should be to be able to use this for updating Lucene-internal things (like norms, column-stride fields), while also giving advanced users APIs so that they can partition their data into parallel indexes according to their update requirements (which the current "above Lucene" approach allows).

{quote}
It'd then be a low-level question of whether ParallelSegmentWriter stores
its files in different Directories, or, a single directory with
different file names (or maybe sub-directories within a directory, or,
something else). It could even use FileSwitchDirectory, eg to direct
certain segment files to an SSD (another way to achieve your example).
{quote}

Exactly! We should also keep the distributed indexing use case in mind here. It could make sense for systems like Katta to not only shard in the document direction.

{quote}
This should also fit well into LUCENE-1458
{quote}

Sounds great!




jira at apache

Nov 6, 2009, 9:57 AM

Post #5 of 6 (84 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774338#action_12774338 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
Can you elaborate on this? How is addIndexes* term-at-a-time?
{quote}

Let's say we have an index 1 with two fields, a and b, and you want to create a new parallel index 2 into which you copy all posting lists of field b. You can achieve this with addDocument() if you iterate over all posting lists in 1b in parallel and, for each document in 1, create a corresponding document in 2 that contains the terms from 1b whose posting lists have a posting for the current document. This is what I called the "document-at-a-time approach".

However, this is terribly slow (I tried it out), because you perform I/O on all those posting lists in parallel. It's far more efficient to copy an entire posting list over from 1b to 2, because then you only perform sequential I/O. If you use 2.addIndexes(IndexReader(1b)), exactly this happens, because addIndexes(IndexReader) uses the SegmentMerger to add the index. The SegmentMerger iterates over the dictionary and consumes the posting lists sequentially. That's why I called this the "term-at-a-time approach". In my experience, for a use case similar to the one I described here, this is orders of magnitude more efficient. My doc-at-a-time algorithm ran ~20 hours, the term-at-a-time one 8 minutes! The resulting indexes were identical.
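
The two strategies can be contrasted with a toy in-memory model. This is a sketch under stated assumptions, not Lucene code: a field is modeled as a sorted map from term to ascending docIDs, `PostingsCopy` and its methods are hypothetical names, and the `listVisits` counter is only a rough proxy for how often a copy routine has to switch between posting lists. It says nothing about real I/O timings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy in-memory model, not Lucene code: a field is a sorted map term -> ascending docIDs.
final class PostingsCopy {
    // Counts how often a routine switches to a posting list (rough proxy for seeks).
    static long listVisits;

    // Doc-at-a-time: for every document, probe a cursor in every posting list,
    // rebuild the document's terms, then index that document into the target.
    static Map<String, List<Integer>> docAtATime(Map<String, List<Integer>> source, int maxDoc) {
        Map<String, Integer> cursors = new HashMap<>();
        Map<String, List<Integer>> target = new TreeMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            List<String> termsOfDoc = new ArrayList<>();
            for (Map.Entry<String, List<Integer>> e : source.entrySet()) {
                listVisits++;
                int pos = cursors.getOrDefault(e.getKey(), 0);
                List<Integer> list = e.getValue();
                if (pos < list.size() && list.get(pos) == doc) {
                    termsOfDoc.add(e.getKey());
                    cursors.put(e.getKey(), pos + 1);
                }
            }
            // Stand-in for addDocument(): append this doc to each of its terms.
            for (String term : termsOfDoc) {
                target.computeIfAbsent(term, t -> new ArrayList<>()).add(doc);
            }
        }
        return target;
    }

    // Term-at-a-time: walk the dictionary once and copy each posting list in one pass.
    static Map<String, List<Integer>> termAtATime(Map<String, List<Integer>> source) {
        Map<String, List<Integer>> target = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : source.entrySet()) {
            listVisits++;
            target.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> fieldB = new TreeMap<>();
        fieldB.put("apple", Arrays.asList(0, 2));
        fieldB.put("pear", Arrays.asList(1, 2, 3));

        listVisits = 0;
        Map<String, List<Integer>> byDoc = docAtATime(fieldB, 4);
        long docVisits = listVisits;

        listVisits = 0;
        Map<String, List<Integer>> byTerm = termAtATime(fieldB);
        long termVisits = listVisits;

        System.out.println("identical result: " + byDoc.equals(byTerm));
        System.out.println("list visits doc-at-a-time: " + docVisits + ", term-at-a-time: " + termVisits);
    }
}
```

In the toy model the doc-at-a-time copy visits every posting list once per document, while the term-at-a-time copy visits each list exactly once, which mirrors the access pattern described above.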




jira at apache

Nov 6, 2009, 9:59 AM

Post #6 of 6 (83 views)
[jira] Commented: (LUCENE-1879) Parallel incremental indexing [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774340#action_12774340 ]

Michael Busch commented on LUCENE-1879:
---------------------------------------

{quote}
Dimension 1 is the docs, and dimension 2 is the assignment of fields
into separate partitions?
{quote}

Yes, dimension 1 is unambiguously the docs. Dimension 2 can be the assignment of fields to separate parallel indexes, or also what we today call generations, e.g. for the norms files.

