Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-2047) IndexWriter should immediately resolve deleted docs to docID in near-real-time mode

 

 



jira at apache

Nov 10, 2009, 4:51 PM

Post #1 of 8 (506 views)

[ https://issues.apache.org/jira/browse/LUCENE-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

Deletes occur immediately if poolReader is true

I'm not sure updateDocument needs to delete immediately. Since it's also writing a document, the cost of applying its deletes later would be lost in the noise.

> IndexWriter should immediately resolve deleted docs to docID in near-real-time mode
> -----------------------------------------------------------------------------------
>
> Key: LUCENE-2047
> URL: https://issues.apache.org/jira/browse/LUCENE-2047
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 3.0
>
> Attachments: LUCENE-2047.patch
>
>
> Spinoff from LUCENE-1526.
> When deleteDocuments(Term) is called, we currently always buffer the
> Term and only later, when it's time to flush deletes, resolve to
> docIDs. This is necessary because we don't in general hold
> SegmentReaders open.
> But, when IndexWriter is in NRT mode, we pool the readers, and so
> deleting in the foreground is possible.
> It's also beneficial, in that it can reduce the turnaround time when
> reopening a new NRT reader by taking this resolution off the reopen
> path. And if multiple threads are used to do the deletion, then we
> gain concurrency, vs reopen which is not concurrent when flushing the
> deletes.
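
To make the contrast in the description concrete, here is a minimal toy sketch of the two paths: buffering the Term until flush versus resolving it to docIDs in the foreground when readers are pooled. Everything in it is an assumption made for illustration (ToySegment, the String terms, flushDeletes and the class itself are hypothetical names); it does not use Lucene classes and is not the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are illustrative names, not Lucene classes.
final class DeleteResolutionSketch {

    /** Stand-in for one pooled segment reader. */
    static final class ToySegment {
        final Map<String, int[]> postings = new HashMap<>(); // term -> docIDs
        final BitSet deleted = new BitSet();

        /** Resolves the term to docIDs and marks them deleted. */
        synchronized int deleteByTerm(String term) {
            int newlyDeleted = 0;
            for (int docId : postings.getOrDefault(term, new int[0])) {
                if (!deleted.get(docId)) {
                    deleted.set(docId);
                    newlyDeleted++;
                }
            }
            return newlyDeleted;
        }
    }

    final List<ToySegment> segments = new ArrayList<>();
    final List<String> bufferedTerms = new ArrayList<>();
    private final boolean poolReaders;

    DeleteResolutionSketch(boolean poolReaders) {
        this.poolReaders = poolReaders;
    }

    /** Foreground resolution when readers are pooled, otherwise buffer the term. */
    void deleteDocuments(String term) {
        if (poolReaders) {
            for (ToySegment segment : segments) {
                segment.deleteByTerm(term);
            }
        } else {
            synchronized (bufferedTerms) {
                bufferedTerms.add(term);
            }
        }
    }

    /** Resolves all buffered terms at flush time; returns the number of docs deleted. */
    int flushDeletes() {
        int total = 0;
        synchronized (bufferedTerms) {
            for (String term : bufferedTerms) {
                for (ToySegment segment : segments) {
                    total += segment.deleteByTerm(term);
                }
            }
            bufferedTerms.clear();
        }
        return total;
    }
}
```

In the pooled case several threads can call deleteDocuments at once and only contend per segment, whereas the buffered case pays the whole resolution cost inside flushDeletes. The segments list itself is not made thread-safe here; that is left out to keep the sketch short.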

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 4:57 PM

Post #2 of 8 (485 views)

Michael McCandless updated LUCENE-2047:
---------------------------------------

Fix Version/s: 3.1 (was: 3.0)

Pushing to 3.1...



jira at apache

Nov 11, 2009, 10:05 PM

Post #3 of 8 (466 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

I added deleting live for updateDocument.

TestNRTReaderWithThreads and TestIndexWriterReader pass.





jira at apache

Nov 17, 2009, 7:18 PM

Post #4 of 8 (427 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

* There are pending deletes (aka updateDoc generated deletes) per
SR. They're stored in a pending deletes BV in the SR.

* commitMergedDeletes maps the pending deletes into the
mergeReader.

* DW.abort clears the pending deletes from all SRs.

* On successful flush, the SR pending deletes are merged into the
primary del docs BV.

* Deletes are still buffered; however, they're only applied to the
newly flushed segment (rather than all readers). If applying them
fails, I think we need to keep some of the rollback logic from the
original applyDeletes?

* The foreground deleting seems to break a couple of tests in
TestIndexWriter.
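
The pending-deletes lifecycle in the list above (mark, merge into the primary deleted docs on a successful flush, clear on abort) can be sketched roughly as follows. This is a hedged illustration under assumptions: the class and method names are hypothetical, and java.util.BitSet stands in for the BV mentioned above. It is not the patch's code.

```java
import java.util.BitSet;

// Illustrative sketch only: hypothetical names, with BitSet standing in for the BV.
final class PendingDeletesSketch {
    private final BitSet deletedDocs = new BitSet();    // primary (committed) deletes
    private final BitSet pendingDeletes = new BitSet(); // deletes not yet flushed

    /** Records a delete that only becomes visible after a successful flush. */
    synchronized void deletePending(int docId) {
        pendingDeletes.set(docId);
    }

    /** On successful flush: merge the pending deletes into the primary set. */
    synchronized void commitPending() {
        deletedDocs.or(pendingDeletes);
        pendingDeletes.clear();
    }

    /** On abort: drop the pending deletes without touching the primary set. */
    synchronized void abortPending() {
        pendingDeletes.clear();
    }

    synchronized boolean isDeleted(int docId) {
        return deletedDocs.get(docId);
    }

    synchronized int pendingCount() {
        return pendingDeletes.cardinality();
    }
}
```

Keeping the two bit sets separate is what makes the abort case cheap: rolling back is just clearing the pending set.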

Mike, you mentioned testing getReader missing deletes etc (in
response to potential file handle leakage). Which test or
benchmark did you use for this?




jira at apache

Nov 22, 2009, 6:53 PM

Post #5 of 8 (377 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

In the updateDocument and deleteDocument methods, deletes are
buffered per segment reader while synchronized on the writer. Immediately
afterwards, outside the sync block, they're applied to the existing
SRs. If a new SR is added, it's either because of a flush (which has
its own buffered deletes) or because of a merge, which is
synchronized on the writer. In other words, we won't lose deletes:
they'll always be applied on a flush, and the resolution of
deletes to doc ids happens without synchronizing on the writer.

Update document adds the term to the queued deletes, then
resolves and adds the doc ids to an Integer list (for now). This
is where we may want to use a growable int[] or an int set.
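
For reference, a growable int[] of the kind mentioned above could look roughly like this. It is a minimal sketch with a hypothetical class name, not something taken from the patch; it simply avoids boxing each doc id into an Integer.

```java
import java.util.Arrays;

// Illustrative sketch only: a minimal growable int[] for queued doc ids.
final class GrowableIntList {
    private int[] values = new int[8];
    private int size;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    /** Returns a copy trimmed to the number of values added. */
    int[] toArray() {
        return Arrays.copyOf(values, size);
    }
}
```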

Flush applies the queued doc ids deleted by updateDocument to the SRs.

commitMerge merges queued deletes from the incoming SRs. Doc ids
are mapped to the merged reader.
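
The doc id mapping in that last step can be sketched as below. This is a hedged illustration, assuming the merged segment drops documents that were already deleted when the merge started and renumbers the rest contiguously; the class, method and parameter names are hypothetical and not taken from the patch.

```java
import java.util.BitSet;

// Illustrative sketch only: remapping deletes that arrived during a merge.
final class MergedDeleteMapper {

    /**
     * Builds old docID -> new docID for one source segment. Docs already deleted
     * when the merge started are dropped from the merged segment and map to -1.
     */
    static int[] buildDocMap(int maxDoc, BitSet deletedAtMergeStart, int docBase) {
        int[] docMap = new int[maxDoc];
        int next = docBase;
        for (int docId = 0; docId < maxDoc; docId++) {
            docMap[docId] = deletedAtMergeStart.get(docId) ? -1 : next++;
        }
        return docMap;
    }

    /** Carries deletes queued against a source segment over to the merged segment. */
    static void mapDeletes(int[] docMap, int[] queuedDeletes, BitSet mergedDeleted) {
        for (int oldId : queuedDeletes) {
            int newId = docMap[oldId];
            if (newId != -1) {
                mergedDeleted.set(newId);
            }
        }
    }
}
```

The docBase of each source segment is the number of live documents in the segments merged before it, so the second segment's map starts where the first one ends.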



jira at apache

Nov 22, 2009, 7:03 PM

Post #6 of 8 (376 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

When docWriter aborts on the RAM buffer, we clear out the queued updateDoc deletes.



jira at apache

Nov 22, 2009, 9:10 PM

Post #7 of 8 (382 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

DocWriter abort caused a deadlock, noticed in TestIndexWriter.testIOExceptionDuringAbortWithThreads. This is fixed by clearing via the reader pool. Other tests in TIW fail.



jira at apache

Nov 23, 2009, 6:26 PM

Post #8 of 8 (347 views)

Jason Rutherglen updated LUCENE-2047:
-------------------------------------

Attachment: LUCENE-2047.patch

TestIndexWriter passes, mostly due to removing assertions in
reader pool release which presumed deletions would be
synchronized with closing a reader. They're not anymore.
Deletions can come in at almost any time, so a reader may be
closed by the pool while still carrying deletes. The releasing
thread may not be synchronized on the writer because we're allowing
deletions to occur un-synchronized.

I suppose we need more tests to ensure the assertions are in
fact not needed.

