Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 7, 2009, 6:25 AM

Post #1 of 4 (132 views)
Permalink
[jira] Updated: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1526:
---------------------------------------

Summary: For near real-time search, use paged copy-on-write BitVector impl (was: Tombstone deletions in IndexReader)

Changing summary to match evolution of this issue...

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 12, 2009, 11:00 AM

Post #2 of 4 (105 views)
Permalink
[jira] Updated: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1526:
-------------------------------------

Attachment: (was: LUCENE-1526.patch)

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 16, 2009, 7:05 PM

Post #3 of 4 (81 views)
Permalink
[jira] Updated: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1526:
-------------------------------------

Attachment: LUCENE-1526.patch

Here's a working version of this. The page size is statically
configurable by adjusting CONST in PagedBitVector. I set it on
the high side because the next thing is to inline the page and
doc checking into SegmentTermDocs for benchmarking.

The test is fairly randomized, though I think there's more that
can be added.

The pages are saved one by one, either as dgaps or bytes, which
means the .del file format has changed. We can probably read the
old format, write the new format if this is deployed.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Nov 16, 2009, 10:41 PM

Post #4 of 4 (76 views)
Permalink
[jira] Updated: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1526:
-------------------------------------

Attachment: LUCENE-1526.patch

Inlined into SegmentTermDocs. If there's an issue with the del
docs null check we could go extreme and instantiate specialized
instances of SegTD that doesn't perform the check. I'm not sure
what would slow this down, but profiling will lets us know
what's up.

TestIndexReaderReopen and TestIndexWriterReader passes so I
figure we're ready for benchmarking.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch, LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.