Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

 

 

First page Previous page 1 2 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Nov 7, 2009, 6:31 AM

Post #1 of 45 (1146 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774605#action_12774605 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

Another approach we might take here is to track new deletions using a
sorted tree. New deletions insert in O(log(N)) time, and then we can
efficiently expose a DocIdSetIterator from that.

Then, every search would have this (if present) folded in as a "not"
clause/filter.

Only near real-time readers would have this.

And then, somehow, periodically that tree would have to be merged in
to the real deleted docs if it ever got to big, or, it was time to
commit.

This is a biggish change because suddenly IndexSearcher must be aware
that a given SegmentReader is carrying near-real-time deletions, and
build a BooleanQuery each time. Whereas MultiBitVector nicely keeps
all this under-the-hood.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 7:03 AM

Post #2 of 45 (1111 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774628#action_12774628 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. Another approach we might take here is to track new deletions using a
sorted tree. New deletions insert in O(log(N)) time, and then we can
efficiently expose a DocIdSetIterator from that.

We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

Just a note from our experience over in zoie-land. It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 7:43 AM

Post #3 of 45 (1116 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774631#action_12774631 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure).

Good, real-world feedback -- thanks! This sounds like a compelling
approach.

So the SegmentReader still had its full BitVector, but your OpenIntSet
(what exactly is that?) + the bloom filter is then also checked when
you enum the TermDocs? It's impressive this is fast enough... do you
expect this approach to be faster than the paged "copy on write" bit
vector approach?

bq. It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there.

The flex API should make this possible, without overriding TermDocs
(just expose the Bits interface).

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 12:19 PM

Post #4 of 45 (1112 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774662#action_12774662 ]

John Wang commented on LUCENE-1526:
-----------------------------------

The issue of not using a BitSet/BitVector is not simply performance, but also memory cost.
In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet.
For deleted checks, it is usually of this pattern:

Iterate thru my docs, for each doc i
isDeletedCheck(doc i)

So the cost is not really iterating the deleted set per say, it is the check.

We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.

Code is at:

http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

Feel free to play with it and run some perf numbers on your data.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 1:02 PM

Post #5 of 45 (1106 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774667#action_12774667 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. The issue of not using a BitSet/BitVector is not simply performance, but also memory cost.

But don't you cutover all future searches to the newest NRT reader
right away? Ie, those large bit vectors are all transient, so net mem
cost at any given time should be well contained?

Or... do you keep many readers (against different commit points)
around for a longish time (longer than just allowing for the in-flight
searches to complete)?

bq. In the case where deletes are sparse (expectedly so) and the index is large, BitSet/BitVector is not a good representation of a DocSet.

I agree... it's done so "contains" is fast. But we had looked into
"being sparse" (and using an iterator to "and not" the deleted docs,
at the TermDocs level) and the performance was quite a bit worse...


{quote}
For deleted checks, it is usually of this pattern:
Iterate thru my docs, for each doc i
isDeletedCheck(doc i)

So the cost is not really iterating the deleted set per say, it is the check.
{quote}

Right.

bq. We use bloomfilter to filter out negatives (mostly the case) and back it up with the underlying docset (normally an instance of openhashset) for positives.

This makes sense, and it's a nice solution.

But how do you make this transactional? Do you just make a full clone
of IntSetAccelerator (& the underling int set) on every reopen?

Also, what policy/approach do you use to periodically fold the deletes
back into the SegmentReader? As you accumulate more and more deletes
in the IntSet, cloning becomes more costly.

{quote}
Code is at:

http://code.google.com/p/zoie/source/browse/trunk/java/proj/zoie/api/impl/util/IntSetAccelerator.java

Feel free to play with it and run some perf numbers on your data.
{quote}

Thanks!


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 1:24 PM

Post #6 of 45 (1110 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774671#action_12774671 ]

John Wang commented on LUCENE-1526:
-----------------------------------


We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. 10 M doc index, you have say 2 docs deleted, 0 and 9999999, you are representing it with 5M of memory (where you could have reprsented it with 2 ints, 8 bytes). Sure it is an extremely case, if you look at the avg number of deleted docs vs index size, it is usually sparse. hence we avoided this approach.

We looked at trade-offs between this vs. our approach, for us, it was a worth-while trade-off.

We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 3:54 PM

Post #7 of 45 (1109 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774688#action_12774688 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. We do not hold the deleted set for a long period of time. I agree the memory cost is not a "killer" but it is tremendously wasteful, e.g. 10 M doc index, you have say 2 docs deleted, 0 and 9999999, you are representing it with 5M of memory (where you could have reprsented it with 2 ints, 8 bytes). Sure it is an extremely case, if you look at the avg number of deleted docs vs index size, it is usually sparse. hence we avoided this approach.

Actually a 10 M doc index would be 1.25 MB BitVector right?

But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.

So are you using this, only, as your deleted docs? Ie you don't store
the deletions with Lucene? I'm getting confused if this is only for
the NRT case, or, in general.

Have you compared performance of this vs straight lookup in BitVector?

bq. We do not accumulate deletes. Deletes/updates are tracked per batch, and special TermDocs are returned to skip over the deleted/mod set for the given reader.

OK, I think I'm catching up here... so you only open a new reader at
the batch boundary right? Ie, a batch update (all its adds & deletes)
is atomic from the readers standpoint?

bq. We simply call delete on the diskreader for each batch and internal readers are refreshed with deletes loaded in background.

OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted"). This
gets you your fast turnaround and decent search performance.

In the BG the deletes are applied to Lucene "for real".

This is similar to Marvin's original tombstone deletions, in that the
deletions against old segments are stored with the new segment, rather
than aggressively pushed to the old segment's bit vectors.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 4:00 PM

Post #8 of 45 (1104 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774690#action_12774690 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

Alas, I'm confused again. If your reader searching a given batch is
holding deletions against past segments, you can you make a TermDocs
that filters them? (The TermDocs only enumerates the current batch's
docs).

Or: do you make the IntSetAccelerator for each past segments that
received deletions in the current batch?


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 4:02 PM

Post #9 of 45 (1106 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774691#action_12774691 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

Alas, I'm confused again. If your reader searching a given batch is
holding deletions against past segments, you can you make a TermDocs
that filters them? (The TermDocs only enumerates the current batch's
docs).

Or: do you make the IntSetAccelerator for each past segments that
received deletions in the current batch?


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 7, 2009, 4:21 PM

Post #10 of 45 (1098 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774693#action_12774693 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).

bq. So are you using this, only, as your deleted docs? Ie you don't store
the deletions with Lucene? I'm getting confused if this is only for
the NRT case, or, in general.

These are only to augment the deleted docs *of the disk reader* - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader. It's a copy-on-read ThreadLocal.

So I'm not sure if that described it correctly: only the deletes which should have been applied to the diskReader are treated separately - those are basically batched: for T amount of time or D amount of docs (configurable) whichever comes first, they are applied to the diskReader, which knows about Lucene's regular deletions and now these new ones as well. Once the memory is flushed to disk, the in-memory delSet is emptied, and applied to the diskReader using regular apis before reopening.

bq. OK, I think I'm catching up here... so you only open a new reader at
the batch boundary right? Ie, a batch update (all its adds & deletes)
is atomic from the readers standpoint?

Yes - disk reader, you mean, right? This is only reopened at batch boundary.

bq. OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted"). This
gets you your fast turnaround and decent search performance.

The reopening isn't that quick, but it's in the background, or are you talking about the RAMDirectory? Yeah, that is reopened per query (if necessary - if there are no changes, of course no reopen), but it is kept very small (10k docs or less, for example). It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT



> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 8, 2009, 12:46 AM

Post #11 of 45 (1086 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774743#action_12774743 ]

John Wang commented on LUCENE-1526:
-----------------------------------

Michael:

I think I confused you by not giving you enough background information.

IntSetAccelerator holds UIDs, not docids. For each segment, the termdocs iterates its docid, we map to its corresponding uid and then check in the set.

Hopefully this clears it up.

-John

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 5:18 AM

Post #12 of 45 (1061 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774958#action_12774958 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

{quote}
bq. But, I agree it's wasteful of space when deletes are so sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of these (either during heavy indexing if copy-on-write, or during heavy query load if copy-on-reopen).
{quote}

But how many msec does this clone add in practice?

Note that it's only done if there is a new deletion against that
segment.

I do agree it's silly wasteful, but searching should then be faster
than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the
turnaround time for search perf.

I'd love to see how the worst-case queries (matching millions of hits)
perform with each of these three options.

However, I suspect it's not the clone time that's very costly... I bet
it's the fact that Lucene has to resolve the deletions to docIDs, in
the foreground of reopen, that dominates. And probably also that
Lucene doesn't yet use RAMDir (but LUCENE-1313 is working towards
fixing that).

Ie, Zoie is "late binding" (filters out UIDs as it encounters them
during searching), while Lucene is "early binding" (immediately
resolves UIDs -> docIDs during reopen). And because Zoie does the
"resolve deletions to docIDs" in the BG, it's not on any query's
execution path.

How does Zoie resolve UID -> docID, now? I remember a thread about
this a while back...

Actually one simple fix we could make to Lucene is to resolve
deletions in the foreground, when the deleteDocuments is called.
This'd mean it's the thread that does the updateDocument that pays the
price, rather than a future reopen. Net/net it's a zero sum game
(just distributed the cost from the reopen to the indexing), but it'd
mean the reopen time is minimized, which is clearly the direction we
want to go in. I'll open a new issue.

{quote}
bq. So are you using this, only, as your deleted docs? Ie you don't store the deletions with Lucene? I'm getting confused if this is only for the NRT case, or, in general.

These are only to augment the deleted docs of the disk reader - the disk reader isn't reopened at all except infrequently - once a batch (a big enough RAMDirectory is filled, or enough time goes by, depending on configuration) is ready to be flushed to disk, diskReader.addIndexes is called and when the diskReader is reopened, the deletes live in the normal diskReader's delete set. Before this time is ready, when there is a batch in ram that hasn't been flushed, the IntSetAccelerator is applied to the not-reopened diskReader.
{quote}

I think I now understand it (plus John's comment that these are Zoie's
UIDs not Lucene's docIDs, helped)...

When a doc needs to be updated, you index it immediately into the
RAMDir, and reopen the RAMDir's IndexReader. You add it's UID to the
AcceleratedIntSet, and all searches "and NOT"'d against that set. You
don't tell Lucene to delete the old doc, yet.

Periodically, in the BG, you use addIndexes to push the RAMDir to
disk, and, on a perhaps separate schedule, you resolve the deleted
UIDs to docIDs and flush them to disk.

One question: does Zoie preserve Lucene's "point in time" searching?
Is a new deletion immediately visible to all past reopened readers? I
think for Lucene we need to preserve this, so we need a data structure
that can be "efficiently" transactional. I guess we could consider
allowing an NRT to optionally violate this, in which case we wouldn't
need to do any cloning of the deleted docs.

bq. It's a copy-on-read ThreadLocal.

Hmm -- why does each thread make a full clone of the
AcceleratedBitSet? Just for thread safety against additions to the
set? Or is this somehow preserving "point in time"? And it fully
re-clones whenever new updates have been committed to the RAMDir?

bq. It's actually pretty fantastic performance - check out the zoie perf pages: http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT

These are great results! If I'm reading them right, it looks like
generally you get faster query throughput, and roughly equal indexing
throughput, on upgrading from 2.4 to 2.9?

Zoie also gets much better performance than raw Lucene NRT, but this
test focuses on reopen performance, I think? Ie, a query reopens the
reader if any new docs were indexed? If you change that to, say,
reopen once per N seconds, I wonder how the results would compare.

One optimization you could make with Zoie is, if a real-time deletion
(from the AcceleratedIntSet) is in fact hit, it could mark the
corresponding docID, to make subsequent searches a bit faster (and
save the bg CPU when flushing the deletes to Lucene).


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 11:05 AM

Post #13 of 45 (1056 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775076#action_12775076 ]

Jason Rutherglen commented on LUCENE-1526:
------------------------------------------

{quote} check out the zoie perf pages:
http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_Zo
ieLucene24ZoieLucene29LuceneNRT {quote}

What's missing as it pertains to this jira issue is the raw
query speed (meaning the time a query takes to execute, not QPS,
over various percentages of deleted docs), without concurrent
indexing.

If query speed goes down, users need to know by how much.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 11:13 AM

Post #14 of 45 (1050 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775082#action_12775082 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

I'll try to get those numbers for you, they should be in our logs, and if not, it's easy enough to put them. My guess it that if zoie is doing 7x the QPS, the latency is significantly *less* than the NRT latency, not more, but I could be wrong.

Note that without concurrent indexing, the query speed will be the same, as all docs will be flushed to disk and the isDeleted check reduces to exactly the raw lucene case.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 11:47 AM

Post #15 of 45 (1053 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775095#action_12775095 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

{quote}
What's missing as it pertains to this jira issue is the raw
query speed (meaning the time a query takes to execute, not QPS,
over various percentages of deleted docs), without concurrent
indexing.
{quote}

I'd love to see this too. Ie, this would measure the query performance when deletes are checked against AcceleratedIntSet, and wouldn't measure the reopen cost (which Lucene NRT will be slower at) at all. Specifically I'd love to see the worst case queries. It's those queries that drive the cutover to a shard'd architecture.

Ie, we need to separately measure query performance vs reopen performance.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 12:26 PM

Post #16 of 45 (1058 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775114#action_12775114 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

OK I opened LUCENE-2047, to resolve deletedDoc(s) to their docID(s) in the foreground.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 1:20 PM

Post #17 of 45 (1057 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775151#action_12775151 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But how many msec does this clone add in practice? Note that it's only done if there is a new deletion against that
segment. I do agree it's silly wasteful, but searching should then be faster
than using AccelerateIntSet or MultiBitSet. It's a tradeoff of the
turnaround time for search perf.

I actually don't know for sure if this is the majority of the time, as I haven't actually run both the AcceleratedIntSet or 2.9 NRT through a profiler, but if you're indexing at high speed (which is what is done in our load/perf tests), you're going to be cloning these things hundreds of times per second (look at the indexing throughput we're forcing the system to go through), and even if it's fast, that's costly.

bq. I'd love to see how the worst-case queries (matching millions of hits)
perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that, that's a good idea. You can feel free to check out our load testing framework too - it will let you monkey with various parameters, monitor the whole thing via JMX, and so forth, both for the full zoie-based stuff, and where the zoie api is wrapped purely around Lucene 2.9 NRT. The instructions for how to set it up are on the zoie wiki.

bq. When a doc needs to be updated, you index it immediately into the
RAMDir, and reopen the RAMDir's IndexReader. You add it's UID to the
AcceleratedIntSet, and all searches "and NOT"'d against that set. You
don't tell Lucene to delete the old doc, yet.

Yep, basically. The IntSetAccellerator (of UIDs) is set on the (long lived) IndexReader for the disk index - this is why it's done as a ThreadLocal - everybody is sharing that IndexReader, but different threads have different point-in-time views of how much of it has been deleted.

bq. These are great results! If I'm reading them right, it looks like
generally you get faster query throughput, and roughly equal indexing
throughput, on upgrading from 2.4 to 2.9?

That's about right. Of course, the comparison between zoie with either 2.4 or 2.9 against lucene 2.9 NRT is an important one to look at: zoie is pushing about 7-9x better throughput for both queries and indexing than NRT.

I'm sure the performance numbers would change if we allowed not realtimeness, yes, that's one of the many dimensions to consider in this (along with percentage of indexing events which are deletes, how many of those are from really old segments vs. newer ones, how big the queries are, etc...).

bq. One optimization you could make with Zoie is, if a real-time deletion
(from the AcceleratedIntSet) is in fact hit, it could mark the
corresponding docID, to make subsequent searches a bit faster (and
save the bg CPU when flushing the deletes to Lucene).

That sound interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal, where would this mark live?


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 9, 2009, 10:40 PM

Post #18 of 45 (1027 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775319#action_12775319 ]

John Wang commented on LUCENE-1526:
-----------------------------------

bq. I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

I wrote a small program on my laptop, 100 docs in the set, iterates thru 5M numbers and calls contains().
I see 44 ms with BitVector and 64ms with IntAccelerator backed by IntOpenHashSet (from fastUtil)

This is however an extreme case, so test 2, I chose 5000 docs from the set, e.g. mod 1000 to be a candidate for check. And both sets performed equally, around 45ms.

So with the memory cost, and the allocations and clones of the BitVector, I think for us at least, using the IntSetAccelerator works well.

bq. why does each thread make a full clone of the AcceleratedBitSet?

These are for updates, e.g. you updated doc x, it is updated to the ramdir, but it is already on the disk dir. So at query time, you need this set for dup removal.

bq. I'd love to see this too.

Some more details on the test we ran:

NRT - indexing only
***********************************************************
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 622201
TOTAL EXECUTIONS: 622201
TOTAL SUCCESSFUL EXECUTIONS: 622201
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - indexing only
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 6265384
TOTAL EXECUTIONS: 6265384
TOTAL SUCCESSFUL EXECUTIONS: 6265384
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

zoie - update
SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 1923592
TOTAL EXECUTIONS: 1923592
TOTAL SUCCESSFUL EXECUTIONS: 1923592
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

nrt - update

SUMMARY:
***********************************************************
TOTAL TRANSACTIONS: 399893
TOTAL EXECUTIONS: 399893
TOTAL SUCCESSFUL EXECUTIONS: 399893
TOTAL FAILED EXECUTIONS: 0
TOTAL RUNTIME IN MINS: 30.07
INTERVAL FOR AVERAGE TIME CAPTURE IN MINS: 1
***********************************************************

Latencies:

Zoie - insert test: linear growth from 1 ms to 5 ms as index grows in the duration of the test from 0 docs to 660k docs.
Zoie - update test: averaged at 9ms, as index with continuous update and stayed in 1M docs
NRT - insert test: fluctuated between 17 ms to 50 ms as index grows in the duration of the test from 0 docs to 220 docs.
NRT - update test: big peak when query started, latency spiked up to 550ms and then dropped and stayed steadily at 50ms, with continuous updates to stay in 1M docs.

Some observation at the NRT update test, I am seeing some delete issues, e.g. realtime deletes does not seem to reflect, and indexing speed sharply dropped.

It's quite possible that I am not using NRT the most optimal way in my setup. Feel free to run the tests yourself. I'd happy to help with the setup.
One thing with Zoie is that it is a full stream indexing system with a pluggable realtime engine, so you can actually use zoie for perf testing for NRT.

One thing about the test to stress, we are testing realtime updates, so buffered indexing events up and flush once it a while is not realtime, and katta has already achieved good results with batch indexing with just minutes of delay, without making any internal changes to lucene.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 2:46 AM

Post #19 of 45 (1018 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775371#action_12775371 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

Thanks for running these tests John.

The micro-benchmark of BitVector vs IntAccelerator is nice, but, we
need to see it in the real-world context of running actual worst case
queries.

Zoie aims for super fast reopon time, at the expense of slower query
time since it must double-check the deletions.

Lucene NRT makes the opposite tradeoff.

The tests so far make it clear that Zoie's reopen time is much faster
than Lucene's, but they don't yet measure (as far as I can see) what
cost the double-check for deletions is adding to Zoie for the
worst-case queries.

So if you really need to reopen 100s of times per second, and can
accept that your worst case queries will run slower (we're still not
sure just how much slower), the Zoie approach is best.

If you want full speed query performance, and can instead reopen once
per second or once every 2 seconds, Lucene's approach will be better
(though we still have important fixes to make -- LUCENE-2047,
LUCENE-1313).

Can you describe the setup of the "indexing only "test? Are you doing
any reopening at all?


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 2:48 AM

Post #20 of 45 (1021 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775373#action_12775373 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

{quote}
bq. One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

That sound interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal, where would this mark live?
{quote}

Whenever a query happens to "discover" a pending deletion, you could
record somewhat the UID -> docID mapping, and then when it's time to
flush the deletes you don't need to re-resolve the UIDs that the query
had resolved for you. Likely in practice this would be a tiny
speedup, though, so the added complexity is probably not worth it.
Especially since this resolution is done in the BG for Zoie...


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 8:29 AM

Post #21 of 45 (1018 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775930#action_12775930 ]

John Wang commented on LUCENE-1526:
-----------------------------------

bq. we need to see it in the real-world context of running actual worst case queries.

Isn't checking every document in the corpus for deletes the worse case? e.g. first test?

bq. at the expense of slower query time

According to the test, Zoie's query time is faster.

bq. it must double-check the deletions.

True, this double-check is only done for a candidate for a hit from the underlying query. Normally result set is much smaller than the corpus, the overhead is not large. The overhead is 1 array lookup + a delset look up vs. 1 bitvector lookup.

bq. Can you describe the setup of the "indexing only "test?

starting off with an empty index and keep on adding documents, at the same time, for each search request, return a reader for the current state of the indexing. Our test assumes 10 concurrent threads making search calls.


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 12:51 PM

Post #22 of 45 (1009 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776074#action_12776074 ]

Michael McCandless commented on LUCENE-1526:
--------------------------------------------

{quote}
bq. we need to see it in the real-world context of running actual worst case queries.

Isn't checking every document in the corpus for deletes the worse case? e.g. first test?
{quote}

Zoie must do the IntSet check plus the BitVector check (done by
Lucene), right?

Ie comparing IntSet lookup vs BitVector lookup isn't the comparison
you want to do. You should compare the IntSet lookup (Zoie's added
cost) to 0.

So, for a query that hits 5M docs, Zoie will take 64 msec longer than
Lucene, due to the extra check. What I'd like to know is what
pctg. slowdown that works out to be, eg for a simple TermQuery that
hits those 5M results -- that's Zoie's worst case search slowdown.

{quote}
bq. at the expense of slower query time

According to the test, Zoie's query time is faster.
{quote}

The tests so far are really testing Zoie's reopen time vs Lucene's.

To test the query time, you should set up Zoie w/ some pending
deletions, then turn off all indexing, then run the query test.

That would give us both extreme datapoints -- how much slower Lucene
is at reopening (which the current tests show), and how much slower
Zoie is during searching.

{quote}
bq. it must double-check the deletions.

True, this double-check is only done for a candidate for a hit from the underlying query.
{quote}

Right, Zoie's search slowdown is in proportion to the size of the
result set. So eg for hard queries that produce very few results, the
impact will be negligible. For simple queries that produce lots of
results, the relative cost is highest (but we don't yet know what it
actually is in practice).

It could still be neglible, eg since the "if" rarely triggers, the CPU
should be able to predict it just fine.

Net/net, Zoie has faster reopen time than Lucene, but then pays a
higher price (double check for deletion) for every result of every
search. Users need to know what that price really is, in order to
make informed decision about which approach is best for their
situation.

bq. Normally result set is much smaller than the corpus, the overhead is not large.

Well, this is generally app dependent, and it's the net/net worst case
queries that apps need to worry about. Lucene can't [yet] take
avantage of concurrency within a single query, so you're forced to
shard (= big step up in deployment complexity) once your worst case
queries get too slow.

{quote}
bq. Can you describe the setup of the "indexing only "test?

starting off with an empty index and keep on adding documents, at the same time, for each search request, return a reader for the current state of the indexing. Our test assumes 10 concurrent threads making search calls.
{quote}

Oh I see: that test is just adding documents, vs the 2nd test which is
doing updateDocument. Got it.

So, that's interesting... because, with no deletions, thus no
resolving of Term -> docID, and no cloning of the BitVector, Lucene's
reopen is still quite a bit slower.

What differences remain at this point? Just the fact that the RAM dir
is being used to flush the new [tiny] segments? Hmm what about the
merge policy?


> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 1:29 PM

Post #23 of 45 (1014 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776085#action_12776085 ]

John Wang commented on LUCENE-1526:
-----------------------------------

bq. Zoie will take 64 msec longer than Lucene, due to the extra check.

That is not true. If you look at the report closely, it is 20ms difference, 64ms is the total size. (after I turned on -server, the diff is about 10ms). This is running on my laptop, hardly a production server.

This is also assuming the entire corpus is returned, where we should really take an average of the result set from the query log.

However, to save this "overhead", using BitVector is wasting a lot of memory, which is expensive to clone, new and gc. In a running system, much of that cost is hard to measure. This is simply a question of trade-offs.

Again, I would suggest to run the tests yourself, afterall, it is open source :) and make decisions for yourself, this way, we can get a better understanding from concrete numbers and scenarios.

BTW, is there a performance benchmark/setup for lucene NRT?

bq. The tests so far are really testing Zoie's reopen time vs Lucene's

That is not true either. This test is simply testing searching with indexing turned on. Not specific to re-open. I don't think the statement that the performance difference is solely due to reopen is substantiated. I am seeing the following with NRT:

e.g.
1) file handle leak - Our prod-quality machine fell over after 1 hr of running using NRT due to file handle leaking.
2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved, and I think that leads to performance differences more than the extra array look.
3) I am seeing also correctness issues as well, e.g. deletes don't get applied correctly. I am not sure about the unit test coverage for NRT to comment specifically.

Again, this can all be specific to my usage of NRT or the test setup. That is why I urge you guys to run our tests yourself and correct us if you see areas we are missing to make a fair comparison.




> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 7:15 PM

Post #24 of 45 (988 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776254#action_12776254 ]

Jason Rutherglen commented on LUCENE-1526:
------------------------------------------

The BitVector memory consumption is 125,000 bytes/122K for 1 million documents. I'd be surprised if copying a byte array has a noticeable performance impact.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 10, 2009, 8:49 PM

Post #25 of 45 (980 views)
Permalink
[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776277#action_12776277 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

But Jason, if you're indexing 300 documents a second (possibly all of which are delete+re-add), and querying at a thousand queries a second, how many of these BitVectors are you going to end up making?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
> Key: LUCENE-1526
> URL: https://issues.apache.org/jira/browse/LUCENE-1526
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1526.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

First page Previous page 1 2 Next page Last page  View All Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.