
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments

 

 



jira at apache

Jul 26, 2012, 6:33 AM

Post #1 of 17 (146 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423055#comment-13423055 ]

Robert Muir commented on LUCENE-4258:
-------------------------------------

A few things I don't understand:
* when updating a document, how do you know which terms to apply negative postings to?
* how can the idea of updating "individual terms" work as far as length normalization information? How will that be reconciled?

In truth, I think the term is too fine-grained a level for updates in Lucene because of norms;
only updating whole fields of a document will really work (as then the norm can simply be recomputed and replaced).



> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
> Key: LUCENE-4258
> URL: https://issues.apache.org/jira/browse/LUCENE-4258
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Sivan Yogev
> Original Estimate: 2,520h
> Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Jul 26, 2012, 8:17 AM

Post #2 of 17 (149 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423139#comment-13423139 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

There is more to it than just the referenced email. I've had a couple of discussions in the past about this with various people (and it is my fault that I didn't write them down and share them with the rest of you) -- I'll try to summarize a more detailed proposal below:

*API*
Add an updateFields method which takes a Constraint and OP (eventually, it might replace today's updateDocument):
* Constraint defines 'which documents' should be updated, and follows today's deleteDocument API (takes Term, Query and arrays of each)
* OP defines the actual update to do on those documents:
** It has a TYPE, with 3 values (At least for now):
**# REPLACE_DOC -- replaces an entire document (essentially what updateDocument is today)
**# UPDATE_FIELD -- incrementally update a field
**# REPLACE_FIELD -- replaces a field entirely
** In addition, it takes a Field[] (or Iterable) to remove/add.
** In light of the recent changes to IndexableField and friends, perhaps what it should take is a concrete UpdateField with a boolean specifying whether to add/remove its content. Suggestions are welcome !
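
To make the proposed shape of the operation concrete, here is a minimal sketch in plain Java. It is not a Lucene API: `UpdateOpSketch`, `OpType`, `FieldChange`, and the modeling of a document as a map from field name to values are all hypothetical stand-ins for the OP described above, and the Constraint (which documents to touch) is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed update operation; none of this exists in Lucene.
final class UpdateOpSketch {

    // The three operation types listed in the proposal.
    enum OpType { REPLACE_DOC, UPDATE_FIELD, REPLACE_FIELD }

    // One field change: field name, value, and whether the value is added or removed.
    static final class FieldChange {
        final String field;
        final String value;
        final boolean add;

        FieldChange(String field, String value, boolean add) {
            this.field = field;
            this.value = value;
            this.add = add;
        }
    }

    // Applies one operation to a document modeled as field -> list of values.
    static Map<String, List<String>> apply(Map<String, List<String>> doc,
                                           OpType type,
                                           List<FieldChange> changes) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (type != OpType.REPLACE_DOC) {
            // Field-level operations start from a copy of the existing document.
            for (Map.Entry<String, List<String>> e : doc.entrySet()) {
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        if (type == OpType.REPLACE_FIELD) {
            // Drop the old content of every field named in the operation.
            for (FieldChange c : changes) {
                result.remove(c.field);
            }
        }
        for (FieldChange c : changes) {
            if (c.add) {
                result.computeIfAbsent(c.field, k -> new ArrayList<>()).add(c.value);
            } else {
                List<String> values = result.get(c.field);
                if (values != null) {
                    values.remove(c.value);
                }
            }
        }
        return result;
    }
}
```

In this model, UPDATE_FIELD with "add ACL:SIVAN, remove ACL:ROBERT" leaves the other ACL values and all other fields untouched, while REPLACE_FIELD discards the field's previous values first.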

*Implementation*
The idea is to create StackedSegments, which, well, stack on top of current segments. The inspiration came from deletes, which can be viewed as a segment stacked on an existing segment, that marks which documents are deleted.

Following that semantics, a segment could be comprised of these files:
* Layer 1: _0.prx, _0.fnm, _0.fdt ...
* Layer 2: _0_1.prx, _0_1.fdt (no updates to .fnm) -- override/merge info from layer 1
* Layer 3: _0_2.prx -- override/merge info from layer 2
* Layer 4: _0_1.del -- deletes are *always* the last layer, regardless of their 'layer id' -- _0_1.del overrides everything, even _0_100.prx.
** And they can be stacked on themselves as today, e.g. _0_2.del etc.
I believe that we'll need an UpdateCodec or something ... this is the part of the internal API that we still need to understand better. Help from folks like you Robert will be greatly appreciated !
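
The layering rule above (update layers applied in order, deletes always last) can be sketched as a sort over file names. This is only an illustration: `LayerOrderSketch` is a hypothetical name, and the parser assumes the `_<segment>_<layer>.<ext>` naming used in the example with a decimal layer id, which may not match how Lucene actually encodes generations in file names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: orders the files of one stacked segment; not Lucene code.
final class LayerOrderSketch {

    // Layer id taken from the name: "_0.prx" -> 0, "_0_2.prx" -> 2 (decimal, by assumption).
    static int layerId(String fileName) {
        String stem = fileName.substring(0, fileName.lastIndexOf('.'));
        String[] parts = stem.split("_"); // "_0_2" -> ["", "0", "2"]
        return parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
    }

    static boolean isDeletes(String fileName) {
        return fileName.endsWith(".del");
    }

    // Returns the files in application order: update layers by id, then all deletes.
    static List<String> applicationOrder(List<String> files) {
        List<String> ordered = new ArrayList<>(files);
        ordered.sort((a, b) -> {
            // Deletes sort after every non-delete file, whatever their layer id.
            int byKind = Boolean.compare(isDeletes(a), isDeletes(b));
            if (byKind != 0) {
                return byKind;
            }
            int byLayer = Integer.compare(layerId(a), layerId(b));
            return byLayer != 0 ? byLayer : a.compareTo(b);
        });
        return ordered;
    }
}
```

With this ordering, `_0_1.del` is applied after `_0_100.prx` even though its layer id is smaller, which is the behavior the proposal asks for.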

Two options to encode the posting lists:
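* A single posting list with signed entries: field:value --> +1, -5, +8, +12, -17 ... (simple, but cannot be encoded efficiently)
* Two separate posting lists per term, one for additions and one for removals:
*# +field:value --> 1, 8, 12
*# -field:value --> 5, 17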
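
As a rough illustration of how the two-list encoding would be read back, the sketch below resolves a term's effective documents from the base postings plus one update layer. `SignedPostingsSketch` and `resolve` are hypothetical names, and the sketch assumes a single update layer in which a document appears in at most one of the two lists; it uses a sorted set for clarity rather than the streaming merge a real reader would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: combines base postings with one layer of +/- postings.
final class SignedPostingsSketch {

    // Effective docIDs for a term: (base union plus) minus the removed docs, in docID order.
    static List<Integer> resolve(List<Integer> base, List<Integer> plus, List<Integer> minus) {
        TreeSet<Integer> docs = new TreeSet<>(base);
        docs.addAll(plus);     // "+field:value" postings add documents
        docs.removeAll(minus); // "-field:value" postings remove documents
        return new ArrayList<>(docs);
    }
}
```

Using the numbers from the example, a base list of 3, 5, 17, 20 combined with additions 1, 8, 12 and removals 5, 17 resolves to 1, 3, 8, 12, 20.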

Ideally, the way incremental updates will be applied will follow how deletes are applied today:
* An update always applies to *all* documents that are flushed
* And to all documents currently in the RAM buffer
* But never to documents that are indexed later

Again, this is an internal detail; I'd appreciate it if someone could point us to where that happens in the code today (now with concurrent flushing). I remember PackedDeletes existed at some point, has that changed?

If it's a new Codec, then SegmentReader may not even need to change ...

The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are materialized on disk -- as a sparse bit vector that marks the documents that are no longer associated with it ...

I also think that we should introduce this feature in steps:
# Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable for fields like ACL, TAGS, CATEGORIES etc.
** Ideally, the app would just need to say "add/remove ACL:SHAI to/from document X", rather than passing the entire list of ACLs on every update operation.
** This I believe is also the most common use case for incremental field updates
# Support stored fields, whether as part of (1) or as a follow-on; adding TAG:LUCENE to the postings but not to the stored fields is limiting ...
# Support terms with positions, but no norms. What I'm thinking about are terms that store stuff in the payload, but don't care about the positions themselves. An example is the category dimensions of the facet module, which stores category ordinals in the payload.
#* Positions are tricky, and we'll need to do this carefully, I know. But I don't rule it out at this point.
# Then, support fields with norms. I get your concern Robert, and I agree it's a challenge, hence why I leave it to last. The scenario I have in mind is: a search engine that lets you comment on a result or tag it, and the comment/tag should be added to the document's 'catchall' field for later searches. I think it's a valuable scenario, and this is something I'd like to support. If we cannot find a way to deal with it and the norms, then I see two options:
## Document a limitation to updating a field with norms, at your own risk.
## Enforce REPLACE_FIELD OP on fields with norms.

* Since norms are under DocValues now, maybe that's solvable, I don't know. At the moment I think that we have a lot to do before we worry about norms ...

* I also think that we should start with the simpler ADD_FIELD operation, and not support REMOVE_FIELD ... really to keep things simple at start.

I suggest we do this work in a dedicated branch of course. Ideally, we can port everything to 4.x at some point, as I think most of the changes are internal details ...



jira at apache

Jul 26, 2012, 8:41 AM

Post #3 of 17 (151 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423159#comment-13423159 ]

Robert Muir commented on LUCENE-4258:
-------------------------------------

I don't think I'm sold on introducing the feature in steps.

I think it's critical for something of this magnitude that we figure out the design totally, up-front,
so it will work for the major use-cases. I think it's fine to implement in steps if we need though.

Honestly I think we should throw it all out on the table and get to the real problems I think
that most people face today:
# For many document sizes and use-cases (especially rapidly changing stuff): the real problem is not the
speed of Lucene reindexing the document, it's that the user must rebuild the entire document. Solr solved
this by providing an option where you just say "update field X" and internally it reindexes the
document from stored fields (for that feature to work, the whole thing must be stored). We shouldn't
discard the possibility of implementing cleaner support for a solution like this, which wouldn't
complicate IndexWriter at all.
# A second problem (not solved by the above) is that many people are using scoring factors with a variety
of signals and these are changing often. I think unfortunately, people are often putting these in
a normal indexed field and uninverting these on the FieldCache, requiring the whole document to
be reindexed just because of how they implemented the scoring factor. People could instead solve this
by putting their app's primary key into a docvalues field, allowing them to keep these scoring factors
completely external to Lucene (e.g. their own array or whatever), indexed by their own primary key. But
the problem is I think people want Lucene to manage this; they don't want to implement themselves what's
necessary to make it consistent with commits etc.
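
The second approach (scoring factors kept outside the index, keyed by the application's primary key) can be sketched roughly as follows. Everything here is an assumption for illustration: `ExternalScoreSketch` is a hypothetical class, the `long[]` stands in for a per-document doc values field holding the primary key, and multiplying the text score by the factor is just one possible way to combine them. The sketch deliberately ignores the hard part named above, keeping the external data consistent with commits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the long[] stands in for a per-document doc values field.
final class ExternalScoreSketch {

    private final long[] primaryKeyByDocId;                   // docID -> application primary key
    private final Map<Long, Float> factorByKey = new HashMap<>();

    ExternalScoreSketch(long[] primaryKeyByDocId) {
        this.primaryKeyByDocId = primaryKeyByDocId.clone();
    }

    // Changing a factor touches only the external map; no document is reindexed.
    void setFactor(long primaryKey, float factor) {
        factorByKey.put(primaryKey, factor);
    }

    // Combines the text score with the external factor (1.0 when none is set).
    float score(int docId, float textScore) {
        float factor = factorByKey.getOrDefault(primaryKeyByDocId[docId], 1.0f);
        return textScore * factor;
    }
}
```

Because the factor lives outside the index, an application can change it as often as it likes without touching the document, which is the property the comment is after.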

So we can look at several approaches to solving this stuff. I feel like both these problems could be
solved via a contrib module without modifying indexwriter at all for many use cases: maybe better if
we go for more tight integration. And with those simple approaches I describe above, searching doesn't
get any slower.

But if we really feel like we need a "full incremental update API" (I know there are a few use cases
where it can help, I'm not discarding that), then I feel like there are a few things I want:
* I want scoring to be correct: this is a must. If we provide an incremental update API on IW and it doesn't
achieve the same thing as updateDocument today, then it's broken. But I think it's ok for things to
be temporarily off (as long as this is in a consistent way) until merging takes place, just like
deletes today.
* I want to know for any incremental update API, the cost to search performance.
I want to know, at what document size is any incremental update API actually faster than us just
reindexing the document internally, and how much faster is it? I also want us to consider that
compared to the slowdown in search performance. We should know what the tradeoffs are before committing
such APIs.

I strongly feel like if we just add these incremental APIs to IndexWriter without being careful about these
things, the end result could be that people use them without thinking and end up with slower search
and worse relevance; that's why I am asking so many questions.




jira at apache

Jul 26, 2012, 10:05 AM

Post #4 of 17 (142 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423240#comment-13423240 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

I think it's ok if we introduce IFU for DOCS_ONLY at first, throwing exceptions otherwise. E.g., UpdateField overrides setOmitNorms and such and throws UOE (UnsupportedOperationException)... at first.

Everything else will still work as it is today...

Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.

Sorry for the short response, JIRA isn't great with smartphones :-).



jira at apache

Jul 26, 2012, 10:27 AM

Post #5 of 17 (148 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423258#comment-13423258 ]

Robert Muir commented on LUCENE-4258:
-------------------------------------

{quote}
Codecs didn't handle all segment files first... stored fields and such were added later. I do agree though that we should keep in mind the full range of scenarios.
{quote}

I don't think that's really comparable at all, for two reasons:
1. Codecs can be considered a "rote" refactoring of the XXXWriter in 3.x. I'm not trying to diminish the value but it's just an introduced abstraction layer. Something like this is different in that it's algorithmic.
2. The fact that Codecs only handled postings at first wasn't easy to fix after they were introduced as postings-only. Once they handled postings initially, this was a significant refactoring.

I'm not trying to pick on your proposal, I'm just saying there are things I don't like about the design.
* I think that updating individual terms is a fringe use-case, and not the major use case for incremental updates, which is to update the contents of one field, without reindexing the entire document. This was also noted by someone else on the discussion thread. This issue seems to be solely about supporting the 'tagging' use case, which is just one of many.
* I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means it's not really useful for any search purposes. And we are a search engine library.
* I think that negatives won't compress well, as in general compression algorithms for IR in recent years focus on positive integers.
* I think merging the postings will be slow: I don't like the tradeoff of slowing down searching so much for what I'm not even sure will be a significant speedup to indexing.




jira at apache

Jul 26, 2012, 12:52 PM

Post #6 of 17 (144 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423397#comment-13423397 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

bq. ...which is to update the contents of one field, without reindexing the entire document

I agree, but I distinguish between two operations:
# replacing the content of a field entirely with a new content (or remove the field)
# update the field's content by adding/removing individual terms

bq. I think requiring no positions, no frequencies, and no norms makes it even more fringe. This means it's not really useful for any search purposes. And we are a search engine library.

I disagree. Where I come from, the most common use case where such an operation will be useful is when a single change affects hundreds and sometimes thousands of documents. An example is a document-library-like application which manages folders with ACLs. You can add an ACL to a top-level folder and it affects all the documents and folders beneath it. That sometimes results in reindexing a huge number of documents.

I don't diminish the use case of updating a field for scoring purposes, not at all. Just saying that starting by supporting one use case is more than supporting no use case.

Now, and this probably stems from my lack of understanding of the Lucene internals -- I see "supporting terms that omit TFAP" as a starting point because that is the easiest case, and even that requires a lot of understanding of the internals. After we do that, I'll feel more comfortable discussing other types of updates for other field types ... at least, I'll feel that I have more intelligent things to say :).

Regarding your other concerns, I share them with you, and we of course need to benchmark everything. I don't know whether or how this affects search. But those updates will get merged away when segments are merged, so while I'm sure search will be affected, it's not for eternity - only until that segment is merged. And, I think we need to add a capability to MergePolicy to findSegmentsForMergeUpdates, just like we have for expungeDeletes.

If the first step means that in order to update a field used for scoring (i.e. w/ norms) you need to replace the content of the field entirely with new content, I'm ok with it. As one esteemed member of this community always says "progress, not perfection" - I'm totally sold on that !



jira at apache

Jul 27, 2012, 4:12 AM

Post #7 of 17 (146 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423798#comment-13423798 ]

Robert Muir commented on LUCENE-4258:
-------------------------------------

I don't think it's progress if we add a design that *can only work with omitTFAP and no norms*,
and can only update individual terms, but not fields.

It means that to support these things we have to also totally clear out what's there, and then introduce a new design.

In fact this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely different.




jira at apache

Jul 27, 2012, 4:39 AM

Post #8 of 17 (141 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423813#comment-13423813 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

bq. can only update individual terms, but not fields

Who said that? One of the update operations I listed is REPLACE_FIELD, which means replace the field's content entirely with the new content.

bq. I don't think it's progress if we add a design that can only work with omitTFAP and no norms

I never said that will be the design. What I said is that in order to update a field at the term level, we'll start with such fields only. The rest of the fields (i.e. w/ norms, payloads and what not) will be updated through REPLACE_FIELD. The way I see it, we still address all the issues, only for some fields we require a whole field replace, and not an optimized term-based update. That can be improved further along, or not.

bq. In fact this issue shouldn't be called incremental field updates: it's not. It's "term updates" or something else entirely different.

That is my idea of incremental field updates and I'm not sure that it's not your idea as well :). You seem to only want to support REPLACE_FIELD, while I say that for some field types we can support UPDATE_FIELD (i.e. at the term level), that's it !



jira at apache

Jul 29, 2012, 9:12 PM

Post #9 of 17 (145 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424683#comment-13424683 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

I had a chat about this with Robert a couple of days ago, figured it'll be easier to discuss the differences in approaches/opinions, rather than back and forth JIRA comments. Our idea of incremental field updates is not much different. Robert stressed that in his opinion we should tackle first the REPLACE_FIELD operation, which replaces the content of a field entirely by a new content, because he believes that's the most common scenario (i.e., update the title field). I believe that term-based updates are very important too, at least in the scenarios that I face (i.e. adding/removing one ACL, one social tag, one category etc.).

We concluded that the design should take REPLACE_FIELD into consideration from the get go. Whether we'll also implement UPDATE_FIELD (or UPDATE_TERMS as a better name?) depends on its complexity. Initially, UPDATE_TERMS can be implemented through REPLACE_FIELD, so we don't lose functionality; UPDATE_TERMS can come later as an optimization.

Robert, if I misrepresented our conclusions, please correct me.



jira at apache

Jul 30, 2012, 12:56 AM

Post #10 of 17 (141 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424736#comment-13424736 ]

Sivan Yogev commented on LUCENE-4258:
-------------------------------------

Seems like in any case we need to have a separation between fields given with UPDATE_FIELD and REPLACE_FIELD. There are two ways I could think of for implementing this separation.

The first is at the segment level, where we can have separate "update" and "replace" segments, where the semantics are that a field in an "update" segment is merged with fields in previous segments, while a field in a "replace" segment ignores previous segments.

The second option is to separate at the field level, choosing one type as the default behavior (maybe this can be configurable) and marking the fields of the non-default type by altering the field name or some other solution.

I lean towards the segment level separation, since it requires fewer conventions and will probably require less work for Codec implementations to handle.
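
A minimal sketch of the segment-level semantics, for one field of one document: walking the stack from oldest to newest, a "replace" layer discards whatever came before it, and an "update" layer merges with it. `LayerStackSketch`, `LayerKind`, and `Layer` are hypothetical names, the field content is modeled as a plain list of terms, and removals are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of segment-level "update" vs "replace" layers; not Lucene code.
final class LayerStackSketch {

    enum LayerKind { UPDATE, REPLACE }

    // The terms one stacked segment contributes to a single field of a single document.
    static final class Layer {
        final LayerKind kind;
        final List<String> terms;

        Layer(LayerKind kind, List<String> terms) {
            this.kind = kind;
            this.terms = terms;
        }
    }

    // Resolves the field's terms by walking the stack from oldest to newest layer.
    static List<String> resolve(List<Layer> stackOldestFirst) {
        List<String> terms = new ArrayList<>();
        for (Layer layer : stackOldestFirst) {
            if (layer.kind == LayerKind.REPLACE) {
                terms.clear(); // a replace layer ignores all previous layers
            }
            terms.addAll(layer.terms); // an update layer merges with previous layers
        }
        return terms;
    }
}
```

In this model the base segment behaves like a replace layer at the bottom of the stack, so a reader only needs to look back as far as the most recent replace layer for a given field.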



jira at apache

Jul 30, 2012, 4:11 AM

Post #11 of 17 (156 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424793#comment-13424793 ]

Sivan Yogev commented on LUCENE-4258:
-------------------------------------

BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.



jira at apache

Jul 30, 2012, 4:54 AM

Post #12 of 17 (146 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424803#comment-13424803 ]

Michael McCandless commented on LUCENE-4258:
--------------------------------------------

bq. BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

+1

I think this design sounds good! REPLACE_FIELDS should easily be able
to update norms correctly, right? Because the full per-field stats
are recomputed from scratch. So then scores should be identical:
should be a nice simple testcase to create :)

I don't see how UPDATE_FIELDS can do so unless we somehow save the raw
stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS
should forever be limited to DOCS_ONLY, no norms updating? Positions
also seem hard to update, and if the only reason to do so is for
payloads... seems like the app should be using doc values instead, and
we should (eventually) make doc values updatable?
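
To illustrate the norms problem, here is a minimal sketch in plain Java (not Lucene code). It assumes a length norm of 1/sqrt(numTerms) with a boost of 1, and it uses a made-up `quantize` step as a stand-in for a lossy, compact norm encoding. The class and method names are hypothetical.

```java
/** Toy illustration only; none of these names exist in Lucene. */
class NormUpdateSketch {

    /** Length norm for a field with numTerms terms (boost assumed to be 1). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Hypothetical lossy encoding: a stand-in for storing the norm in very few bits. */
    static int quantize(double norm) {
        return (int) Math.round(norm * 16);
    }

    /** REPLACE: the whole field is re-inverted, so the new length is known exactly. */
    static int normAfterReplace(int newNumTerms) {
        return quantize(lengthNorm(newNumTerms));
    }

    /** UPDATE: the correct result depends on the old raw length, not on the stored norm. */
    static int normAfterUpdate(int oldNumTerms, int addedTerms) {
        return quantize(lengthNorm(oldNumTerms + addedTerms));
    }

    public static void main(String[] args) {
        // Fields of 21 and 40 terms share one stored value in this toy encoding...
        System.out.println(quantize(lengthNorm(21)) + " " + quantize(lengthNorm(40)));
        // ...yet adding 5 terms to each should produce different stored values.
        System.out.println(normAfterUpdate(21, 5) + " " + normAfterUpdate(40, 5));
    }
}
```

In this toy encoding the stored norm alone cannot tell the two fields apart, which is why an in-place update would need the raw length saved somewhere, while a full replace does not.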

I do think this is a common use case (ACLs, filters, social
tags)... though I'm not sure how bad it'd really be in practice for
the app to simply REPLACE_FIELDS with the full set of tags. I guess
if we build REPLACE_FIELDS first we can test that.

The implementation should be able to piggy-back on all the
buffering/tracking we currently do for buffered deletes.

I think this change should live entirely above Codec? Ie Codec just
thinks it's writing a segment, not knowing if that segment is the base
segment, or one of the stacked ones. If the +postings and -postings
are simply 2 terms then the Codec need not know...

Seems like only SegmentInfos needs to track how segments stack up, and
then I guess we'd need a new StackedSegmentReader that is atomic,
holds N SegmentReaders, and presents the merged codec APIs by merging
down the stack on the fly? I suspect this (having to use a PQ to
merge the docIDs in the postings) will be a huge search performance
hit....
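
As a rough sketch of that on-the-fly merge, assume the stacked segments share doc IDs with the base segment and each one exposes a sorted array of doc IDs for a term. `StackedDocIdMergeSketch` and `Cursor` are hypothetical names, not Lucene classes, and negative postings are not handled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy priority-queue merge of per-segment doc ID lists; not Lucene code. */
class StackedDocIdMergeSketch {

    /** Cursor over one segment's sorted doc IDs for a single term. */
    private static final class Cursor {
        final int[] docs;
        int pos;

        Cursor(int[] docs) {
            this.docs = docs;
        }

        int current() {
            return docs[pos];
        }

        boolean advance() {
            return ++pos < docs.length;
        }
    }

    /** Merges the sorted lists into one sorted list, dropping duplicate doc IDs. */
    static List<Integer> merge(List<int[]> perSegmentDocs) {
        PriorityQueue<Cursor> pq = new PriorityQueue<>(
                Math.max(1, perSegmentDocs.size()),
                Comparator.comparingInt(Cursor::current));
        for (int[] docs : perSegmentDocs) {
            if (docs.length > 0) {
                pq.add(new Cursor(docs));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            // Take the cursor with the smallest current doc ID.
            Cursor top = pq.poll();
            int doc = top.current();
            if (merged.isEmpty() || merged.get(merged.size() - 1) != doc) {
                merged.add(doc);
            }
            // Re-insert the cursor only if it has more doc IDs left.
            if (top.advance()) {
                pq.add(top);
            }
        }
        return merged;
    }
}
```

Every doc ID costs one poll and possibly one re-insert, each logarithmic in the stack depth, which is the per-posting overhead the search-performance concern refers to.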

I think UnionDocsAndPositionsEnum (in MultiPhraseQuery.java) is
already doing what we want? (Except it doesn't handle negative
postings).

What about merging? Seems like the merge policy should know about
stacking and should sometimes (aggressively?) merge a stack down?




jira at apache

Jul 30, 2012, 5:03 AM

Post #13 of 17 (140 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424806#comment-13424806 ]

Erick Erickson commented on LUCENE-4258:
----------------------------------------

How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here? https://issues.apache.org/jira/browse/LUCENE-3837



jira at apache

Jul 30, 2012, 7:03 AM

Post #14 of 17 (141 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424870#comment-13424870 ]

Robert Muir commented on LUCENE-4258:
-------------------------------------

{quote}
I don't see how UPDATE_FIELDS can do so unless we somehow save the raw
stats (FieldInvertState) in the index. It seems like UPDATE_FIELDS
should forever be limited to DOCS_ONLY, no norms updating?
{quote}

Actually it's DOCS_ONLY plus OMIT_NORMS.

Anyway, why not start with updating the entire contents of a field, as I suggested?
It seems to be the most general solution, and there is some discussion about how scoring can
work correctly on LUCENE-3837 (the stats, not just norms).

{quote}
I do think this is a common use case (ACLs, filters, social
tags)... though I'm not sure how bad it'd really be in practice for
the app to simply REPLACE_FIELDS with the full set of tags. I guess
if we build REPLACE_FIELDS first we can test that.
{quote}

This is why we should do 'replace contents of a field' first. It's the most well-defined and general.

It's also still controversial: I'm not convinced it will actually help most people who think
they want it; I think it will just slow down searches.




jira at apache

Jul 30, 2012, 11:15 AM

Post #15 of 17 (141 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425079#comment-13425079 ]

Shai Erera commented on LUCENE-4258:
------------------------------------

bq. BTW, since the new method is to handle multiple fields (as the name suggests), the operation descriptions should also be in plural: UPDATE_FIELDS and REPLACE_FIELDS.

OK. To avoid confusion, though, I think we should call it UPDATE_TERMS (not UPDATE_FIELDS). Then someone can call updateFields() twice: once for all the fields he wants to REPLACE, and a second time for the fields whose terms he just wants to update.
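
A toy model of the two operations, assuming UPDATE_TERMS adds terms to a field's existing contents while REPLACE_FIELDS discards the old contents first. The `updateFields` signature, the `Operation` enum, and the map-based document are hypothetical stand-ins for illustration, not the proposed IndexWriter API.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the update semantics; not Lucene code. */
class FieldUpdateSemanticsSketch {

    enum Operation { UPDATE_TERMS, REPLACE_FIELDS }

    /**
     * doc maps a field name to the terms currently indexed for that field.
     * fields maps a field name to the terms supplied by this update call.
     */
    static void updateFields(Map<String, Set<String>> doc, Operation op,
                             Map<String, List<String>> fields) {
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            Set<String> terms = doc.computeIfAbsent(e.getKey(), k -> new TreeSet<>());
            if (op == Operation.REPLACE_FIELDS) {
                // Replace: forget everything previously indexed for this field.
                terms.clear();
            }
            // In both modes the supplied terms become part of the field.
            terms.addAll(e.getValue());
        }
    }
}
```

Under this model, an app that wants to replace some fields and extend others makes two calls, one per operation.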

bq. What about merging?

I wrote about it above -- MergePolicy will need to take care of these stacked segments, and we'll add something like merge/expungeFieldUpdates so the app can call it deliberately.

bq. seems like the app should be using doc values instead, and we should (eventually) make doc values updatable?

I agree we should not UPDATE_TERMS fields that record norms. I'm not sure that every use case of storing info in the payload today can be translated to using DocValues, so I don't want to limit things. So, let's start with UPDATE_TERMS taking care of fields that omit norms. Whether or not we handle payloads for a few use cases can then come later, as an optimization. In the meantime, apps will just need to replace the entire field.

Progress, not perfection ! :)



jira at apache

Aug 1, 2012, 11:58 AM

Post #16 of 17 (121 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426821#comment-13426821 ]

Sivan Yogev commented on LUCENE-4258:
-------------------------------------

bq. How does this relate (if at all, I confess I just looked at the title) to Andrzej's proposal here?

The basic idea is the same. One major difference is that in Andrzej's proposal the stacked updates are added to a new index with different doc IDs, and then the SegmentReader needs to map to the original doc IDs. The plan in this proposal (Shai, correct me if I'm wrong) is for the stacked updates not to be standalone segments. Although they will have the structure of regular segments, they will be tightly coupled with the original segment, with doc IDs matching those of the original segment.




jira at apache

Aug 7, 2012, 5:51 AM

Post #17 of 17 (100 views)
Permalink
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430322#comment-13430322 ]

Sivan Yogev commented on LUCENE-4258:
-------------------------------------

Working on the details, it seems that we need to add a new layer of information for stacked segments. For each field that was added with REPLACE_FIELDS, we need to record the documents in which a replacement took place, together with the number of the latest generation that contained the replacement. Call this list the "generation vector". That way, the TermDocs provided by StackedSegmentReader for a certain term is a special merge of that term's TermDocs across all stacked segments. The "special" part is that we ignore occurrences from documents in which the term's field was replaced in a later generation.

An example. Assume we have doc 1 with title "I love bananas" and doc 2 with title "I love oranges", and the segment is flushed. We will have the following base segment (ignoring positions):

bananas: doc 1
I: doc 1, doc 2
love: doc 1, doc 2
oranges: doc 2

Now we add to doc 1 an additional title field, "I hate apples", replace the title of doc 2 with "I love lemons", and flush. We will have the following segment for generation 1:

apples: doc 1
hate: doc 1
I: doc 1, doc 2
lemons: doc 2
love: doc 2
generation vector for field "title": (doc 2, generation 1)

TermDocs for a few terms:
* title:bananas : {1}. Uses the TermDocs of the base segment; doc 1 has no entry in the title generation vector, so it is not affected.
* title:oranges : {}. Uses the TermDocs of the base segment (generation 0); doc 2's title was replaced in generation 1, so its occurrences from generations < 1 are ignored.
* title:lemons : {2}. Uses the TermDocs of generation 1; doc 2's title is ignored only for generations < 1, and this term appears in generation 1.
* title:love : {1,2}. Uses the TermDocs of both segments; doc 1 comes from the base segment, and doc 2 comes from generation 1, which is not older than its replacement.

I propose to initially use PackedInts for the generation vector, since we know how many generations the current segment has upon flushing. Later we might consider special treatment for sparse vectors.
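
The example above can be modeled with a short plain-Java sketch. It assumes postings are kept per generation as sorted doc ID arrays, and it represents the generation vector as a map from doc ID to the latest generation that replaced the field, rather than as PackedInts. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of the generation-vector merge; not Lucene code. */
class GenerationVectorSketch {

    /**
     * postingsByGeneration.get(g) maps term -> doc IDs in generation g (0 = base segment).
     * generationVector maps doc ID -> latest generation in which the field was replaced;
     * documents whose field was never replaced have no entry.
     */
    static List<Integer> termDocs(String term,
                                  List<Map<String, int[]>> postingsByGeneration,
                                  Map<Integer, Integer> generationVector) {
        SortedSet<Integer> result = new TreeSet<>();
        for (int gen = 0; gen < postingsByGeneration.size(); gen++) {
            int[] docs = postingsByGeneration.get(gen).get(term);
            if (docs == null) {
                continue;
            }
            for (int doc : docs) {
                int replacedIn = generationVector.getOrDefault(doc, 0);
                // Ignore occurrences older than the document's latest replacement.
                if (gen >= replacedIn) {
                    result.add(doc);
                }
            }
        }
        return new ArrayList<>(result);
    }

    /** Postings of the "title" field from the bananas/oranges example. */
    static List<Map<String, int[]>> examplePostings() {
        Map<String, int[]> base = new HashMap<>();
        base.put("bananas", new int[] {1});
        base.put("I", new int[] {1, 2});
        base.put("love", new int[] {1, 2});
        base.put("oranges", new int[] {2});

        Map<String, int[]> gen1 = new HashMap<>();
        gen1.put("apples", new int[] {1});
        gen1.put("hate", new int[] {1});
        gen1.put("I", new int[] {1, 2});
        gen1.put("lemons", new int[] {2});
        gen1.put("love", new int[] {2});

        return Arrays.asList(base, gen1);
    }

    /** Generation vector for "title": doc 2 was replaced in generation 1. */
    static Map<Integer, Integer> exampleGenerationVector() {
        return Collections.singletonMap(2, 1);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"bananas", "oranges", "lemons", "love"}) {
            System.out.println("title:" + term + " -> "
                    + termDocs(term, examplePostings(), exampleGenerationVector()));
        }
    }
}
```

A sorted set keeps the sketch short; a real reader would presumably stream the per-generation postings in doc ID order instead of materializing them.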


