jira at apache
Jul 26, 2012, 8:17 AM
[ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423139#comment-13423139 ]
[jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments
Shai Erera commented on LUCENE-4258:
There is more to it than just the referenced email. I've had a couple of discussions about this in the past with various people (and it is my fault that I didn't write them down and share them with the rest of you) -- I'll try to summarize a more detailed proposal below:
Add an updateFields method which takes a Constraint and OP (eventually, it might replace today's updateDocument):
* Constraint defines 'which documents' should be updated, and follows today's deleteDocuments API (takes Term, Query and arrays of each)
* OP defines the actual update to do on those documents:
** It has a TYPE, with 3 values (at least for now):
**# REPLACE_DOC -- replaces an entire document (essentially what updateDocument is today)
**# UPDATE_FIELD -- incrementally update a field
**# REPLACE_FIELD -- replaces a field entirely
** In addition, it takes a Field (or Iterable) to remove/add.
** In light of the recent changes to IndexableField and friends, perhaps what it should take is a concrete UpdateField with a boolean specifying whether to add or remove its content. Suggestions are welcome!
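To make the shape of the proposal a bit more concrete, here is a rough sketch of how such an API could look. Everything in it is hypothetical: UpdateFieldsSketch, Constraint, Op and Writer are illustrative names, not existing Lucene classes, and the Constraint is reduced to a single field:value term instead of Term/Query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types exist in Lucene.
final class UpdateFieldsSketch {
    /** The three operation types listed above. */
    enum Type { REPLACE_DOC, UPDATE_FIELD, REPLACE_FIELD }

    /** Selects the documents to update; a single term stands in for Term/Query. */
    static final class Constraint {
        final String field;
        final String value;

        Constraint(String field, String value) {
            this.field = field;
            this.value = value;
        }
    }

    /** One update: its type, the field content, and whether to add or remove it. */
    static final class Op {
        final Type type;
        final String field;
        final String value;
        final boolean add;

        Op(Type type, String field, String value, boolean add) {
            this.type = type;
            this.field = field;
            this.value = value;
            this.add = add;
        }
    }

    /** Stand-in for the writer: it only records the updates it is given. */
    static final class Writer {
        final List<String> buffered = new ArrayList<>();

        void updateFields(Constraint constraint, Op op) {
            String sign = op.add ? "+" : "-";
            buffered.add(op.type + " " + sign + op.field + ":" + op.value
                + " where " + constraint.field + ":" + constraint.value);
        }
    }
}
```

With this, "add ACL:SHAI to document 42" would be a call such as updateFields(new Constraint("id", "42"), new Op(Type.UPDATE_FIELD, "ACL", "SHAI", true)). The boolean on Op plays the role of the add/remove flag suggested for UpdateField.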
The idea is to create StackedSegments, which, well, stack on top of current segments. The inspiration came from deletes, which can be viewed as a segment stacked on an existing segment, that marks which documents are deleted.
Following that semantics, a segment could be comprised of these files:
* Layer 1: _0.prx, _0.fnm, _0.fdt ...
* Layer 2: _0_1.prx, _0_1.fdt (no updates to .fnm) -- override/merge info from layer 1
* Layer 3: _0_2.prx -- override/merge info from layer 2
* Layer 4: _0_1.del -- deletes are *always* the last layer, regardless of their 'layer id' -- _0_1.del overrides everything, even _0_100.prx.
** And they can be stacked on themselves as today, e.g. _0_2.del etc.
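The layering rule above can be sketched as a sort over file names: update layers are applied in increasing layer id, and .del files always come last. This is only an illustration under assumptions: LayerOrder is a made-up helper, and it parses the layer id as a plain decimal number after the second underscore, which may not match how generations are really encoded in file names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class LayerOrder {
    /** Layer id taken from the name: "_0.prx" -> 0, "_0_2.prx" -> 2 (decimal assumed). */
    static int generation(String file) {
        String stem = file.substring(0, file.lastIndexOf('.'));
        int idx = stem.indexOf('_', 1);
        return idx < 0 ? 0 : Integer.parseInt(stem.substring(idx + 1));
    }

    /** Order in which layers would be applied: updates by layer id, deletes last. */
    static List<String> applyOrder(List<String> files) {
        Comparator<String> byLayer = Comparator
            .comparing((String f) -> f.endsWith(".del")) // false sorts before true
            .thenComparingInt(LayerOrder::generation);
        List<String> sorted = new ArrayList<>(files);
        sorted.sort(byLayer);
        return sorted;
    }
}
```

Given _0_1.del, _0_100.prx, _0.prx, _0_2.prx and _0_1.prx, the sketch yields _0.prx, _0_1.prx, _0_2.prx, _0_100.prx, _0_1.del, so the deletes override even layer 100.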
I believe that we'll need an UpdateCodec or something ... this is the part of the internal API that we still need to understand better. Help from folks like you, Robert, will be greatly appreciated!
Two options to encode the posting lists:
* field:value --> +1, -5, +8, +12, -17 ... (simple, but cannot be encoded efficiently)
* Separate posting lists for added and removed documents:
*# +field:value --> 1, 8, 12
*# -field:value --> 5, 17
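A small sketch of the two encodings, using the doc IDs from the example. It is a toy model under assumptions: UpdatePostings is an illustrative name, doc IDs are assumed to be >= 1 (0 cannot carry a sign here), and removals are applied after additions, which the proposal does not specify.

```java
import java.util.TreeSet;
import java.util.stream.IntStream;

// Hypothetical toy model of the two encodings; not Lucene code.
final class UpdatePostings {
    /** Doc IDs marked '+' in the signed list, e.g. +1, -5, +8 -> 1, 8. */
    static int[] added(int[] signed) {
        return IntStream.of(signed).filter(s -> s > 0).toArray();
    }

    /** Doc IDs marked '-' in the signed list, e.g. +1, -5, +8 -> 5. */
    static int[] removed(int[] signed) {
        return IntStream.of(signed).filter(s -> s < 0).map(s -> -s).toArray();
    }

    /** Base postings plus added docs, minus removed docs, in doc ID order. */
    static int[] resolve(int[] base, int[] added, int[] removed) {
        TreeSet<Integer> docs = new TreeSet<>();
        for (int d : base) docs.add(d);
        for (int d : added) docs.add(d);
        for (int d : removed) docs.remove(d);
        return docs.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

Splitting the signed list gives two ordinary increasing doc ID lists, which is presumably what makes the second option easier to encode with the usual delta scheme.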
Ideally, the way incremental updates will be applied will follow how deletes are applied today:
* An update always applies to *all* documents that are flushed
* And to all documents currently in the RAM buffer
* But never to documents that are indexed later
Again, this is an internal detail, and I'd appreciate it if someone could point us to where that happens in the code today (now with concurrent flushing). I remember PackedDeletes existed at some point; has that changed?
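One way to picture the visibility rule above is with a single counter shared by documents and updates. This is only a mental model, not how the current delete-handling code works: UpdateVisibility and its methods are made-up names, and it ignores concurrency entirely.

```java
// Hypothetical model of the visibility rule; not how IndexWriter tracks deletes.
final class UpdateVisibility {
    private long clock = 0;

    /** Indexes a document (flushed or still in RAM) and returns its sequence number. */
    long addDocument() {
        return ++clock;
    }

    /** Buffers an update and returns its sequence number. */
    long bufferUpdate() {
        return ++clock;
    }

    /** An update touches only documents that were indexed before it arrived. */
    static boolean applies(long updateSeq, long docSeq) {
        return docSeq < updateSeq;
    }
}
```

A document added before the update is affected whether or not it has been flushed; a document added afterwards is not.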
If it's a new Codec, then SegmentReader may not even need to change ...
The REPLACE_FIELD OP is tricky ... perhaps it's like how deletes are materialized on disk -- as a sparse bit vector that marks the documents that are no longer associated with it ...
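As a sketch of that idea, assume each stacked layer carries a bit vector of the documents whose field was replaced. The lower layer's postings for that field are masked by it, and the new layer's postings are added on top. ReplaceFieldMask is an illustrative name, and java.util.BitSet stands in for whatever sparse structure would really be used.

```java
import java.util.BitSet;

// Hypothetical sketch; BitSet stands in for a sparse on-disk bit vector.
final class ReplaceFieldMask {
    /** Builds a bit set from doc IDs, for readability. */
    static BitSet of(int... docs) {
        BitSet bits = new BitSet();
        for (int d : docs) bits.set(d);
        return bits;
    }

    /**
     * Postings visible for one term after a REPLACE_FIELD layer:
     * lower-layer docs that were not replaced, plus docs from the new layer.
     */
    static BitSet visible(BitSet basePostings, BitSet replacedDocs, BitSet newPostings) {
        BitSet result = (BitSet) basePostings.clone();
        result.andNot(replacedDocs); // drop docs whose field was replaced
        result.or(newPostings);      // add docs that carry the term in the new layer
        return result;
    }
}
```

For example, if TAG:OLD matches docs 1, 2 and 3 and doc 2 has its TAG field replaced, TAG:OLD resolves to docs 1 and 3, while TAG:NEW picks up doc 2 from the new layer.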
I also think that we should introduce this feature in steps:
# Support only fields that omit TFAP (i.e. DOCS_ONLY). This is very valuable for fields like ACL, TAGS, CATEGORIES etc.
** Ideally, the app would just need to say "add/remove ACL:SHAI to/from document X", rather than passing the entire list of ACLs on every update operation.
** This I believe is also the most common use case for incremental field updates
# Support stored fields, whether as part of (1) or as a follow-on; adding TAG:LUCENE to the postings but not to the stored fields is limiting ...
# Support terms with positions, but no norms. What I'm thinking about are terms that store stuff in the payload, but don't care about the positions themselves. An example is the category dimensions of the facet module, which stores category ordinals in the payload.
#* Positions are tricky, and we'll need to do this carefully, I know. But I don't rule it out at this point.
# Then, support fields with norms. I get your concern, Robert, and I agree it's a challenge, which is why I leave it for last. The scenario I have in mind is a search engine that lets you comment on a result or tag it, where the comment/tag is added to the document's 'catchall' field for later searches. I think it's a valuable scenario, and it is something I'd like to support. If we cannot find a way to deal with it and the norms, then I see two options:
## Document the limitation: updating a field with norms is at your own risk.
## Enforce REPLACE_FIELD OP on fields with norms.
* Since norms are under DocValues now, maybe that's solvable, I don't know. At the moment I think that we have a lot to do before we worry about norms ...
* I also think that we should start with the simpler ADD_FIELD operation, and not support REMOVE_FIELD ... really to keep things simple at the start.
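To illustrate why norms are the hard part of step 4, assume a length norm of the form 1/sqrt(numTerms), as in classic TF-IDF scoring. Appending terms to a field changes numTerms, so a norm computed at index time goes stale. NormSketch is an illustrative name, and the sketch ignores boosts and the lossy byte encoding of real norms.

```java
// Hypothetical illustration; ignores boosts and norm quantization.
final class NormSketch {
    /** Length norm in the style of classic TF-IDF scoring: 1 / sqrt(number of terms). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }
}
```

A 'catchall' field indexed with 4 terms gets a norm of 0.5; after comments add 12 more terms the correct value would be 0.25, but the base segment still holds 0.5 unless the update also rewrites the norm.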
I suggest we do this work in a dedicated branch of course. Ideally, we can port everything to 4.x at some point, as I think most of the changes are internal details ...
> Incremental Field Updates through Stacked Segments
> Key: LUCENE-4258
> URL: https://issues.apache.org/jira/browse/LUCENE-4258
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Sivan Yogev
> Original Estimate: 2,520h
> Remaining Estimate: 2,520h
> Shai and I would like to start working on the proposal to Incremental Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex).