Mailing List Archive: Lucene: Java-Dev

adding "explicit commits" to Lucene?

 

 



lucene at mikemccandless

Jan 14, 2007, 1:36 PM

Post #1 of 62 (4800 views)
Permalink
adding "explicit commits" to Lucene?

Team,

I've been struggling to find a clean solution for LUCENE-710, when I
thought of a simple addition to Lucene ("explicit commits") that would
I think resolve LUCENE-710 and would fix a few other outstanding
issues when readers are using a "live" index (being updated by a
writer).

The basic idea is to add an explicit "commit" operation to Lucene.

This is the same nice feature Solr has, but just a different
implementation (in Lucene core, in a single index, instead). The
commit makes a "point in time" snapshot (term borrowed from Solr!)
available for searching.

The implementation is surprisingly simple (see below) and completely
backwards compatible.

I'd like to get some feedback on the idea/implementation.


Details...: right now, Lucene writes a new segments_N file at various
times: when a writer (or reader that's writing deletes/norms) needs to
flush its pending changes to disk; when a writer merges segments; when
a writer is closed; multiple times during optimize/addIndexes; etc.

These times are not controllable / predictable to the developer using
Lucene.

A new reader always opens the last segments_N written, and, when a
reader uses isCurrent() to check whether it should re-open (the
suggested way), that method always returns false (meaning you should
re-open) if there are any new segments_N files.

So it's somewhat uncontrollable to the developer what state the index
is in when you [re-]open a reader.

People work around this today by adding logic above Lucene so that the
writer separately communicates to readers when is a good time to
refresh. But with "explicit commits", readers could instead look
directly at the index and pick the right segments_N to refresh to.

I'm proposing that we separate the writing of a new segments_N file
into two kinds: writes that are done automatically by Lucene (I'll
call these "checkpoints") and meaningful (to the application) commits
that are done explicitly by the developer at known times (I'll call
this "committing a snapshot"). I would add a new boolean mode to
IndexWriter called "autoCommit", and a new public method "commit()" to
IndexWriter and IndexReader (we'd have to rename the current protected
commit() in IndexReader).

When autoCommit is true, this means every write of a segments_N file
will be "commit a snapshot", meaning readers will then use it for
searching. This will be the default and this is exactly how Lucene
behaves today. So this change is completely backwards compatible.

When autoCommit is false, then when Lucene chooses to save a
segments_N file it's just a "checkpoint": a reader would not open or
re-open to the checkpoint. This means the developer must then call
IndexWriter.commit() or IndexReader.commit() in order to "commit a
snapshot" at the right time, thereby telling readers that this
segments_N file is a valid one to switch to for searching.
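To make the two modes concrete, here is a minimal in-memory sketch. It
is an illustration only: not the prototype and not Lucene's API. The
class ExplicitCommitModel and all of its methods are hypothetical, and
it assumes for simplicity that every add or delete triggers a
checkpoint (real Lucene checkpoints at times of its own choosing).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory model of the proposed autoCommit switch; not Lucene code. */
final class ExplicitCommitModel {
    private final boolean autoCommit;
    private final List<String> pending = new ArrayList<>();   // writer's current state
    private List<String> snapshot = Collections.emptyList();  // what readers may open

    ExplicitCommitModel(boolean autoCommit) {
        this.autoCommit = autoCommit;
    }

    void addDocument(String doc) {
        pending.add(doc);
        checkpoint();
    }

    void deleteDocument(String doc) {
        pending.remove(doc);
        checkpoint();
    }

    /** Stands in for Lucene writing a segments file whenever it decides to. */
    private void checkpoint() {
        if (autoCommit) {
            commit(); // today's behavior: every checkpoint is visible to readers
        }
    }

    /** Explicit commit: publish the writer's current state as a new snapshot. */
    void commit() {
        snapshot = Collections.unmodifiableList(new ArrayList<>(pending));
    }

    /** A reader that opens now sees only the last committed snapshot. */
    List<String> openReader() {
        return snapshot;
    }

    public static void main(String[] args) {
        ExplicitCommitModel index = new ExplicitCommitModel(false);
        index.addDocument("doc-v1");
        index.commit();

        index.deleteDocument("doc-v1");          // checkpoint only, not visible
        System.out.println(index.openReader());  // prints [doc-v1]
        index.addDocument("doc-v2");
        index.commit();                          // now readers may switch
        System.out.println(index.openReader());  // prints [doc-v2]
    }
}
```

With autoCommit=true the same model publishes after every change, which
is the behavior described above as "exactly how Lucene behaves today".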


The implementation is very simple (I have an initial coarse prototype
working with all but the last bullet):

* If a segments_N file is just a checkpoint, it's named
"segmentsx_N" (note the added 'x'); if it's a snapshot, it's named
"segments_N". No other changes to the index format.

* A reader by default opens the latest snapshot but can optionally
open a specific N (segments_N) snapshot.

* A writer by default starts from the most recent "checkpoint" but
may also take a specific checkpoint or snapshot point N
(segments_N) to start from (to allow rollback).

* Change IndexReader.isCurrent() to see if there are any newer
snapshots but disregard newer checkpoints.

* When a writer is in autoCommit=false mode, it always writes to the
next segmentsx_N; else it writes to segments_N.

* The commit() method would just write to the next segments_N file
and return the N it had written (in case application needs to
re-use it later).

* IndexFileDeleter would need to have a slightly smarter policy when
autoCommit=false, ie, "don't delete anything referenced by one of
the past N snapshots, or by a snapshot that was obsoleted less
than X minutes ago".
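The naming rule and the reader/writer defaults in the bullets above
could be sketched roughly as follows. This is a hedged illustration,
not the prototype: the class SegmentsFileNames and its helpers are
hypothetical, parsing N in base 36 is an assumption about how the
generation is encoded, and treating "most recent checkpoint" as the
highest N of either kind of file is my reading of the proposal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

/** Sketch of the proposed segments_N / segmentsx_N naming rule; names are hypothetical. */
final class SegmentsFileNames {
    static final String SNAPSHOT_PREFIX = "segments_";
    static final String CHECKPOINT_PREFIX = "segmentsx_";

    /** Returns N for a file with the given prefix, or -1 if the name does not match. */
    static long generation(String fileName, String prefix) {
        if (!fileName.startsWith(prefix)) {
            return -1;
        }
        try {
            // Assumption: N is written in base 36.
            return Long.parseLong(fileName.substring(prefix.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Reader default: the newest snapshot, ignoring checkpoints. */
    static OptionalLong latestSnapshot(List<String> files) {
        return files.stream()
                .mapToLong(f -> generation(f, SNAPSHOT_PREFIX))
                .filter(n -> n >= 0)
                .max();
    }

    /** Writer default: the newest segments file of either kind. */
    static OptionalLong latestCheckpoint(List<String> files) {
        return files.stream()
                .mapToLong(f -> Math.max(generation(f, SNAPSHOT_PREFIX),
                                         generation(f, CHECKPOINT_PREFIX)))
                .filter(n -> n >= 0)
                .max();
    }

    /** Proposed isCurrent(): only a newer snapshot makes a reader stale. */
    static boolean isCurrent(long openedGeneration, List<String> files) {
        OptionalLong latest = latestSnapshot(files);
        return !latest.isPresent() || latest.getAsLong() <= openedGeneration;
    }

    public static void main(String[] args) {
        List<String> dir = Arrays.asList("segments_3", "segmentsx_4", "segmentsx_5", "_a.cfs");
        System.out.println(latestSnapshot(dir).getAsLong());    // prints 3
        System.out.println(latestCheckpoint(dir).getAsLong());  // prints 5
        System.out.println(isCurrent(3, dir));                  // prints true
    }
}
```

A reader opened on segments_3 stays "current" even though two newer
checkpoints exist; it only becomes stale once commit() writes a newer
segments_N.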


I think there are some compelling things this could solve:

* The "delete then add" problem (really a special but very common
case of general transactions):

Right now when you want to update a bunch of documents in a Lucene
index, it's best to open a reader, do a "batch delete", close the
reader, open a writer, do a "batch add", close the writer. This
is the suggested way.

The open risk here is that a reader could refresh at any time
during these operations, and find that a bunch of documents have
been deleted but not yet added again.

Whereas, with autoCommit false you could do this entire operation
(batch delete then batch add), and then call the final commit() at
the end, and readers would know not to re-open the index until
that final commit() succeeded.

* The "using too much disk space during optimize" problem:

This came up on the user's list recently: if you aggressively
refresh readers while optimize() is running, you can tie up much
more disk space than you'd expect, because your readers are
holding open all the [possibly very large] intermediate segments.

Whereas, if autoCommit is false and the developer calls optimize()
and then commit(), the readers would know not to re-open
until optimize was complete.

* More general transactions:

It has come up a fair number of times how to make Lucene
transactional, either by itself ("do the following complex series
of index operations but if there is any failure, rollback to the
start, and don't expose result to searcher until all operations
are done") or as part of a larger transaction eg involving a
relational database.

EG, if you want to add a big set of documents to Lucene, but not
make them searchable until they are all added, or until a specific
time (eg Monday @ 9 AM), you can't do that easily today but it
would be simple with explicit commits.

I believe this change would make transactions work correctly with
Lucene.

* LUCENE-710 ("implement point in time searching without relying on
filesystem semantics"), also known as "getting Lucene to work
correctly over NFS".

I think this issue is nearly solved when autoCommit=false, as long
as we can adopt a shared policy on "when readers refresh" to match
the new deletion policy (described above). Basically, as long as
the deleter and readers are playing by the same "refresh rules"
and the writer gives the readers enough time to switch/warm, then
the deleter should never delete something in use by a reader.
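The deletion policy that the deleter and the readers would share (keep
the past N snapshots, plus any snapshot obsoleted less than X minutes
ago) might look something like the sketch below. It is only an
illustration under assumptions: SnapshotRetention, Snapshot, and
deletable() are hypothetical names, and it decides per snapshot rather
than per referenced file, which a real IndexFileDeleter would have to
handle.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a "keep last N, plus a grace period" retention rule; not Lucene code. */
final class SnapshotRetention {
    /** A committed snapshot and when a newer one replaced it (null if still current). */
    static final class Snapshot {
        final long generation;
        final Instant obsoletedAt;

        Snapshot(long generation, Instant obsoletedAt) {
            this.generation = generation;
            this.obsoletedAt = obsoletedAt;
        }
    }

    /**
     * Returns the generations that may be deleted. A snapshot is deletable only
     * if it is not among the newest keepLast snapshots AND it was obsoleted at
     * least grace ago. The input list must be ordered oldest first.
     */
    static List<Long> deletable(List<Snapshot> snapshots, int keepLast,
                                Duration grace, Instant now) {
        List<Long> result = new ArrayList<>();
        int firstKept = Math.max(0, snapshots.size() - keepLast);
        for (int i = 0; i < firstKept; i++) {
            Snapshot s = snapshots.get(i);
            boolean graceExpired = s.obsoletedAt != null
                    && Duration.between(s.obsoletedAt, now).compareTo(grace) >= 0;
            if (graceExpired) {
                result.add(s.generation);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2007-01-15T12:00:00Z");
        List<Snapshot> snapshots = Arrays.asList(
                new Snapshot(1, now.minus(Duration.ofMinutes(60))), // long obsolete
                new Snapshot(2, now.minus(Duration.ofMinutes(5))),  // readers may still be switching
                new Snapshot(3, null));                             // current snapshot
        System.out.println(deletable(snapshots, 1, Duration.ofMinutes(10), now)); // prints [1]
    }
}
```

Readers playing by the same rule would need to refresh within the grace
period after a new commit, so that nothing they hold open is deleted.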



There are also some neat future things made possible:

* The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
could have a more efficient implementation (just like Solr) when
autoCommit is false, because deletes don't need to be flushed
until commit() is called. Whereas, now, they must be aggressively
flushed on each checkpoint.

* More generally, because "checkpoints" do not need to be usable by
a reader/searcher, other neat optimizations might be possible.

EG maybe the merge policy could be improved if it knows that
certain segments are "just checkpoints" and are not involved in
searching.

* I could simplify the approach for my recent addIndexes changes
(LUCENE-702) to use this, instead of its current approach (wish I
had thought of this sooner: ugh!).

* A single index could hold many snapshots, and, we could enable a
reader to explicitly open against an older snapshot. EG maybe you
take weekly and a monthly snapshot because you sometimes want to
go back and "run a search on last week's catalog".

Feedback?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


chuck at manawiz

Jan 14, 2007, 10:16 PM

Post #2 of 62 (4724 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Michael,

This seems to me to be a great idea, especially the ability to support
index transactions.

ParallelWriter (original implementation in LUCENE-600 -- I have a much
better one now) provides a companion writer to ParallelReader. It takes
a Document, breaks it up into subdocuments associated with parallel
indexes that partition the fields, and writes those subdocuments into
their respective parallel indexes. ParallelReader requires that the
parallel indexes remain doc-id synchronized, which severely limits the
opportunity for concurrent writing due to the possibility of the reader
reopening when the indexes are out of sync (more Documents in one than
another) and due to errors writing some subdocument(s) of a set when the
others succeed.

The new version of ParallelWriter, not in jira yet, provides more
concurrency and provides better error recovery than the version there
now, but it is still limited in possible concurrency and in the worst case
(when other recovery options fail) may have to fully optimize the
indexes to back out the case where only a subset of the subdocuments
derived from a given document fail to write. The root cause for the
horrible error recovery case is the uncontrollable and unrevertable
merging that may arise from adding a single document.

I believe what you propose would provide the foundation to fully solve
these problems efficiently, yielding much more concurrency and
guaranteeing efficient error recovery in ParallelWriter. Also it would
simplify some other cases where transactional integrity is essential in
my current app. So this really sounds great.

Possibly related, one of the ways I improved concurrency in
ParallelWriter was to break up IndexWriter.addDocument() into one method
to invert the document and create a RAMSegment and a second method that
takes the RAMSegment and merges it into the index. This allows
inversions to be processed in parallel, while merging is already a
critical section. (Side thought: I've been wondering how hard it would
be to make merging not a critical section). I had thought of the method
to take the RAMSegment and merge it to be the "commit" part of
addDocument().

Your notion of commit is much better and more flexible, but perhaps you
could include this inversion/merge separation as well?

Chuck




lucene at mikemccandless

Jan 15, 2007, 3:49 AM

Post #3 of 62 (4736 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Chuck,

> This seems to me to be a great idea, especially the ability to support
> index transactions.
>
> ParallelWriter (original implementation in LUCENE-600 -- I have a much
> better one now) provides a companion writer to ParallelReader. It takes
> a Document, breaks it up into subdocuments associated with parallel
> indexes that partition the fields, and writes those subdocuments into
> their respective parallel indexes. ParallelReader requires that the
> parallel indexes remain doc-id synchronized, which severely limits the
> opportunity for concurrent writing due to the possibility of the reader
> reopening when the indexes are out of sync (more Documents in one than
> another) and due to errors writing some subdocument(s) of a set when the
> others succeed.
>
> The new version of ParallelWriter, not in jira yet, provides more
> concurrency and provides better error recovery than the version there
> now, but it is still limited in possible concurrency and in the worst case
> (when other recovery options fail) may have to fully optimize the
> indexes to back out the case where only a subset of the subdocuments
> derived from a given document fail to write. The root cause for the
> horrible error recovery case is the uncontrollable and unrevertable
> merging that may arise from adding a single document.
>
> I believe what you propose would provide the foundation to fully solve
> these problems efficiently, yielding much more concurrency and
> guaranteeing efficient error recovery in ParallelWriter. Also it would
> simplify some other cases where transactional integrity is essential in
> my current app. So this really sounds great.

Neat!! This sounds like a perfect fit: with explicit commits in the
index you should be able to greatly simplify ParallelWriter because
you're safe knowing readers would never open an "update in progress"
(ie a checkpoint segmentsx_N), and if you hit any error, you can
easily re-open your ParallelWriter against the last committed snapshot
(segments_N). Ie your error recovery becomes trivial and correct.

I had not thought of this use case. I think there are lots of
important use cases lurking out there that are enabled once we
have explicit commits.

> Possibly related, one of the ways I improved concurrency in
> ParallelWriter was to break up IndexWriter.addDocument() into one method
> to invert the document and create a RAMSegment and a second method that
> takes the RAMSegment and merges it into the index. This allows
> inversions to be processed in parallel, while merging is already a
> critical section. (Side thought: I've been wondering how hard it would
> be to make merging not a critical section). I had thought of the method
> to take the RAMSegment and merge it to be the "commit" part of
> addDocument().

Re side thought:

I think this may be another use case enabled by explicit commits: you
could imagine separate threads building up / merging their own private
set of segments and then merely adding them into the primary index.
What explicit commits can buy you is the fact that all these "private
segments" need not be made searchable until a commit() is called. So
in-between commits there should be a lot of room for concurrency in
merging segments.

> Your notion of commit is much better and more flexible, but perhaps you
> could include this inversion/merge separation as well?

I'm a little confused on what this would mean? Do you mean opening up
separate public methods: one to invert (and get a segment back) and
one to append (and possibly merge) a segment to the index (keeping the
existing addDocument that would then just call these two)? How would
this buy you more concurrency (since the current method indeed only
synchronizes around the merge part)? Oh: would you behind the scenes
take each "single doc" segment and pre-merge them privately and
concurrently, possibly up to many levels, and then finally
add the merged segment into the index? Ie, the beginnings of
"concurrent merge" described above?

Actually couldn't we do this change today (ie without waiting for
explicit commits)? It seems like a separable change.

Mike



chuck at manawiz

Jan 15, 2007, 9:54 AM

Post #4 of 62 (4729 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Michael McCandless wrote on 01/15/2007 01:49 AM:
> Chuck,
>
>> Possibly related, one of the ways I improved concurrency in
>> ParallelWriter was to break up IndexWriter.addDocument() into one method
>> to invert the document and create a RAMSegment and a second method that
>> takes the RAMSegment and merges it into the index. This allows
>> inversions to be processed in parallel, while merging is already a
>> critical section. (Side thought: I've been wondering how hard it would
>> be to make merging not a critical section). I had thought of the method
>> to take the RAMSegment and merge it to be the "commit" part of
>> addDocument().
>
>> Your notion of commit is much better and more flexible, but perhaps you
>> could include this inversion/merge separation as well?
>
> I'm a little confused on what this would mean? Do you mean opening up
> separate public methods: one to invert (and get a segment back) and
> one to append (and possibly merge) a segment to the index (keeping the
> existing addDocument that would then just call these two)? How would
> this buy you more concurrency (since the current method indeed only
> synchronizes around the merge part)? Oh: would you behind the scenes
> take each "single doc" segment and pre-merge them privately and
> concurrently, possibly up to many levels, and then finally
> add the merged segment into the index? Ie, the beginnings of
> "concurrent merge" described above?
>
> Actually couldn't we do this change today (ie without waiting for
> explicit commits)? It seems like a separable change.

Yes, I've already made this change so it is independent, creating
invertDocument(), addInvertedDocument() and abortInvertedDocument().
This enables more concurrency in ParallelWriter because there are no
synchronization restrictions at all on calling invertDocument().
addInvertedDocument() has a synchronization requirement: it can be
called in parallel for each subdocument corresponding to the same
document, but not for subdocuments corresponding to different documents
as this could break the required parallel subindex doc-id
correspondence. Because addDocument() (which is just
addInvertedDocument(invertDocument())) contains the call to
addInvertedDocument() it has the same synchronization requirement,
preventing the extra parallelism in the invertDocument() calls.
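The split described above can be pictured with a toy writer like the
one below. This is a sketch under assumptions, not the ParallelWriter
patch: only the method names invertDocument() and addInvertedDocument()
come from this message, while TwoPhaseWriter, InvertedDoc, the
whitespace tokenization, and the list standing in for the index are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy illustration of splitting addDocument() into invert + add; not Lucene code. */
final class TwoPhaseWriter {
    /** Stand-in for the single-document in-memory segment. */
    static final class InvertedDoc {
        final Map<String, Integer> termFreqs;

        InvertedDoc(Map<String, Integer> termFreqs) {
            this.termFreqs = termFreqs;
        }
    }

    private final List<InvertedDoc> index = new ArrayList<>();

    /** Phase 1: touches no shared state, so any number of threads may call it. */
    static InvertedDoc invertDocument(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                freqs.merge(term, 1, Integer::sum);
            }
        }
        return new InvertedDoc(freqs);
    }

    /** Phase 2: the critical section; returns the doc id assigned to the document. */
    synchronized int addInvertedDocument(InvertedDoc doc) {
        index.add(doc);
        return index.size() - 1;
    }

    synchronized int size() {
        return index.size();
    }

    public static void main(String[] args) throws Exception {
        TwoPhaseWriter writer = new TwoPhaseWriter();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<InvertedDoc>> inverted = new ArrayList<>();
        for (String text : new String[] {"a b a", "c d", "e"}) {
            Callable<InvertedDoc> task = () -> invertDocument(text);
            inverted.add(pool.submit(task)); // inversions run in parallel
        }
        for (Future<InvertedDoc> f : inverted) {
            writer.addInvertedDocument(f.get()); // adds happen in submission order
        }
        pool.shutdown();
        System.out.println(writer.size()); // prints 3
    }
}
```

Adding the results in submission order keeps doc ids deterministic,
which is the kind of ordering constraint the parallel subindexes need.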

It seemed to me that this could be related to your explicit-commits
idea since it also breaks up writes into an uncommitted local portion
and committed portion.

Hope you put your explicit commits idea together soon!

Chuck




rengels at ix

Jan 15, 2007, 10:01 AM

Post #5 of 62 (4724 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Is your parallel adding code available?



chuck at manawiz

Jan 15, 2007, 10:24 AM

Post #6 of 62 (4720 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

robert engels wrote on 01/15/2007 08:01 AM:
> Is your parallel adding code available?
>
There is an early version in LUCENE-600, but without the enhancements
described. I didn't update that version because it didn't capture any
interest and requires Java 1.5 and so it seems will not be committed.

I could update jira with the new version, but would have to create a
clean patch that applies against the Lucene head. My local copy has
diverged due to a number of uncommitted patches and so patches generated
from it contain other stuff.

My use case for parallel subindexes is as an enabler for fast bulk
updates. Only the subindexes containing changing fields need to be
updated, so long as the update algorithm does not change doc-ids. Even
though this requires rewriting entire segments using techniques similar
to those used in merging (but not purging deleted docs), I'm still
getting 30x (when many fields change) to many hundreds-x (when only a
few fields change) faster update performance than the batched
delete-add method on very large indexes (millions of documents, some very
large).

Chuck




rengels at ix

Jan 15, 2007, 10:28 AM

Post #7 of 62 (4708 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

I looked at doing a similar thing with the parallel 'inverting'.

I then decided that it would only make a difference on a multi-CPU
machine, so I put it on the back burner.

But if you have code already done...



DORONC at il

Jan 15, 2007, 1:12 PM

Post #8 of 62 (4727 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Also related is the request, made several times on the list, to be able to
control when docids change, for applications that need to maintain
a mapping from external IDs to Lucene docs but, for performance
reasons, cannot afford to rely only on storing external (DB) IDs in
a Lucene field. See, for instance, the recent discussion "Making document
numbers persistent" in java-user.

So, an application controlled commit would allow an application not to
"experience" document numbering changes - no docid changes would affect the
application until a commit is issued. So the application would be able to
call optimize and then issue a commit, thereby exposing docid changes.
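
A toy Java model may make the visibility rule concrete. It is only a
sketch: ToyDirectory, ToyReader and their methods are invented for
illustration and are not Lucene classes. The only things taken from the
proposal are the segments_N / segmentsx_N naming and the rule that readers
disregard checkpoints.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these classes exist in Lucene.
class ToyCommitDemo {

    /** A directory listing that holds only segments file names. */
    static class ToyDirectory {
        final List<String> files = new ArrayList<>();
        long generation = 0; // N of the most recently written segments file

        // Lucene-initiated write: invisible to readers when autoCommit is false.
        long checkpoint() {
            generation++;
            files.add("segmentsx_" + generation);
            return generation;
        }

        // Application-initiated write: a snapshot that readers may open.
        long commit() {
            generation++;
            files.add("segments_" + generation);
            return generation;
        }

        // Highest N among snapshot files, or -1 if there is none yet.
        long latestSnapshot() {
            long best = -1;
            for (String f : files) {
                if (f.startsWith("segments_")) {
                    long n = Long.parseLong(f.substring("segments_".length()));
                    best = Math.max(best, n);
                }
            }
            return best;
        }
    }

    /** A reader pinned to the snapshot that was latest when it was opened. */
    static class ToyReader {
        final ToyDirectory dir;
        final long openedGeneration;

        ToyReader(ToyDirectory dir) {
            this.dir = dir;
            this.openedGeneration = dir.latestSnapshot();
        }

        // Newer checkpoints are disregarded; only a newer snapshot counts.
        boolean isCurrent() {
            return dir.latestSnapshot() == openedGeneration;
        }
    }

    public static void main(String[] args) {
        ToyDirectory dir = new ToyDirectory();
        dir.commit();                            // segments_1
        ToyReader reader = new ToyReader(dir);
        dir.checkpoint();                        // segmentsx_2, e.g. a merge during optimize
        dir.checkpoint();                        // segmentsx_3
        System.out.println(reader.isCurrent());  // reader is unaffected by checkpoints
        dir.commit();                            // segments_4 exposes the docid changes
        System.out.println(reader.isCurrent());  // reader now knows it should re-open
    }
}
```

In this model the reader keeps its document numbering for as long as it
likes; it only learns that something changed once the application commits.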

One disadvantage of controlling ID changes like this is that search would
have to lag far behind index updates, unless optimize is called.

Therefore (though that's another issue, of course) I am wondering if there
might be interest in allowing applications to control whether deleted docs
may be removed/squeezed out or not.



rengels at ix

Jan 15, 2007, 1:23 PM

Post #9 of 62 (4717 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

I think that you will find a much larger performance decrease in
doing things this way if the external resource is a db, or any
network-accessed resource.

When even just a single document is changed in the Lucene index you
could have MILLIONS of changes to internal doc ids (if say an early
document was deleted).
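
As a rough illustration of the scale involved, the sketch below renumbers
a toy index after deleted documents are squeezed out and counts how many
surviving documents end up with a different id. The class and method names
are made up, and this is not Lucene's merge code, just the arithmetic.

```java
import java.util.BitSet;

// Toy model only: this is not Lucene's merge code.
class DocIdRenumberDemo {

    /**
     * Returns the new id of every surviving document after deleted documents
     * are squeezed out; deleted documents map to -1.
     */
    static int[] renumber(int maxDoc, BitSet deleted) {
        int[] newIds = new int[maxDoc];
        int next = 0;
        for (int oldId = 0; oldId < maxDoc; oldId++) {
            newIds[oldId] = deleted.get(oldId) ? -1 : next++;
        }
        return newIds;
    }

    /** Counts surviving documents whose id changed. */
    static int countChanged(int[] newIds) {
        int changed = 0;
        for (int oldId = 0; oldId < newIds.length; oldId++) {
            if (newIds[oldId] != -1 && newIds[oldId] != oldId) {
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(5); // a single early document is deleted
        int[] newIds = renumber(maxDoc, deleted);
        System.out.println(countChanged(newIds)); // prints 999994
    }
}
```

Every document after the deleted one shifts down by one, so an external
mapping keyed on doc ids would need that many updates.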

Seems far better to store the external id in the Lucene index.

You will find the performance penalty of looking up the Lucene
document by the external id (vs. the internal doc #) to be far less
than the performance penalty of updating every document in the
external index when the Lucene index is merged.

The only case I can see this would be of any benefit is if the Lucene
index RARELY if EVER changes - anything else, and you will have big
problems.

Now, if Lucene is changed to support point-in-time searching
(basically never deleting any index files), you might be able to do
what you want. Just create a Directory exposing only the segments up
to that time.

Sounds VERY messy to me.



rengels at ix

Jan 15, 2007, 7:25 PM

Post #10 of 62 (4718 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Actually, my comment below was not quite accurate. It only matters on
multiple CPU machines if you are writing everything to a memory index
first.

If writing to a filesystem, then multiple threads on a single
processor would allow more documents to be inverted while the disk
writes were occurring, as long as both COULD be done concurrently.


On Jan 15, 2007, at 12:28 PM, robert engels wrote:

> I looked at doing a similar thing with the parallel 'inverting'.
>
> I then decided that it will only make a difference on a multiple
> CPU machine, so I put it on the back burner.
>
> But if you have code already done...
>


ning.li.li at gmail

Jan 15, 2007, 8:29 PM

Post #11 of 62 (4728 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

On 1/14/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
> could have a more efficient implementation (just like Solr) when
> autoCommit is false, because deletes don't need to be flushed
> until commit() is called. Whereas, now, they must be aggressively
> flushed on each checkpoint.

The idea of adding "explicit commits" is good! And well timed: I was
just about to submit the latest patch for LUCENE-565. With this feature,
the frequency of reader open/close on large old segments could be
reduced when autoCommit is false.

Based on your proposal, however, an application wouldn't be able to
delete any documents that have not been committed since a reader
always opens a snapshot (segments_N), but not a checkpoint
(segmentsx_N). This functionality will be supported by LUCENE-565, but
I wonder if it should be supported in general. So maybe a reader can
open the latest checkpoint for modification, but only snapshots for
search...

If a reader can only open snapshots both for search and for
modification, I think another change is needed besides the ones
listed: assume the latest snapshot is segments_5 and the latest
checkpoint is segmentsx_7 with 2 new segments, then a reader opens
snapshot segments_5, performs a few deletes and writes a new
checkpoint segmentsx_8. The summary file segmentsx_8 should include
the 2 new segments which are in segmentsx_7 but not in segments_5.
Such segments to include are easily identifiable only if they are not
merged with segments in the latest snapshot... All these won't be
necessary if a reader always opens the latest checkpoint for
modification, which will also support deletion of non-committed
documents.
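
The segments_5 / segmentsx_7 / segmentsx_8 example can be sketched in
Java as follows. The segment names (_a, _b, ...) and the helper are
hypothetical; the sketch only shows why the extra segments are easy to
find when nothing was merged and hard to find otherwise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Toy model only: segment names and this helper are made up for illustration.
class CheckpointMergeDemo {

    /**
     * Builds the segment list for the reader's new checkpoint: the snapshot's
     * segments (now carrying the reader's deletes) followed by the segments
     * that exist only in the latest checkpoint. Returns empty when a snapshot
     * segment is missing from the checkpoint, i.e. it was merged away and the
     * new segments can no longer be picked out by name alone.
     */
    static Optional<List<String>> nextCheckpoint(List<String> snapshot,
                                                 List<String> checkpoint) {
        if (!checkpoint.containsAll(snapshot)) {
            return Optional.empty();
        }
        List<String> result = new ArrayList<>(snapshot);
        for (String segment : checkpoint) {
            if (!snapshot.contains(segment)) {
                result.add(segment);
            }
        }
        return Optional.of(result);
    }

    public static void main(String[] args) {
        List<String> segments5 = Arrays.asList("_a", "_b", "_c");

        // Easy case: segmentsx_7 only added two new segments.
        List<String> segmentsx7 = Arrays.asList("_a", "_b", "_c", "_d", "_e");
        System.out.println(nextCheckpoint(segments5, segmentsx7));

        // Hard case: the writer merged _c with a new segment into _f.
        List<String> merged = Arrays.asList("_a", "_b", "_f", "_e");
        System.out.println(nextCheckpoint(segments5, merged));
    }
}
```

In the hard case the helper gives up instead of guessing, which is the
situation that opening the latest checkpoint for modification would avoid.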

Lastly, hopefully the term "transaction" won't cause any confusion,
since this "explicit commit" is much simpler than a database transaction,
where a database can guarantee the ACID properties for each of
multiple concurrent transactions.

Cheers,
Ning



chuck at manawiz

Jan 15, 2007, 9:19 PM

Post #12 of 62 (4720 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Ning Li wrote on 01/15/2007 06:29 PM:
> On 1/14/07, Michael McCandless <lucene [at] mikemccandless> wrote:
>> * The "support deleteDocuments in IndexWriter" (LUCENE-565) feature
>> could have a more efficient implementation (just like Solr) when
>> autoCommit is false, because deletes don't need to be flushed
>> until commit() is called. Whereas, now, they must be aggressively
>> flushed on each checkpoint.
>
> If a reader can only open snapshots both for search and for
> modification, I think another change is needed besides the ones
> listed: assume the latest snapshot is segments_5 and the latest
> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
> snapshot segments_5, performs a few deletes and writes a new
> checkpoint segmentsx_8. The summary file segmentsx_8 should include
> the 2 new segments which are in segmentsx_7 but not in segments_5.
> Such segments to include are easily identifiable only if they are not
> merged with segments in the latest snapshot... All these won't be
> necessary if a reader always opens the latest checkpoint for
> modification, which will also support deletion of non-committed
> documents.
This problem seems worse. I don't see how a reader and a writer can
independently compute and write checkpoints. The adds in the writer
don't just create new segments, they replace existing ones through
merging. And the merging changes doc-ids by expunging deletes. It
seems that all deletes must be based on the most recent checkpoint, or
merging of checkpoints to create the next snapshot will be considerably
more complex.

Chuck




rengels at ix

Jan 15, 2007, 9:34 PM

Post #13 of 62 (4732 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

I honestly think that having a unique OID as an indexed field and
putting a layer on top of Lucene is the best solution to all of this.
It makes it almost trivial, and you can implement transaction
handling in a variety of ways.

Attempting to make the doc ids "permanent" is a tough challenge,
considering the original design called for them to be "non permanent".

It seems doubtful that you cannot have some sort of primary key anyway
and be this concerned about the transactional nature of Lucene.

I vote -1 on all of this. I think it will detract from the simple and
efficient storage mechanism that Lucene uses.



chuck at manawiz

Jan 15, 2007, 9:49 PM

Post #14 of 62 (4726 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

My interest is transactions, not making doc-ids permanent.
Specifically, the ability to ensure that a group of adds either all go
into the index or none go into the index, and to ensure that if none go
into the index that the index is not changed in any way.
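
A minimal Java sketch of that all-or-nothing property, using an in-memory
stand-in rather than any Lucene API (ToyIndex and its methods are
hypothetical names): the batch works on a private copy, and the committed
state is replaced only if every add succeeds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: an in-memory stand-in, not a Lucene API.
class AtomicBatchDemo {

    static class ToyIndex {
        private List<String> committed = new ArrayList<>(); // what searchers see
        private List<String> pending = null;                // uncommitted working copy

        void begin() {
            pending = new ArrayList<>(committed);
        }

        void add(String doc) {
            if (doc == null) {
                throw new IllegalArgumentException("document must not be null");
            }
            pending.add(doc);
        }

        void commit() {
            committed = pending;
            pending = null;
        }

        // Discards the working copy; the committed state was never touched.
        void rollback() {
            pending = null;
        }

        List<String> visibleDocs() {
            return Collections.unmodifiableList(committed);
        }

        /** Adds every document or none of them. Returns true on success. */
        boolean addAll(List<String> docs) {
            begin();
            try {
                for (String doc : docs) {
                    add(doc);
                }
                commit();
                return true;
            } catch (RuntimeException e) {
                rollback();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addAll(Arrays.asList("doc1", "doc2"));
        // The null entry makes the second batch fail part-way through.
        boolean ok = index.addAll(Arrays.asList("doc3", null, "doc5"));
        System.out.println(ok + " " + index.visibleDocs());
    }
}
```

The failed batch leaves the visible documents exactly as they were, which
is the side-effect-free rollback that UIDs alone cannot provide.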

I have UIDs, but they cannot ensure the latter property, i.e. they
cannot ensure side-effect-free rollbacks.

Yes, if you have no reliance on internal Lucene structures like doc-ids
and segments, then that shouldn't matter. But many capabilities have
such reliance for good reasons. E.g., ParallelReader, which is a public
supported class in Lucene, requires doc-id synchronization. There are
similar good reasons for an application to take advantage of doc-ids.

Lucene uses doc-ids in many of its APIs, so it is not surprising
that many applications rely on them and, I'm sure, misuse them, not fully
understanding the semantics and uncertainties of doc-id changes due to
merging segments with deletes.

Applications can use doc-ids for legitimate and beneficial purposes
while remaining semantically valid. Making such capabilities efficient
and robust in all cases is facilitated by application control over when
doc-ids and segment structure change at a granularity larger than the
single Document.

If I had a vote it would be +1 on the direction Michael has proposed,
assuming it can be done robustly and without performance penalty.

Chuck




rengels at ix

Jan 15, 2007, 10:11 PM

Post #15 of 62 (4723 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

If that is all you need, I think it is far simpler:

If you have an OID, then all that is required is to write the
operations (delete this OID, insert this document, etc.) to a
separate disk file.

Once the file is permanently on disk, it is simple to just keep
playing the file back until it succeeds.

This is what we do in our search server.

I am not completely familiar with parallel reader, but in reading the
JavaDoc I don't see the benefit - since you have to write the
documents to both indexes anyway??? Why is it of any benefit to break
the document into multiple parts?

If you have OIDs available, parallel reader can be accomplished in a
far simpler and more efficient manner - we have a completely
federated server implementation that was trivial - less than 100 lines
of code. We did it more simply: we create a hash from the OID, store
the document in a different index depending on the hash, then run
the query across all indexes in parallel, joining the results.
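
A minimal sketch of the hash routing described above, not the actual
server code. The class name OidShardRouter and the use of
String.hashCode() are assumptions made for illustration; the parallel
query fan-out and the joining of results are not shown.

```java
import java.util.List;

/** Hypothetical sketch: pick one of N indexes for a document by hashing its OID. */
final class OidShardRouter {
    private final int shardCount;

    OidShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** floorMod keeps the result in [0, shardCount) even when hashCode() is negative. */
    int shardFor(String oid) {
        return Math.floorMod(oid.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        OidShardRouter router = new OidShardRouter(4);
        for (String oid : List.of("doc-1", "doc-2", "doc-3")) {
            System.out.println(oid + " -> index " + router.shardFor(oid));
        }
    }
}
```

Because the same OID always maps to the same index, a later delete or
update for that OID can be sent to that one index alone.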



chuck at manawiz

Jan 15, 2007, 10:38 PM

Post #16 of 62 (4711 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

robert engels wrote on 01/15/2007 08:11 PM:
> If that is all you need, I think it is far simpler:
>
> If you have an OID, then all that is required is to write the
> operations (delete this OID, insert this document, etc.) to a
> separate disk file.
>
> Once the file is permanently on disk, it is simple to just keep
> playing the file back until it succeeds.
There is no guarantee a given operation will ever succeed so this
doesn't work.
>
> This is what we do in our search server.
>
> I am not completely familiar with parallel reader, but in reading the
> JavaDoc I don't see the benefit - since you have to write the
> documents to both indexes anyway??? Why is it of any benefit to break
> the document into multiple parts?
I'm sure Doug had reasons to write it. My reason to use it is for fast
bulk updates, updating one subindex without having to update the others.
>
> If you have OIDs available, parallel reader can be accomplished in a
> far simpler and more efficient manner - we have a completely federated
> server implementation that was trivial - less than 100 lines of code. We
> did it more simply: we create a hash from the OID, store the document
> in a different index depending on the hash, then run the query across
> all indexes in parallel, joining the results.
Lucene has this built in via MultiSearcher and RemoteSearchable. It is
a bit more complex due to the necessity to normalize Weights, e.g. to
ensure the same docFreq's which reflect the union of all indexes are
used for the search in each.
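
To make the docFreq point concrete, here is a minimal sketch of summing
per-index statistics before scoring. GlobalDocFreq and ShardStats are
hypothetical names, not Lucene classes, and the idf form
1 + ln(numDocs / (docFreq + 1)) is used only as an example of a weight
that depends on these counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: combine per-index term statistics into global ones. */
final class GlobalDocFreq {
    /** Stand-in for what each sub-index would report about itself. */
    static final class ShardStats {
        final int maxDoc;
        final Map<String, Integer> docFreq;

        ShardStats(int maxDoc, Map<String, Integer> docFreq) {
            this.maxDoc = maxDoc;
            this.docFreq = docFreq;
        }
    }

    /** Sum each term's docFreq over all indexes. */
    static Map<String, Integer> sumDocFreqs(List<ShardStats> shards) {
        Map<String, Integer> total = new HashMap<>();
        for (ShardStats s : shards) {
            s.docFreq.forEach((term, df) -> total.merge(term, df, Integer::sum));
        }
        return total;
    }

    /** Total document count over all indexes. */
    static int sumMaxDoc(List<ShardStats> shards) {
        int n = 0;
        for (ShardStats s : shards) {
            n += s.maxDoc;
        }
        return n;
    }

    /** Example weight that depends on docFreq and the collection size. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }
}
```

If each index computed idf from its own counts, the same term would get
a different weight in each index and the merged scores would not be
comparable; computing it once from the summed counts avoids that.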

Federated searching addresses different requirements than
ParallelReader. Yes, I agree that ParallelReader could be done using
UID's, but believe it would be a considerably more expensive
representation to search. The method used in federated search to
distribute the same query to each index is not applicable. Breaking the
query up into parts that are applied against each parallel index, with
each query part referencing only the fields in a single parallel index,
would be a challenge with complex nested queries supporting all of the
operators, and much less efficient than ParallelReader. Modifying all
the primitive Query subclasses to use UIDs instead of doc-ids would
be an alternative, but would be a lot of work and not nearly as
efficient as the existing Lucene index representation that sorts
postings by doc-id.

To illustrate this, consider the simple query, f:a AND g:b, where f and
g are in two different parallel indexes. Performing the f and g
queries separately on the different indexes to get possibly very long
lists of results and then joining those by UID will be much slower than
BooleanQuery operating on ParallelReader with doc-id sorted postings.
The alternative of a UID-based BooleanQuery would have similar
challenges unless the postings were sorted by UID. But hey, that's
permanent doc-ids.
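
A minimal sketch of the two strategies for f:a AND g:b, using plain
arrays and lists rather than Lucene's postings classes.
ConjunctionSketch and its method names are hypothetical. The first
method relies on both lists being sorted by doc-id; the second has to
materialize one full result list in a hash set before it can join.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: AND over doc-id sorted postings vs. a join on UIDs. */
final class ConjunctionSketch {
    /** Linear merge of two ascending doc-id lists; needs no lookup structure. */
    static List<Integer> intersectSorted(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    /** Join of two unordered UID result lists; one side is copied into a set first. */
    static List<String> joinByUid(List<String> fHits, List<String> gHits) {
        Set<String> seen = new HashSet<>(fHits);
        List<String> out = new ArrayList<>();
        for (String uid : gHits) {
            if (seen.contains(uid)) {
                out.add(uid);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(intersectSorted(new int[] {1, 4, 7, 9}, new int[] {2, 4, 9, 12}));
        System.out.println(joinByUid(List.of("u9", "u1", "u4"), List.of("u4", "u2", "u9")));
    }
}
```

The merge can stream both postings lists and stop as soon as one is
exhausted, while the join must collect every f hit up front.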

Chuck



rengels at ix

Jan 15, 2007, 11:04 PM

Post #17 of 62 (4730 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

That is true, but you need to use the same techniques as any db. You
need to write a tx log file. This has the semantics that you know if
it has committed, just like a db. You check that it has committed
before writing anything to the actual index. Since Lucene does not
modify existing segments, it is trivial to restart if this portion fails.
Just delete the uncommitted segments on startup, and replay the tx log.
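
A minimal sketch of the tx log idea, assuming a line-oriented log with
BEGIN / COMMIT markers and a Map standing in for the real index. The
class TxLogSketch, the log format, and the ADD / DELETE operation names
are all hypothetical. A real log would also have to be forced to disk
before the commit is considered durable, which is not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tx log: only transactions that reached COMMIT are replayed. */
final class TxLogSketch {
    private final Path log;

    TxLogSketch(Path log) {
        this.log = log;
    }

    /** Convenience for demos: a log backed by a fresh temporary file. */
    static TxLogSketch onTempFile() {
        try {
            return new TxLogSketch(Files.createTempFile("txlog", ".log"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Append one transaction; commit=false simulates a crash before COMMIT. */
    void append(List<String> ops, boolean commit) {
        List<String> lines = new ArrayList<>();
        lines.add("BEGIN");
        lines.addAll(ops);
        if (commit) {
            lines.add("COMMIT");
        }
        try {
            Files.write(log, lines, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuild the stand-in index from the log, dropping uncommitted operations. */
    Map<String, String> replay() {
        Map<String, String> index = new TreeMap<>();
        List<String> pending = new ArrayList<>();
        List<String> lines;
        try {
            lines = Files.readAllLines(log, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String line : lines) {
            if (line.equals("BEGIN")) {
                pending.clear();            // discard any earlier uncommitted ops
            } else if (line.equals("COMMIT")) {
                for (String op : pending) {
                    apply(index, op);
                }
                pending.clear();
            } else {
                pending.add(line);
            }
        }
        return index;
    }

    /** "ADD oid text" stores a document; "DELETE oid" removes it. Both are idempotent. */
    private static void apply(Map<String, String> index, String op) {
        String[] parts = op.split(" ", 3);
        if (parts[0].equals("ADD") && parts.length == 3) {
            index.put(parts[1], parts[2]);
        } else if (parts[0].equals("DELETE") && parts.length >= 2) {
            index.remove(parts[1]);
        }
    }
}
```

Since the operations are keyed by OID and idempotent, replaying the same
committed transaction more than once leaves the same state.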

As for the ParallelReader, that doesn't make sense to me (though I
admit I don't understand the purpose), since the javadoc states
that all sub-indexes must be updated in the same manner. Where
does the benefit come from then? It seems you are actually performing
more operations (with 2 sub-indexes you are writing twice as many
documents - same amount of field data though). Is there some other
information besides the javadoc that explains the usage/benefit?

Using a federated search where different fields are in different
indexes would be very difficult as you state, and involve long join
lists (and the scoring logic is VERY difficult unless you create a
new "memory index" containing all the results, and then run the
complete query against this).

Putting the documents in different indexes and joining/weighing the
results is rather easy and works quite well.




DORONC at il

Jan 15, 2007, 11:10 PM

Post #18 of 62 (4717 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

The problem Ning pointed out seems to stem from the two roles of
IndexReader:
(1) reading (read only) the Index for searching and for inspecting its
content;
(2) modifying the index by deleting documents;

This is further complicated by the fact that often a reader is used for
search and then returned docs are deleted by docid.

Perhaps one possibility is to define DocumentDeleter as a subclass of
IndexReader searcher. It would always open the topmost generation. It
would (as today) fail to delete if it is not the topmost generation. It
would support search, but would be recommended for use only for update
purposes. Mmmm... It is becoming too complex I'm afraid.

So a better (?) option:
(1) add deleteByTerm() (and deleteByQuery()) to IndexWriter (like
NewIndexModifier) - these deletion methods would then be performed on
the topmost generation, same as addDocument();
(2) IndexReader delete() methods would fail (as today) if the reader is
not on the topmost generation - so they would only work when all
previous changes were committed (which is always true if an application
is using the default auto commit).
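
A minimal sketch of this option, using an in-memory list in place of a
real index. GenerationSketch, its Reader class, and the substring match
used for "term" are hypothetical simplifications; only the generation
check is the point here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: writer-side deletes always work, reader deletes need the topmost generation. */
final class GenerationSketch {
    private long generation = 0;                  // bumped on every change
    private final List<String> docs = new ArrayList<>();

    /** Writer-side add: always applied to the newest generation. */
    void addDocument(String text) {
        docs.add(text);
        generation++;
    }

    /** Writer-side delete: removes every doc containing the term (substring match as a stand-in). */
    int deleteByTerm(String term) {
        int before = docs.size();
        docs.removeIf(d -> d.contains(term));
        int removed = before - docs.size();
        if (removed > 0) {
            generation++;
        }
        return removed;
    }

    int size() {
        return docs.size();
    }

    Reader openReader() {
        return new Reader(generation);
    }

    /** A reader remembers the generation it was opened on. */
    final class Reader {
        private long openedAt;

        Reader(long openedAt) {
            this.openedAt = openedAt;
        }

        boolean isCurrent() {
            return openedAt == generation;
        }

        /** Delete by doc-id; refused if the index has changed since this reader was opened. */
        void delete(int docId) {
            if (!isCurrent()) {
                throw new IllegalStateException("reader is stale; re-open before deleting");
            }
            docs.remove(docId);                   // List.remove(int index)
            openedAt = ++generation;              // this reader made the change, so it stays current
        }
    }
}
```

The stale check is what protects a doc-id based delete from hitting the
wrong document after a writer has changed the index underneath it.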

One comment about permanent IDs (PIDs) - I think that Lucene's choice to
not maintain PIDs on behalf of applications is the right way to go. For
efficiency, even if PIDs were maintained by Lucene, internal changing IDs
would exist and low level operations would use those IDs. But in addition
Lucene would need to maintain the mapping between the two - IDs and PIDs -
and notify an application adding a doc what PID was assigned to it, etc.
Seems better to leave this for applications.

Doron

Chuck Williams <chuck [at] manawiz> wrote on 15/01/2007 21:49:05:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.
>
> I have UID's but they cannot ensure the latter property, i.e. they
> cannot ensure side-effect-free rollbacks.
>
> Yes, if you have no reliance on internal Lucene structures like doc-id's
> and segments, then that shouldn't matter. But many capabilities have
> such reliance for good reasons. E.g., ParallelReader, which is a public
> supported class in Lucene, requires doc-id synchronization. There are
> similar good reasons for an application to take advantage of doc-ids.
>
> Lucene uses doc-id's in many of its API's and so it is not surprising
> that many applications rely on them, and I'm sure misuse them not fully
> understanding the semantics and uncertainties of doc-id changes due to
> merging segments with deletes.
>
> Applications can use doc-ids for legitimate and beneficial purposes
> while remaining semantically valid. Making such capabilities efficient
> and robust in all cases is facilitated by application control over when
> doc-id's and segment structure change at a granularity larger than the
> single Document.
>
> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.
>
> Chuck
>
>
> robert engels wrote on 01/15/2007 07:34 PM:
> > I honestly think that having a unique OID as an indexed field and
> > putting a layer on top of Lucene is the best solution to all of this.
> > It makes it almost trivial, and you can implement transaction handling
> > in a variety of ways.
> >
> > Attempting to make the doc ids "permanent" is a tough challenge,
> > considering the orignal design called for them to be "non permanent".
> >
> > It seems doubtful that you cannot have some sort of primary key any
> > way and be this concerned about the transactional nature of Lucene.
> >
> > I vote -1 on all of this. I think it will detract from the simple and
> > efficient storage mechanism that Lucene uses.
> >
> > On Jan 15, 2007, at 11:19 PM, Chuck Williams wrote:
> >
> >> Ning Li wrote on 01/15/2007 06:29 PM:
> >>> On 1/14/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> >>>> * The "support deleteDocuments in IndexWriter" (LUCENE-565)
feature
> >>>> could have a more efficient implementation (just like Solr) when
> >>>> autoCommit is false, because deletes don't need to be flushed
> >>>> until commit() is called. Whereas, now, they must be
aggressively
> >>>> flushed on each checkpoint.
> >>>
> >>> If a reader can only open snapshots both for search and for
> >>> modification, I think another change is needed besides the ones
> >>> listed: assume the latest snapshot is segments_5 and the latest
> >>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
> >>> snapshot segments_5, performs a few deletes and writes a new
> >>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
> >>> the 2 new segments which are in segmentsx_7 but not in segments_5.
> >>> Such segments to include are easily identifiable only if they are not
> >>> merged with segments in the latest snapshot... All these won't be
> >>> necessary if a reader always opens the latest checkpoint for
> >>> modification, which will also support deletion of non-committed
> >>> documents.
> >> This problem seems worse. I don't see how a reader and a writer can
> >> independently compute and write checkpoints. The adds in the writer
> >> don't just create new segments, they replace existing ones through
> >> merging. And the merging changes doc-ids by expunging deletes. It
> >> seems that all deletes must be based on the most recent checkpoint, or
> >> merging of checkpoints to create the next snapshot will be considerably
> >> more complex.
> >>
> >> Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


lucene at mikemccandless

Jan 16, 2007, 3:10 AM

Post #19 of 62 (4712 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

OK, catching up here and trying to merge threads together otherwise
I'm going to lose my mind!:

Chuck Williams wrote:
>
> Ning Li wrote:
>>
>> If a reader can only open snapshots both for search and for
>> modification, I think another change is needed besides the ones
>> listed: assume the latest snapshot is segments_5 and the latest
>> checkpoint is segmentsx_7 with 2 new segments, then a reader opens
>> snapshot segments_5, performs a few deletes and writes a new
>> checkpoint segmentsx_8. The summary file segmentsx_8 should include
>> the 2 new segments which are in segmentsx_7 but not in segments_5.
>> Such segments to include are easily identifiable only if they are not
>> merged with segments in the latest snapshot... All these won't be
>> necessary if a reader always opens the latest checkpoint for
>> modification, which will also support deletion of non-committed
>> documents.
>>
> This problem seems worse. I don't see how a reader and a writer can
> independently compute and write checkpoints. The adds in the writer
> don't just create new segments, they replace existing ones through
> merging. And the merging changes doc-ids by expunging deletes. It
> seems that all deletes must be based on the most recent checkpoint, or
> merging of checkpoints to create the next snapshot will be considerably
> more complex.

Good catch Ning! And, I agree, when a reader plans to make
modifications to the index, I think the best solution is to require
that the reader has opened most recent "segments*_N" (be that a
snapshot or a checkpoint). Really a reader is actually a "writer" in
this context. This means we need a way to open a reader against the
most recent checkpoint as well (I will add that).

This is very much consistent with how a reader now checks if it is
still current when someone first tries to change a del/norm: if it's
not still current (ie, another writer has written a new segments_N
file) then an IOException is raised with "IndexReader out of date and
no longer valid for delete, undelete, or setNorm operations". I think
with explicit commits that same requirement & check would apply.
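
The staleness check described above can be sketched as a toy model. This is not Lucene code: ToyIndex, ToyReader and the generation counter are hypothetical stand-ins for the index directory and the N in segments*_N; only the exception message is taken from the text above.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Toy stand-in for an index directory (hypothetical): it only tracks the
// generation N of the most recently written segments*_N file.
class ToyIndex {
    private final AtomicLong latestGeneration = new AtomicLong(1);

    long latestGeneration() {
        return latestGeneration.get();
    }

    // Called whenever any writer publishes a new segments*_N file.
    long publishNewGeneration() {
        return latestGeneration.incrementAndGet();
    }
}

// Toy reader (hypothetical): remembers the generation it opened and refuses
// to modify the index once a newer generation exists.
class ToyReader {
    private final ToyIndex index;
    private final long openedGeneration;

    ToyReader(ToyIndex index) {
        this.index = index;
        this.openedGeneration = index.latestGeneration();
    }

    boolean isCurrent() {
        return openedGeneration == index.latestGeneration();
    }

    void deleteDocument(int docId) throws IOException {
        if (!isCurrent()) {
            throw new IOException("IndexReader out of date and no longer valid "
                + "for delete, undelete, or setNorm operations");
        }
        // A real reader would record the deletion here; the toy does nothing.
    }
}
```

In this sketch a reader that was opened before publishNewGeneration() was called can still be used for searching, but its first attempt to delete fails, which is the behaviour the paragraph above wants to preserve under explicit commits.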



Chuck Williams wrote:

> My interest is transactions, not making doc-id's permanent.
> Specifically, the ability to ensure that a group of adds either all go
> into the index or none go into the index, and to ensure that if none go
> into the index that the index is not changed in any way.

Right, I see "explicit commits" as a very simple implementation to
provide a powerful base functionality to Lucene. This base
functionality can indeed enable or make easier/more performant many
neat things above it (the permanent docids discussion, Chuck's highly
performant ParallelWriter, delayed flushing of pending deletes, etc)
but I'd like to keep a clean separation and focus first only on making
the most minimal yet self-contained "explicit commits" work and then
separately build out on top of it. Progress not perfection!
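
As a rough illustration of the "all or none" behaviour Chuck asks for, here is a minimal single-threaded sketch. It is a toy model, not the proposed implementation: ToyCommitWriter, abort() and openSnapshot() are hypothetical names, and documents are plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only (hypothetical class): the "index" is a list of strings.
class ToyCommitWriter {
    // The last committed snapshot, which is what newly opened readers see.
    private List<String> committed = new ArrayList<>();
    // Adds made since the last commit; invisible to readers.
    private final List<String> uncommitted = new ArrayList<>();

    void addDocument(String doc) {
        uncommitted.add(doc);
    }

    // Publish all pending adds together as one new point-in-time snapshot.
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(uncommitted);
        committed = next; // older snapshots stay untouched
        uncommitted.clear();
    }

    // Drop all pending adds; the last snapshot is not changed in any way.
    void abort() {
        uncommitted.clear();
    }

    // What a newly opened reader would search.
    List<String> openSnapshot() {
        return Collections.unmodifiableList(committed);
    }
}
```

Because commit() builds a new list instead of mutating the old one, a reader holding an earlier snapshot keeps seeing exactly the documents that were committed when it opened.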


Doron Cohen wrote:

> As a database application, to my understanding the (newly suggested)
> transaction support in Lucene is single tx. I can't see how multiple
> tx can be done within Lucene (and I don't think it should be
> done). Even if it was possible, I think indexing would become very
> inefficient. I think the motivation for adding (some) tx support is
> different, and tx support would be minimal, definitely not multiple
> tx.

Ning Li wrote:

> Lastly, hopefully the term "transaction" won't cause any confusion
> since this "explicit commit" is much simpler than database
> transaction where a database can guarantee the ACID properties for
> each of multiple concurrent transactions.

I agree "explicit commits" is in fact a reduced version of the more
general ACID transactions that relational DBs provide. I really don't
want to call it "transactions" for this reason: that label would
automatically oversell the capability, then only to later disappoint
our users. Always best to "under promise and over deliver" and the
label "transactions" would do just the reverse. But yes explicit
commits is basically a "single transaction".



> If I had a vote it would be +1 on the direction Michael has proposed,
> assuming it can be done robustly and without performance penalty.

I don't anticipate any performance issues. The implementation is so
amazingly trivial! The only index format change is a new name for
those segments_N files that were just the automatic checkpoints that
Lucene does. Otherwise the index format is unchanged. And then
additional logic for a reader/writer to decide which one of these to
read/write.

The only really "interesting" change is to the IndexFileDeleter: it
now must be more careful in how it figures out which index files are
safe to delete (this is the part I'm working on now). I will
definitely test performance (with the new benchmarking suite!) but I
don't expect any changes for the better or worse with just "explicit
commits".
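
One way to picture the deleter's job is as a set difference: a file is safe to delete only if no retained segments file (snapshot or checkpoint) still references it. The sketch below is an assumption about the general shape of that check, not the actual IndexFileDeleter logic; ToyFileDeleter, its method and the file names are hypothetical, and which commit points to retain is left to the caller.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class ToyFileDeleter {
    // Returns the files in the directory that none of the retained
    // snapshots or checkpoints reference.
    static Set<String> deletable(Collection<String> filesInDirectory,
                                 Collection<? extends Collection<String>> retainedPoints) {
        Set<String> referenced = new HashSet<>();
        for (Collection<String> point : retainedPoints) {
            referenced.addAll(point);
        }
        // Sorted copy so the result is deterministic.
        Set<String> result = new TreeSet<>(filesInDirectory);
        result.removeAll(referenced);
        return result;
    }
}
```

For example, with files _1.cfs through _4.cfs on disk, a snapshot that references _1.cfs and _2.cfs, and a checkpoint that references _2.cfs and _4.cfs, only _3.cfs would be reported as deletable.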

The things that then become possible once you have explicit commits
should give us good potential performance improvements, error
recoverability, etc. in the future. But that's the future and I'm
focusing on "now" :)

Mike



lucene at mikemccandless

Jan 16, 2007, 3:12 AM

Post #20 of 62 (4728 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Chuck Williams wrote:
> Michael McCandless wrote on 01/15/2007 01:49 AM:
>> Chuck,
>>
>>> Possibly related, one of the ways I improved concurrency in
>>> ParallelWriter was to break up IndexWriter.addDocument() into one method
>>> to invert the document and create a RAMSegment and a second method that
>>> takes the RAMSegment and merges it into the index. This allows
>>> inversions to be processed in parallel, while merging is already a
>>> critical section. (Side thought: I've been wondering how hard it would
>>> be to make merging not a critical section). I had thought of the method
>>> to take the RAMSegment and merge it to be the "commit" part of
>>> addDocument().
>>>
>>> Your notion of commit is much better and more flexible, but perhaps you
>>> could include this inversion/merge separation as well?
>>
>> I'm a little confused on what this would mean? Do you mean opening up
>> separate public methods: one to invert (and get a segment back) and
>> one to append (and possibly merge) a segment to the index (keeping the
>> existing addDocument that would then just call these two)? How would
>> this buy you more concurrency (since the current method indeed only
>> synchronizes around the merge part)? Oh: would you behind the scenes
>> take each "single doc" segment and pre-merge them privately,
>> concurrently, possibly up to many levels, and then finally
>> add the merged segment into the index? Ie, the beginnings of
>> "concurrent merge" described above?
>>
>> Actually couldn't we do this change today (ie without waiting for
>> explicit commits)? It seems like a separable change.
>
> Yes, I've already made this change so it is independent, creating
> invertDocument(), addInvertedDocument() and abortInvertedDocument().
> This enables more concurrency in ParallelWriter because there are no
> synchronization restrictions at all on calling invertDocument().
> addInvertedDocument() has a synchronization requirement: it can be
> called in parallel for each subdocument corresponding to the same
> document, but not for subdocuments corresponding to different documents
> as this could break the required parallel subindex doc-id
> correspondence. Because addDocument() (which is just
> addInvertedDocument(invertDocument())) contains the call to
> addInvertedDocument() it has the same synchronization requirement,
> preventing the extra parallelism in the invertDocument() calls.
>
> It seemed to me that this could be related to your explicit-commits
> idea since it also breaks up writes into an uncommitted local portion
> and committed portion.

Ahh I think I see: you needed to tease out that fine detail on what
synchronization is actually required (the fact that sub-documents can
be done entirely in parallel, but cross-documents cannot). And the
sub-documents indeed give you excellent concurrency (if you make lots
of sub-documents) on boxes that have the CPU resources to allocate.
This is a neat change, but I think separate from explicit commits
so I think we should keep them decoupled at this point.
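
For readers following along, the inversion/add split can be sketched roughly as below. Only the method names invertDocument() and addInvertedDocument() come from Chuck's description; the class ToyInvertingWriter, the method bodies and the indexAll() helper are hypothetical, and "inverting" here just means producing a sorted term list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: one "segment" per document, held in memory.
class ToyInvertingWriter {
    private final List<List<String>> segments = new ArrayList<>();

    // Touches no shared state, so any number of threads may call it at once.
    static List<String> invertDocument(String text) {
        List<String> terms = new ArrayList<>(Arrays.asList(text.toLowerCase().split("\\s+")));
        Collections.sort(terms);
        return terms;
    }

    // Critical section: only one thread at a time may change the index.
    synchronized void addInvertedDocument(List<String> inverted) {
        segments.add(inverted);
    }

    synchronized int docCount() {
        return segments.size();
    }

    // Inverts all documents in parallel, then adds them one at a time.
    static ToyInvertingWriter indexAll(List<String> docs, int threads) throws Exception {
        ToyInvertingWriter writer = new ToyInvertingWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String doc : docs) {
                Callable<List<String>> task = () -> invertDocument(doc);
                pending.add(pool.submit(task));
            }
            // Add in submission order so doc-ids stay deterministic.
            for (Future<List<String>> inverted : pending) {
                writer.addInvertedDocument(inverted.get());
            }
        } finally {
            pool.shutdown();
        }
        return writer;
    }
}
```

The expensive part (inversion) scales with the number of threads, while the short synchronized add keeps the ordering guarantee that parallel subindexes depend on.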

Mike



ning.li.li at gmail

Jan 16, 2007, 8:55 AM

Post #21 of 62 (4712 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

On 1/16/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> Good catch Ning! And, I agree, when a reader plans to make
> modifications to the index, I think the best solution is to require
> that the reader has opened most recent "segments*_N" (be that a
> snapshot or a checkpoint). Really a reader is actually a "writer" in
> this context. This means we need a way to open a reader against the
> most recent checkpoint as well (I will add that).
>
> This is very much consistent with how a reader now checks if it is
> still current when someone first tries to change a del/norm: if it's
> not still current (ie, another writer has written a new segments_N
> file) then an IOException is raised with "IndexReader out of date and
> no longer valid for delete, undelete, or setNorm operations". I think
> with explicit commits that same requirement & check would apply.

This means a reader can open a checkpoint for search. But the purpose
of "explicit commits" is that only snapshots are opened for search,
not checkpoints. Can we just trust applications won't open a
checkpoint for search? Or should we explicitly guard against it?

Cheers,
Ning



yonik at apache

Jan 16, 2007, 9:05 AM

Post #22 of 62 (4717 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

On 1/15/07, Chuck Williams <chuck [at] manawiz> wrote:
> (Side thought: I've been wondering how hard it would
> be to make merging not a critical section).

It would be very nice if segment merging didn't block the addition of
new documents... it really doesn't need to. I don't think it would be
too hard to fix, but I haven't had the time to tackle it.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



ning.li.li at gmail

Jan 16, 2007, 9:24 AM

Post #23 of 62 (4727 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

On 1/16/07, Yonik Seeley <yonik [at] apache> wrote:
> On 1/15/07, Chuck Williams <chuck [at] manawiz> wrote:
> > (Side thought: I've been wondering how hard it would
> > be to make merging not a critical section).
>
> It would be very nice if segment merging didn't block the addition of
> new documents... it really doesn't need to. I don't think it would be
> too hard to fix, but I haven't had the time to tackle it.

I've had one working for a while now. It's based on LUCENE-565.
Segment merging does not block addition or deletion of documents.

Cheers,
Ning



lucene at mikemccandless

Jan 16, 2007, 12:13 PM

Post #24 of 62 (4724 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Ning Li wrote:
> On 1/16/07, Michael McCandless <lucene [at] mikemccandless> wrote:
>> Good catch Ning! And, I agree, when a reader plans to make
>> modifications to the index, I think the best solution is to require
>> that the reader has opened most recent "segments*_N" (be that a
>> snapshot or a checkpoint). Really a reader is actually a "writer" in
>> this context. This means we need a way to open a reader against the
>> most recent checkpoint as well (I will add that).
>>
>> This is very much consistent with how a reader now checks if it is
>> still current when someone first tries to change a del/norm: if it's
>> not still current (ie, another writer has written a new segments_N
>> file) then an IOException is raised with "IndexReader out of date and
>> no longer valid for delete, undelete, or setNorm operations". I think
>> with explicit commits that same requirement & check would apply.
>
> This means a reader can open a checkpoint for search. But the purpose
> of "explicit commits" is that only snapshots are opened for search,
> not checkpoints. Can we just trust applications won't open a
> checkpoint for search? Or should we explicitly guard against it?

Ahh good point.

I think I'll add "openForWriting(*)" static methods to IndexReader.
These will acquire the write lock, and will open the latest
segments*_N (commit or checkpoint). This way you can't open a
checkpoint unless there are no other writers on the index.
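
A minimal sketch of that idea, under stated assumptions: openForWriting is the name proposed above, but ToyDirectory, ToyLockingReader, the AtomicBoolean lock and the generation numbers (borrowed from Ning's segments_5 / segmentsx_7 example) are hypothetical and only model the locking rule.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy directory (hypothetical): one write lock plus the latest generations.
class ToyDirectory {
    final AtomicBoolean writeLock = new AtomicBoolean(false);
    long latestCommit = 5;     // e.g. segments_5
    long latestCheckpoint = 7; // e.g. segmentsx_7
}

// Toy reader (hypothetical) that may or may not hold the write lock.
class ToyLockingReader implements AutoCloseable {
    final long generation;
    private final ToyDirectory dir;
    private boolean holdsWriteLock;

    private ToyLockingReader(ToyDirectory dir, long generation, boolean holdsWriteLock) {
        this.dir = dir;
        this.generation = generation;
        this.holdsWriteLock = holdsWriteLock;
    }

    // Opening for search: only the latest commit (snapshot) is visible.
    static ToyLockingReader open(ToyDirectory dir) {
        return new ToyLockingReader(dir, dir.latestCommit, false);
    }

    // Opening for modification: take the write lock first, then open the
    // latest segments*_N, whether it is a commit or a checkpoint.
    static ToyLockingReader openForWriting(ToyDirectory dir) throws IOException {
        if (!dir.writeLock.compareAndSet(false, true)) {
            throw new IOException("write lock is held by another writer");
        }
        long latest = Math.max(dir.latestCommit, dir.latestCheckpoint);
        return new ToyLockingReader(dir, latest, true);
    }

    // Releases the write lock if this reader holds it; safe to call twice.
    @Override
    public void close() {
        if (holdsWriteLock) {
            holdsWriteLock = false;
            dir.writeLock.set(false);
        }
    }
}
```

Since the lock is taken before the latest generation is read, no other writer can publish a newer checkpoint between the two steps.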

We could go further and have IndexSearcher not accept an IndexReader
opened against a checkpoint, but I'm inclined not to check for
(prevent) this, for starters. I'd rather not preclude possibly
interesting future use cases too early.

Mike



DORONC at il

Jan 16, 2007, 12:23 PM

Post #25 of 62 (4720 views)
Permalink
Re: adding "explicit commits" to Lucene? [In reply to]

Michael McCandless <lucene [at] mikemccandless> wrote on 16/01/2007
12:13:47:

> Ning Li wrote:
> > On 1/16/07, Michael McCandless <lucene [at] mikemccandless> wrote:
> >> Good catch Ning! And, I agree, when a reader plans to make
> >> modifications to the index, I think the best solution is to require
> >> that the reader has opened most recent "segments*_N" (be that a
> >> snapshot or a checkpoint). Really a reader is actually a "writer" in
> >> this context. This means we need a way to open a reader against the
> >> most recent checkpoint as well (I will add that).
> >>
> >> This is very much consistent with how a reader now checks if it is
> >> still current when someone first tries to change a del/norm: if it's
> >> not still current (ie, another writer has written a new segments_N
> >> file) then an IOException is raised with "IndexReader out of date and
> >> no longer valid for delete, undelete, or setNorm operations". I think
> >> with explicit commits that same requirement & check would apply.
> >
> > This means a reader can open a checkpoint for search. But the purpose
> > of "explicit commits" is that only snapshots are opened for search,
> > not checkpoints. Can we just trust applications won't open a
> > checkpoint for search? Or should we explicitly guard against it?
>
> Ahh good point.
>
> I think I'll add "openForWriting(*)" static methods to IndexReader.
> These will acquire the write lock, and will open the latest
> segments*_N (commit or checkpoint). This way you can't open a
> checkpoint unless there are no other writers on the index.
>
> We could go further and have IndexSearcher not accept an IndexReader
> opened against a checkpoint, but I'm inclined not to check for
> (prevent) this, for starters. I'd rather not preclude possibly
> interesting future use cases too early.

Would this block applications that first perform a search in order to
decide which docs to delete by docid?

What about the two other options described in
http://article.gmane.org/gmane.comp.jakarta.lucene.devel/16581?

>
> Mike
>


