Mailing List Archive: Lucene: Java-Dev

Lucene 2.1, soon

 

 



yonik at apache

Jan 16, 2007, 9:16 AM

Post #1 of 58 (6191 views)
Permalink
Lucene 2.1, soon

Lucene 2.1 has been a long time in coming, but I think we should plan
on making a release when the file format changes settle down.

After that, I think we should start making more frequent releases,
which should make many people's lives easier by
1) giving people something more recent to work from w/o having to pick a
trunk version themselves.
2) making it easier to declare the trunk developmental, so developers
only need to be concerned with backward compatibility between actual
releases.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


gsingers at apache

Jan 16, 2007, 9:26 AM

Post #2 of 58 (6101 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

+1

Was thinking the same thing this morning. The changes.txt 2.1
section is getting quite long.

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





otis_gospodnetic at yahoo

Jan 16, 2007, 9:48 AM

Post #3 of 58 (6097 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Same here. As soon as the file format changes settle down.

Otis



lucene at mikemccandless

Jan 16, 2007, 10:04 AM

Post #4 of 58 (6099 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

+1 for releasing 2.1 soon.

I hope to get explicit commits (LUCENE-710) working, which has a tiny
file format change, and LUCENE-773 (deprecate FSDirectory.getDirectory
methods that take a create arg) completed soon, so we can get them
into 2.1, if possible.

Also +1 on more frequent releases after 2.1!

Mike



lucene at mikemccandless

Jan 17, 2007, 2:48 AM

Post #5 of 58 (6094 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Michael McCandless wrote:
> +1 for releasing 2.1 soon.
>
> I hope to get explicit commits (LUCENE-710) working, which has a tiny
> file format change, and LUCENE-773 (deprecate FSDirectory.getDirectory
> methods that take a create arg) completed soon, so we can get them
> into 2.1, if possible.

Given the healthy discussion and ongoing design iterations for
"explicit commits", I no longer think we should hold back 2.1 for
this.

Correct NFS behaviour (LUCENE-710) will have to wait for at least one
more release.

I will work through LUCENE-773 next.

Are there any other open JIRA issues that we feel should block 2.1?
Last time (2.0 release) Doug suggested voting in JIRA which makes
sense:

http://www.gossamer-threads.com/lists/lucene/java-dev/34810

So if anyone sees open issues in JIRA that they feel should be fixed
before we release 2.1, please go vote for them in JIRA, and then we
should mark such bugs with 2.1 Fix Version and fix them!

> Also +1 on more frequent releases after 2.1!

Mike



gsingers at apache

Jan 17, 2007, 3:42 AM

Post #6 of 58 (6113 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

I think since we have already made some file format changes, we
should consider some of the others on the table, namely
https://issues.apache.org/jira/browse/LUCENE-510
which concerns proper UTF-8 storage. The big issue with this one
seems to be performance (and the patch needs to be updated), but, as
Marvin has stated, it is what would allow us to do the Kino merge
model, if desired, and would provide better compatibility w/ our
sibling projects (an important consideration, but one that should not
be the driver).

It has 3 votes currently which isn't the most, but is 3 more than
most issues have.

Also, I'm curious as to how many people use NFS in live systems.

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org



chuck at manawiz

Jan 17, 2007, 10:15 AM

Post #7 of 58 (6092 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Grant Ingersoll wrote on 01/17/2007 01:42 AM:
> Also, I'm curious as to how many people use NFS in live systems.
>

I've got the requirement to support large indexes and collections of
indexes on NAS devices, which from Linux pretty much means NFS or CIFS.

This doesn't seem unusual.

Chuck




marvin at rectangular

Jan 17, 2007, 12:21 PM

Post #8 of 58 (6113 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 17, 2007, at 3:42 AM, Grant Ingersoll wrote:

> I think since we have already made some file format changes, we
> should consider some of the others on the table, namely
> https://issues.apache.org/jira/browse/LUCENE-510
> which concerns proper UTF-8 storage. The big issue with this one
> seems to be performance (and the patch needs to be updated) but it,
> as Marvin has stated, is what would allow us to do the Kino merge
> model, if desired, and would provide better compatibility w/ our
> sibling projects (an important consideration, but should not be the
> driver.)

I'm pleased to see you bring these to the fore, as they're issues I
care about and have spent considerable time on. However, I would not
hold up a 2.1 release for either proper UTF-8 or bytecount strings.

In addition to the performance considerations, the patch as it
currently stands completely destroys backwards compatibility -- only
indexes consisting of pure ASCII source material created pre-patch
are still able to be read post-patch.

Switching to official UTF-8 on its own is not backwards compatible,
as discussed here: <http://xrl.us/uawo> (link to
mail-archives.apache.org).

Switching to bytecount based strings is likewise a real headache for
backwards compat. We might have to do something like subclass
IndexInput and IndexOutput and choose a version based on segment
format. Even then, it's tricky because of how to deal with string
diffs.

I promise to update the bytecounts/utf8 patch after KS 0.20_01 is
done, but I can't get to it before that. There's a lot of pressure
on me to get a new version of KS out the door.

Your more general point about batching up file format changes
reflects what I've always thought, but I wonder... Doug has laid out
a backwards compatibility policy about always reading stuff written
one major version back. It occurs to me that the more frequently
major versions get released, the more quickly we can dispense with
crufty compatibility code. :)

> Also, I'm curious as to how many people use NFS in live systems.

KS has the same problems Lucene does, and it's a common enough
complaint that I've added an FAQ item. It's an important issue.

However, I don't have the faintest idea how to solve it.

So unless someone comes up with something simple and brilliant, I
don't think it should stand in the way, either.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





lucene at mikemccandless

Jan 17, 2007, 1:16 PM

Post #9 of 58 (6086 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Marvin Humphrey wrote:

> On Jan 17, 2007, at 3:42 AM, Grant Ingersoll wrote:
>>
>> Also, I'm curious as to how many people use NFS in live systems.
>
> KS has the same problems Lucene does, and it's a common enough complaint
> that I've added an FAQ item. It's an important issue.

I agree it's important. Our users naturally assume KS and Lucene are
usable over NFS (or any filesystem). And NFS is obviously very common
since it's the standard remote filesystem for Unix.

The less we rely on filesystem specifics such as "what happens if you
delete an open file", the more portable we will be.

> However, I don't have the faintest idea how to solve it.
>
> So unless someone comes up with something simple and brilliant, I don't
> think it should stand in the way, either.

This is the solution I have in mind for LUCENE-710: change the
IndexFileDeleter so that instead of always immediately deleting the
last commit when a new commit happens, allow some time before doing
so. This way readers have a chance to refresh. The actual time would
be settable by the developer. So if you set it to 6 hours, then, a
commit would remain usable for at least 6 hours after it had been
obsoleted by a new commit. This means if you can ensure your readers
refresh within 6 hours of a new commit happening, then the writer will
never delete an "in-use" commit.
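
The grace-period idea above can be sketched roughly as follows. This is
only an illustration of the timing rule, not the LUCENE-710 patch:
ObsoleteCommit, TimedRetention and their fields are hypothetical names
that do not exist in Lucene, and the clock value is passed in by the
caller.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Lucene.
final class ObsoleteCommit {
    final String segmentsFileName;  // e.g. "segments_5"
    final long obsoletedAtMillis;   // when a newer commit replaced this one

    ObsoleteCommit(String segmentsFileName, long obsoletedAtMillis) {
        this.segmentsFileName = segmentsFileName;
        this.obsoletedAtMillis = obsoletedAtMillis;
    }
}

final class TimedRetention {
    private final long retentionMillis;

    TimedRetention(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Returns the obsolete commits whose grace period has fully elapsed.
    List<ObsoleteCommit> deletable(List<ObsoleteCommit> obsolete, long nowMillis) {
        List<ObsoleteCommit> result = new ArrayList<>();
        for (ObsoleteCommit c : obsolete) {
            if (nowMillis - c.obsoletedAtMillis >= retentionMillis) {
                result.add(c);
            }
        }
        return result;
    }
}
```

With a 6 hour window, a commit obsoleted 7 hours ago is returned as
deletable, while one obsoleted 2 hours ago is kept.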

I don't think we should necessarily hold up 2.1 for this change,
though.

Mike



marvin at rectangular

Jan 18, 2007, 11:24 AM

Post #10 of 58 (6085 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 17, 2007, at 1:16 PM, Michael McCandless wrote:

> This is the solution I have in mind for LUCENE-710: change the
> IndexFileDeleter so that instead of always immediately deleting the
> last commit when a new commit happens, allow some time before doing
> so. This way readers have a chance to refresh. The actual time would
> be settable by the developer. So if you set it to 6 hours, then, a
> commit would remain usable for at least 6 hours after it had been
> obsoleted by a new commit. This means if you can ensure your readers
> refresh within 6 hours of a new commit happening, then the writer will
> never delete an "in-use" commit.

I've been mulling this over. If you set the interval to 6 hours, and
there's a lot of churn (e.g. if you optimize frequently), you'll end
up with a lot of wasted disk space. On the flip side, the user has
to set up some sort of trigger for refreshing the IndexReaders
anyway. It's still not user-friendly by default, and we'd be
polluting the API with a hateful workaround.

The real problem is NFS. For background, see
<http://nfs.sourceforge.net/#section_d>, item D2, which deals with NFS
and "delete on last close".

Now I wonder. Version 4 of the NFS protocol introduces state, so
it's possible to implement file locking. Can we lock a segments
file, then have IndexFileDeleter detect which segments are locked
that way? And if that's the case, can we detect whether the locking
mechanism is failing and throw an exception if someone tries to use
an earlier version of NFS?

I'd be cool with making it impossible to put an index on an NFS
volume prior to version 4. That puts the blame where it belongs.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





marvin at rectangular

Jan 18, 2007, 1:40 PM

Post #11 of 58 (6091 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

I wrote:
> I'd be cool with making it impossible to put an index on an NFS
> volume prior to version 4.

Elaborating and clarifying...

IndexReader attempts to establish a read lock on the relevant
segments_N file. It doesn't bother to see whether the locking
attempt succeeds, though.

IndexFileDeleter, before deleting any files, always touches a test
file, attempts to lock it, and verifies that the lock succeeds. If
the locking test fails, it throws an exception rather than proceed.

In addition, the locking test is run at index creation time, so that
the user knows as soon as possible that their index is in a
problematic location.
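
A minimal sketch of the "touch a test file, lock it, verify" check
described above, using the standard java.nio locking calls. The class
name LockProbe and the file name "lock.probe" are made up for
illustration. Note the limitation: a successful tryLock() only shows
that this client could obtain a lock; it does not by itself prove that
locks are honored across every NFS client.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Lucene.
final class LockProbe {
    // Creates a probe file in indexDir, tries to lock it, and throws
    // if the lock cannot be obtained. The probe file is removed afterwards.
    static void verifyLockingWorks(Path indexDir) throws IOException {
        Path probe = indexDir.resolve("lock.probe");
        try (RandomAccessFile raf = new RandomAccessFile(probe.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            // tryLock() returns null if another process holds the lock and
            // may throw IOException where locking is unsupported.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                throw new IOException("could not lock " + probe);
            }
            lock.release();
        } finally {
            Files.deleteIfExists(probe);
        }
    }
}
```

Running this once at index creation time and once before any delete
operation matches the two checkpoints proposed above.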

I think the only way this would fail under NFS is if the client
machine with the reader is using NFS version 3, while the machine
with the writer is using version 4. But before this issue arose I
didn't have that much experience with the intricacies of NFS, so I
could be off-base.

This does bring back the permissions issue with IndexReader. A
search app may not have permission to establish a read lock on a file
within the index directory, and in that case, an IndexFileDeleter
could delete files out from under it.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





chuck at manawiz

Jan 18, 2007, 1:58 PM

Post #12 of 58 (6112 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

How about a direct solution with a reference count scheme?

Segments files could be reference-counted, as well as individual
segments either directly, possibly by interning SegmentInfo instances,
or indirectly by reference counting all files via Directory.

The most recent checkpoint and snapshot would have an implicit reference
since they can be opened. Each reader and writer creates a reference
when it opens a segments file.

This way segments files and each segment's files would be deleted
precisely when they are no longer used, which would both support NFS and
improve performance on Windows.
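
As a rough sketch of the reference-count idea, assuming counts are kept
per file name: FileRefCounts is a hypothetical class, not a Lucene API,
and it only tracks references inside a single process. Sharing the
counts between machines is a separate problem that this sketch does not
address.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch: counts references to index file names within one JVM.
final class FileRefCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    // Called when a reader or writer opens a commit that uses these files.
    synchronized void incRef(List<String> fileNames) {
        for (String name : fileNames) {
            counts.merge(name, 1, Integer::sum);
        }
    }

    // Called on close; returns the files whose count dropped to zero
    // and which are therefore safe to delete.
    synchronized List<String> decRef(List<String> fileNames) {
        List<String> unreferenced = new ArrayList<>();
        for (String name : fileNames) {
            Integer current = counts.get(name);
            if (current == null) {
                throw new IllegalStateException("not referenced: " + name);
            }
            if (current == 1) {
                counts.remove(name);
                unreferenced.add(name);
            } else {
                counts.put(name, current - 1);
            }
        }
        return unreferenced;
    }
}
```

A file shared by two commits is released only when the second commit
drops its reference, which is the "deleted precisely when no longer
used" behaviour described above.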

Chuck




marvin at rectangular

Jan 18, 2007, 2:10 PM

Post #13 of 58 (6085 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 18, 2007, at 1:58 PM, Chuck Williams wrote:

> How about a direct solution with a reference count scheme?
>
> Segments files could be reference-counted,

There would have to be a file where the refcounts are maintained.
The problem is that if an IndexReader crashes, it could orphan a
refcount, so the files the reader was "using" would never get reclaimed.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





rengels at ix

Jan 18, 2007, 2:12 PM

Post #14 of 58 (6090 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

This won't work with multiple JVMs attached to the same Lucene
directory.

All JVMs need to vote on whether or not certain segments can be
deleted, since the other JVMs can't know this. How you do this...




lucene at mikemccandless

Jan 18, 2007, 2:17 PM

Post #15 of 58 (6088 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 1:58 PM, Chuck Williams wrote:
>
>> How about a direct solution with a reference count scheme?
>>
>> Segments files could be reference-counted,
>
> There would have to be a file where the refcounts are maintained. The
> problem is that if an IndexReader crashes, it could orphan a refcount,
> so the files the reader was "using" would never get reclaimed.

How about if each reader were assigned a unique ID (eg hostname) by
the application, and wrote a file ($ID.inuse or something) into the
index dir referencing the segments_N that it's currently using? Say
the searcher/reader touches this file periodically so writer can
detect that a reader is no longer alive. This wouldn't require
locking (which scares me on NFS) and I think should work?
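
A rough sketch of the marker-file idea, assuming the reader rewrites
its own "$ID.inuse" file to refresh the modification time and the
writer compares that time against a maximum age. ReaderHeartbeat and
its method names are hypothetical. It also assumes the modification
times seen by the writer are comparable with its own clock, which may
not hold with clock skew between machines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical helper; not part of Lucene.
final class ReaderHeartbeat {
    // Reader side: record which segments_N is in use. Rewriting the
    // file also refreshes its modification time.
    static void touch(Path indexDir, String readerId, String segmentsFile)
            throws IOException {
        Path marker = indexDir.resolve(readerId + ".inuse");
        Files.write(marker, segmentsFile.getBytes(StandardCharsets.UTF_8));
    }

    // Writer side: a reader counts as alive if its marker exists and was
    // modified within the last maxAgeMillis.
    static boolean isAlive(Path indexDir, String readerId,
                           long maxAgeMillis, long nowMillis) throws IOException {
        Path marker = indexDir.resolve(readerId + ".inuse");
        if (!Files.exists(marker)) {
            return false;
        }
        FileTime modified = Files.getLastModifiedTime(marker);
        return nowMillis - modified.toMillis() <= maxAgeMillis;
    }
}
```

The writer would skip deleting any segments_N named in a marker whose
reader is still alive by this test.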

Mike



rengels at ix

Jan 18, 2007, 2:20 PM

Post #16 of 58 (6088 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

You would also have to add a requirement that readers touch the file
every N minutes, otherwise dead users will prevent cleanup.



lucene at mikemccandless

Jan 18, 2007, 2:24 PM

Post #17 of 58 (6095 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Marvin Humphrey wrote:
>
> On Jan 17, 2007, at 1:16 PM, Michael McCandless wrote:
>
>> This is the solution I have in mind for LUCENE-710: change the
>> IndexFileDeleter so that instead of always immediately deleting the
>> last commit when a new commit happens, allow some time before doing
>> so. This way readers have a chance to refresh. The actual time would
>> be settable by the developer. So if you set it to 6 hours, then, a
>> commit would remain usable for at least 6 hours after it had been
>> obsoleted by a new commit. This means if you can ensure your readers
>> refresh within 6 hours of a new commit happening, then the writer will
>> never delete an "in-use" commit.
>
> I've been mulling this over. If you set the interval to 6 hours, and
> there's a lot of churn (e.g. if you optimize frequently), you'll end up
> with a lot of wasted disk space. On the flip side, the user has to set
> up some sort of trigger for refreshing the IndexReaders anyway. It's
> still not user-friendly by default, and we'd be polluting the API with a
> hateful workaround.

Well, 6 hours would be a long time for such a high turnover site.
They would presumably set the time to something like 10 minutes
instead.

I think we should decouple the deletion policy from commits. This way
developers could subclass and make their own deletion policy that
suits their application. The IndexFileDeleter base class would do all
the legwork to keep ref counts to all specific index files based on
all segments_N commits that are still "live". Then the deletion
policy just decides which commits should be deleted, when. (This is
roughly what's outlined in LUCENE-710).

The current policy is to delete all prior commits after a new commit
and that would remain the default.

Chuck's idea (reference counting via filesystem) would be another
policy. My proposal (delete by time after being obsoleted) would be
another policy, etc.
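
To make the decoupling concrete, here is a small sketch of what a
pluggable policy could look like. The interface and class names are
hypothetical and are not taken from the LUCENE-710 patch. Commits are
represented simply by their segments_N file names, oldest first, and
KeepLastN is an extra example policy added here for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: given the live commits ordered oldest to
// newest, return the ones that may be deleted now.
interface CommitDeletionPolicy {
    List<String> selectForDeletion(List<String> commitsOldestFirst);
}

// Mirrors the current default described above: keep only the newest commit.
final class KeepOnlyLastCommit implements CommitDeletionPolicy {
    @Override
    public List<String> selectForDeletion(List<String> commitsOldestFirst) {
        if (commitsOldestFirst.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(
            commitsOldestFirst.subList(0, commitsOldestFirst.size() - 1));
    }
}

// Illustrative alternative: keep the newest n commits.
final class KeepLastN implements CommitDeletionPolicy {
    private final int n;

    KeepLastN(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.n = n;
    }

    @Override
    public List<String> selectForDeletion(List<String> commitsOldestFirst) {
        int cut = Math.max(0, commitsOldestFirst.size() - n);
        return new ArrayList<>(commitsOldestFirst.subList(0, cut));
    }
}
```

The deleter would own the ref counting and file removal, and only ask
the policy which commits to drop.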

> The real problem is NFS. For background, see
> <http://nfs.sourceforge.net/#section_d>, item D2, which deals with NFS
> and "delete on last close".
>
> Now I wonder. Version 4 of the NFS protocol introduces state, so it's
> possible to implement file locking. Can we lock a segments file, then
> have IndexFileDeleter detect which segments are locked that way? And if
> that's the case, can we detect whether the locking mechanism is failing
> and throw an exception if someone tries to use an earlier version of NFS?

Locking and NFS makes me very nervous :)

> I'd be cool with making it impossible to put an index on an NFS volume
> prior to version 4. That puts the blame where it belongs.

Well, most times users have no control over which NFS server and/or
client version is in use, so I think taking this approach of "pinning
the blame" can only hurt our users. I would rather find a solution
that's more portable, if we can (like the ref counting idea Chuck
brought up).

Mike



marvin at rectangular

Jan 18, 2007, 2:39 PM

Post #18 of 58 (6093 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 18, 2007, at 2:17 PM, Michael McCandless wrote:

> How about if each reader were assigned a unique ID (eg hostname) by
> the application, and wrote a file ($ID.inuse or something) into the
> index dir referencing the segments_N that it's currently using?

It would have to go in the old /tmp lock dir to deal with the
permissions issue.

I really prefer to have locking managed via the index dir rather than
a shared dir within /tmp.

Without that, multiple machines attempting to write to an index on a
shared volume don't know about each other unless you specifically
manage the locking mechanism -- and thus may corrupt an index.

Plus, it's simpler and better for all the reasons outlined a couple
weeks ago.

> This wouldn't require
> locking (which scares me on NFS)

Well, the thing about the scheme using file locking is that it tests
the locking mechanism once per Directory per session before executing
any delete ops. There's a cost for doing this, but I don't think
it's significant in the grand scheme.

The touching mechanism scares me. :) It's hard to guarantee that it
will always occur in a timely manner.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





rengels at ix

Jan 18, 2007, 2:47 PM

Post #19 of 58 (6096 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

The touching doesn't have to be that timely.

If the index is configured to only keep old segments less than x
hours old, you just check if any of the timestamps are within x hours,
and if not you can delete the segments. So even if the reader is late
updating the timestamp - they would need to be VERY late.

You also need synchronization between the segment deleters, since you
wouldn't want everyone doing it.
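
A minimal sketch of the timestamp check described above, assuming each reader periodically records a "last touched" time (for example the modification time of a marker file). The class and method names are hypothetical and are not part of Lucene; the sketch only shows why a reader has to be later than the whole keep window before anything it uses is removed.

```java
import java.util.Collection;

// Hypothetical helper for illustration only; not a Lucene class.
final class TouchCheck {
    /**
     * Returns true when no reader has touched its marker within the keep
     * window, so old segments may be deleted.
     *
     * @param readerTouchMillis last-touched time of each reader marker
     * @param nowMillis         current time
     * @param keepMillis        the "x hours" window, in milliseconds
     */
    static boolean safeToDelete(Collection<Long> readerTouchMillis,
                                long nowMillis, long keepMillis) {
        for (long touched : readerTouchMillis) {
            if (nowMillis - touched < keepMillis) {
                return false; // a reader was active inside the window
            }
        }
        return true;
    }
}
```

Only one process should run this check and act on it, which is the synchronization between deleters mentioned above.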




DORONC at il

Jan 18, 2007, 2:59 PM

Post #20 of 58 (6097 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

I am not happy with complicating the readers like this: conceptually
adding back commit locks (for deletion), this time with a keep-alive
thread, and again making readers not read-only.

To my understanding, the only remaining issue with NFS is that a reader
might get an IO exception if a writer removed an old file that the
reader is still using.

We are not trying to solve a possible corruption here, right?

If so, I don't think it is worth adding that stuff back.

A writer's "two steps" policy - delete only files that
"would not have been in use unless a reader failed to refresh for X minutes" -
is "fair enough", I think.

By "two steps" I mean: start measuring time not from when the segment to be
deleted was created, but rather from when its "next generation" was
created.
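
The "two steps" rule can be sketched roughly as follows. This is only an illustration under stated assumptions: commits are represented by their creation times, oldest first, and the class name, method name, and the `graceMillis` parameter (the "X minutes") are hypothetical, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class TwoStepExpiry {
    /**
     * commitCreatedMillis.get(i) is the creation time of generation i,
     * oldest first. Returns the positions of generations that may be deleted.
     */
    static List<Integer> deletableGenerations(List<Long> commitCreatedMillis,
                                              long nowMillis, long graceMillis) {
        List<Integer> deletable = new ArrayList<>();
        // The newest generation has no successor, so it is never deletable.
        for (int i = 0; i + 1 < commitCreatedMillis.size(); i++) {
            // The clock starts when the NEXT generation was created,
            // not when generation i itself was created.
            long obsoletedAt = commitCreatedMillis.get(i + 1);
            if (nowMillis - obsoletedAt > graceMillis) {
                deletable.add(i);
            }
        }
        return deletable;
    }
}
```

With creation times 0, 100 and 195, a current time of 200 and a grace period of 10, only the first generation qualifies: the second was obsoleted just 5 time units ago.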



marvin at rectangular

Jan 18, 2007, 4:39 PM

Post #21 of 58 (6087 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 18, 2007, at 2:24 PM, Michael McCandless wrote:
> I think we should decouple the deletion policy from commits. This way
> developers could subclass and make their own deletion policy that
> suits their application.

But your excellent work has brought us so close to just handling all
deletions transparently!

I hate expanding or compromising APIs to deal with implementation-
specific corner-case bugs. It's bad design.

> Well, most times users have no control over which NFS server and/or
> client version is in use,

I wonder what the penetration of NFS version 4 is.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





marvin at rectangular

Jan 18, 2007, 4:42 PM

Post #22 of 58 (6092 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

On Jan 18, 2007, at 2:59 PM, Doron Cohen wrote:

> To my understanding, the only remaining issue with NFS is that a reader
> might get an IO exception if a writer removed an old file that the
> reader is still using.
>
> We are not trying to solve a possible corruption here, right?
>
> If so, I don't think it is worth adding that stuff back.

I agree, Doron.

I'd rather leave NFS as a problem case.

Now... how about having Readers establish advisory read locks when
the operating system supports them? That seems to me to be still in
the spirit of having Readers be read-only.

Then our problem set is reduced even further: only NFS systems using
protocols prior to version 4.

It's probably not even worth it to perform the lock test I proposed
earlier. We just use file systems the way they're supposed to behave
and eventually NFS catches up.

Provisionally, this is what I intend to implement for KS, unless
something better emerges from ongoing discussion.
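
For illustration, here is roughly what an advisory read lock on a segments file could look like in Java, using the standard `java.nio.channels.FileLock` API. The wrapper class `SegmentsReadLock` is hypothetical and is not taken from Lucene or KS; whether the lock is honored across machines depends on the file system and the NFS version, which is exactly the open question in this thread.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not a Lucene or KS class.
final class SegmentsReadLock implements AutoCloseable {
    private final FileChannel channel;
    private final FileLock lock;

    private SegmentsReadLock(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    /**
     * Tries to take a shared lock on the whole segments file.
     * Returns null if another process holds a conflicting lock.
     */
    static SegmentsReadLock tryAcquire(Path segmentsFile) throws IOException {
        // READ access is enough for a shared lock, so the reader stays read-only.
        FileChannel channel = FileChannel.open(segmentsFile, StandardOpenOption.READ);
        FileLock lock = null;
        try {
            lock = channel.tryLock(0L, Long.MAX_VALUE, true); // true = shared
        } finally {
            if (lock == null) {
                channel.close(); // also runs if tryLock threw
            }
        }
        return (lock == null) ? null : new SegmentsReadLock(channel, lock);
    }

    /** False on platforms that turn a shared lock request into an exclusive one. */
    boolean isShared() {
        return lock.isShared();
    }

    @Override
    public void close() throws IOException {
        try {
            lock.release();
        } finally {
            channel.close();
        }
    }
}
```

A deleter in another process could then attempt an exclusive `tryLock` on the same file before removing it and skip the file if that attempt returns null. Note that Java file locks are held on behalf of the whole JVM, so this does not coordinate readers and writers inside a single process.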

> A writer's "two steps" policy - delete only files that
> "would not have been in use unless a reader failed to refresh for X
> minutes" - is "fair enough", I think.
>
> By "two steps" I mean: start measuring time not from when the segment
> to be deleted was created, but rather from when its "next generation"
> was created.

Deletions are processed shortly after a new segments_N file gets
written (at least on KS, and IIRC also for Lucene). You'd always
have to leave deletes to the next commit.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





lucene at mikemccandless

Jan 18, 2007, 5:37 PM

Post #23 of 58 (6096 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Doron Cohen wrote:
> I am not happy with complicating the readers like this: conceptually
> adding back commit locks (for deletion), this time with a keep-alive
> thread, and again making readers not read-only.
>
> To my understanding, the only remaining issue with NFS is that a reader
> might get an IO exception if a writer removed an old file that the
> reader is still using.
>
> We are not trying to solve a possible corruption here, right?
>
> If so, I don't think it is worth adding that stuff back.
>
> A writer's "two steps" policy - delete only files that
> "would not have been in use unless a reader failed to refresh for X minutes" -
> is "fair enough", I think.
>
> By "two steps" I mean: start measuring time not from when the segment to be
> deleted was created, but rather from when its "next generation" was
> created.

Right, this was my original proposed deletion policy (below) for
things to work on NFS.

It does assume/require your application can refresh readers within the
specified time period. A commit (and any segments whose ref counts then
drop to zero) gets deleted after it has been "obsoleted" for more than X
minutes.

Even though it's not perfect (progress not perfection!), I like it the
best of the three options discussed on this thread so far because 1)
it leaves the readers read only, and 2) it should work on all versions
of NFS.

This would just be a different deletion policy, and it wouldn't be the
default one. We would leave the default as "keep only last commit
and delete old one immediately", for backwards compatibility.

Finally, an application can always make its own deletion policy
(subclass IndexFileDeleter) if it needs to.
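
To make the idea of a pluggable deletion policy concrete, here is a rough sketch. It is not the IndexFileDeleter API and not the LUCENE-710 patch: the interface `CommitDeletionPolicy`, the class `PolicyDrivenDeleter`, and the use of bare generation numbers to stand for segments_N commits are all assumptions for illustration. The thread talks about subclassing; a separate interface is used here only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration only; not a Lucene API.
interface CommitDeletionPolicy {
    /** generations are live segments_N numbers, oldest first; returns those to delete now. */
    List<Long> selectForDeletion(List<Long> generations);
}

/** The current default behavior: keep only the most recent commit. */
final class KeepOnlyLastCommit implements CommitDeletionPolicy {
    @Override
    public List<Long> selectForDeletion(List<Long> generations) {
        if (generations.size() <= 1) {
            return new ArrayList<>();
        }
        return new ArrayList<>(generations.subList(0, generations.size() - 1));
    }
}

/** Tracks live commits and asks the policy which ones to drop after each commit. */
final class PolicyDrivenDeleter {
    private final CommitDeletionPolicy policy;
    private final List<Long> liveGenerations = new ArrayList<>();

    PolicyDrivenDeleter(CommitDeletionPolicy policy) {
        this.policy = policy;
    }

    /** Records a new commit and returns the generations the policy chose to delete. */
    List<Long> onNewCommit(long generation) {
        liveGenerations.add(generation);
        List<Long> doomed = policy.selectForDeletion(new ArrayList<>(liveGenerations));
        liveGenerations.removeAll(doomed);
        return doomed;
    }
}
```

A time-based policy ("delete after obsoleted for X minutes") would implement the same interface and simply return fewer generations until the grace period has passed.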

Mike

> Michael McCandless <lucene [at] mikemccandless> wrote on 18/01/2007
> 14:24:16:
>
>> Marvin Humphrey wrote:
>>> On Jan 17, 2007, at 1:16 PM, Michael McCandless wrote:
>>>
>>>> This is the solution I have in mind for LUCENE-710: change the
>>>> IndexFileDeleter so that instead of always immediately deleting the
>>>> last commit when a new commit happens, allow some time before doing
>>>> so. This way readers have a chance to refresh. The actual time would
>>>> be settable by the developer. So if you set it to 6 hours, then, a
>>>> commit would remain usable for at least 6 hours after it had been
>>>> obsoleted by a new commit. This means if you can ensure your readers
>>>> refresh within 6 hours of a new commit happening, then the writer will
>>>> never delete an "in-use" commit.
>>> I've been mulling this over. If you set the interval to 6 hours, and
>>> there's a lot of churn (e.g. if you optimize frequently), you'll end up
>>> with a lot of wasted disk space. On the flip side, the user has to set
>>> up some sort of trigger for refreshing the IndexReaders anyway. It's
>>> still not user-friendly by default, and we'd be polluting the API with a
>>> hateful workaround.
>> Well, 6 hours would be a long time for such a high turnover site.
>> They would presumably set the time to something like 10 minutes
>> instead.
>>
>> I think we should decouple the deletion policy from commits. This way
>> developers could subclass and make their own deletion policy that
>> suits their application. The IndexFileDeleter base class would do all
>> the legwork to keep ref counts to all specific index files based on
>> all segments_N commits that are still "live". Then the deletion
>> policy just decides which commits should be deleted, when. (This is
>> roughly what's outlined in LUCENE-710).
>>
>> The current policy is to delete all prior commits after a new commit
>> and that would remain the default.
>>
>> Chuck's idea (reference counting via filesystem) would be another
>> policy. My proposal (delete by time after being obsoleted) would be
>> another policy, etc.
>>
>>> The real problem is NFS. For background, see
>>> <http://nfs.sourceforge.net/#section_d>, item D2, which deals with NFS
>>> and "delete on last close".
>>>
>>> Now I wonder. Version 4 of the NFS protocol introduces state, so it's
>>> possible to implement file locking. Can we lock a segments file, then
>>> have IndexFileDeleter detect which segments are locked that way? And if
>>> that's the case, can we detect whether the locking mechanism is failing
>>> and throw an exception if someone tries to use an earlier version of NFS?
>> Locking and NFS makes me very nervous :)
>>
>>> I'd be cool with making it impossible to put an index on an NFS volume
>>> prior to version 4. That puts the blame where it belongs.
>> Well, most times users have no control over which NFS server and/or
>> client version is in use, so I think taking this approach of "pinning
>> the blame" can only hurt our users. I would rather find a solution
>> that's more portable, if we can (like the ref counting idea Chuck
>> brought up).
>>
>> Mike


chuck at manawiz

Jan 18, 2007, 10:54 PM

Post #24 of 58 (6093 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

I need to support NFS and would not want to rely on the reader
refreshing in X minutes. Setting X too small risks a query failure and
setting X too large wastes disk space. X would need to be set for 100%
reader availability, implying a large value and a lot of disk space waste.

I like the idea of customizable delete policies in IndexFileDeleter. My
current application does not have the need for multiple processes
accessing the same index, only many threads in a single process. There
are multiple processes cooperating, but each has its own piece of the
index stored separately. So, an in-memory reference count scheme would
work best.

The point is that different applications have different needs. This
could be addressed well by ensuring that IndexFileDeleter is nicely
customizable and has a few common policies available, each suited to a
different kind of application:

1) delete immediately (current): Linux or Windows app with a local
file system.
2) delete after obsolete for X minutes: multiple processes sharing an
index on NFS.
3) keep in-memory reference counts: single process with an index on
NFS, or a more efficient strategy for a single process on Windows.
4) keep persistent reference counts: alternative solution for multiple
processes with an index on NFS.

Reference count schemes might best be done at the Directory level,
analogous to what Linux does. So long as all readers and the writer use
the same Directory, it is easy to keep reference counts.

Perhaps IndexFileDeleter should be integrated into Directory?
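
As a rough illustration of the in-memory variant, a Directory-level reference counter could look something like the sketch below. The class `FileRefCounts` and its methods are hypothetical; this is not how Lucene's Directory is implemented, and it only works when every reader and the writer in the process share the same counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Lucene class.
final class FileRefCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Called when a reader or commit starts using the file. */
    synchronized void incRef(String fileName) {
        counts.merge(fileName, 1, Integer::sum);
    }

    /**
     * Called when a reader or commit stops using the file.
     * Returns true when the count reaches zero, meaning the file can be deleted.
     */
    synchronized boolean decRef(String fileName) {
        Integer current = counts.get(fileName);
        if (current == null) {
            throw new IllegalStateException("not referenced: " + fileName);
        }
        if (current == 1) {
            counts.remove(fileName);
            return true;
        }
        counts.put(fileName, current - 1);
        return false;
    }
}
```

The deleter would remove a file only when `decRef` returns true, so a file still open in some reader is never deleted underneath it.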

Of course one might complain that this is throwing in the towel,
implementing a bunch of options instead of one elegant solution.

Chuck




DORONC at il

Jan 18, 2007, 10:57 PM

Post #25 of 58 (6093 views)
Permalink
Re: Lucene 2.1, soon [In reply to]

Sounds good to me.

So it is IndexFileDeleter that applications can use to guarantee
"their" NFS-safe behavior, namely preventing premature file deletions.
Cool. We can probably write one such alternative at some point, even in
contrib.

But should enabling this way of extending IndexFileDeleter be part
of the coming 2.1 release, or is it just a future wish?

I ask because I am not sure that the current interfaces of/with
IndexFileDeleter are sufficient for this:

1) IndexWriter does not expose setDeleter().
It should probably somehow be in the constructor, because files are
already deleted (or found) at that time.

2) IndexReader allows setting the deleter, but only after the reader
is open. This is okay for its role in commit() (deleting).
But this might be too late for its new role (touching) - some writer
may decide to delete files in between.

There are more questions, but no point in getting to them unless this
extensibility is intended for 2.1. (?)


