Mailing List Archive: Lucene: Java-User

lucene indexes back up strategies

 

 


typhoon_larry at hotmail

Apr 27, 2007, 9:29 AM

Post #1 of 6 (317 views)
Permalink
lucene indexes back up strategies

I'm pondering long-term maintenance issues with Lucene indexes and would
like to hear anyone's suggestions or recommendations for backing up these
indexes. My goal is to have a weekly, or even daily, snapshot of the
current index so that it is recoverable if the index gets corrupted. I
won't be able to reindex, since my database contains millions of records, so
reindexing on-the-fly is not an option. Also, the index is growing so
fast (it is already at the 56GB mark) that I'm not even sure creating a
snapshot copy would be fast enough. Maybe clustering is better?

Hence if anyone has any recommendations regarding backup strategies for
lucene indexes, I would be grateful.

Sincerely,

LH
--
View this message in context: http://www.nabble.com/lucene-indexes-back-up-strategies-tf3658495.html#a10221953
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


mail at mikemccandless

Apr 27, 2007, 10:48 AM

Post #2 of 6 (306 views)
Permalink
Re: lucene indexes back up strategies [In reply to]

"larry hughes" <typhoon_larry [at] hotmail> wrote:

> I'm pondering long-term maintenance issues with Lucene indexes and
> would like to hear anyone's suggestions or recommendations for
> backing up these indexes. My goal is to have a weekly, or even
> daily, snapshot of the current index so that it is recoverable
> if the index gets corrupted. I won't be able to reindex, since my
> database contains millions of records, so reindexing on-the-fly is
> not an option. Also, the index is growing so fast (it is already at
> the 56GB mark) that I'm not even sure creating a snapshot copy would
> be fast enough. Maybe clustering is better?

One effective way to back up the index is to copy only the new files.
Since Lucene is write once (as of 2.1), you only need to back up any
new file names that have appeared since your last backup. You can
also remove all now-deleted filenames if you are only interested in
the most recent snapshot.
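
A minimal sketch of that file-name-based approach, assuming indexing is paused while it runs. The class and method names are hypothetical, and it uses plain java.nio.file rather than any Lucene API. One caveat, stated as an assumption: segments.gen appears to be rewritten on every commit rather than written once, so the sketch always recopies it.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

/** Sketch: mirror a write-once index directory by comparing file names only. */
final class IncrementalIndexBackup {

    /** Returns the names of the regular files directly inside dir. */
    static Set<String> fileNames(Path dir) throws IOException {
        Set<String> names = new HashSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    names.add(p.getFileName().toString());
                }
            }
        }
        return names;
    }

    /**
     * Copies files whose names are new since the last backup and removes
     * backup files whose names have disappeared from the index.
     * Returns the number of files copied.
     * Assumes no writer touches indexDir while this runs.
     */
    static int sync(Path indexDir, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        Set<String> live = fileNames(indexDir);
        Set<String> saved = fileNames(backupDir);

        int copied = 0;
        for (String name : live) {
            // Assumption: segments.gen is the one file that gets rewritten.
            boolean rewritten = name.equals("segments.gen");
            if (rewritten || !saved.contains(name)) {
                Files.copy(indexDir.resolve(name), backupDir.resolve(name),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        for (String name : saved) {
            if (!live.contains(name)) {
                Files.delete(backupDir.resolve(name)); // now-deleted file name
            }
        }
        return copied;
    }
}
```

Because every other file name is written only once, comparing names is enough; no checksums or timestamps are needed. Skip the delete loop if you want to keep older snapshots restorable.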

Normally you must pause indexing to do this backup (since filenames
are changing and/or being deleted) but it's possible with the trunk
version of Lucene to make a simple index deletion policy that would
allow you to run a backup slowly in the background without pausing
indexing.

Basically this deletion policy would "keep alive" the one commit point
that was current when you started your backup, so even as the index is
changing, all segment files referenced by that commit point would not
be deleted; then your backup would copy the files referenced by that
commit point. Once the backup completes, you would allow that
commit point to be deleted.

This would allow you to do "live" backups (backing up while indexing
is still happening).
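
To make the "keep alive" idea concrete, here is a toy model of it. Everything in it (SnapshotPolicyDemo, Commit, snapshot, release) is hypothetical and is not Lucene's deletion-policy API; it only illustrates how pinning one commit point keeps its files from being deleted while newer commits arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: these classes are NOT Lucene's API. */
final class SnapshotPolicyDemo {

    /** A commit point: a generation number plus the files it references. */
    static final class Commit {
        final long generation;
        final Set<String> files;

        Commit(long generation, Set<String> files) {
            this.generation = generation;
            this.files = files;
        }
    }

    private final List<Commit> commits = new ArrayList<>();
    private Commit pinned; // commit held alive for a running backup, or null

    /** Records a new commit, then drops every older commit that is not pinned. */
    synchronized void commit(long generation, String... files) {
        Commit latest = new Commit(generation, new LinkedHashSet<>(Arrays.asList(files)));
        commits.add(latest);
        commits.removeIf(c -> c != latest && c != pinned);
    }

    /**
     * Pins the current commit and returns the file names a backup should copy.
     * Assumes at least one commit exists.
     */
    synchronized Set<String> snapshot() {
        pinned = commits.get(commits.size() - 1);
        return new LinkedHashSet<>(pinned.files);
    }

    /** Unpins; the old commit is dropped if it is no longer the latest. */
    synchronized void release() {
        Commit latest = commits.get(commits.size() - 1);
        pinned = null;
        commits.removeIf(c -> c != latest);
    }

    /** Files that must stay on disk: the union over all surviving commits. */
    synchronized Set<String> liveFiles() {
        Set<String> live = new LinkedHashSet<>();
        for (Commit c : commits) {
            live.addAll(c.files);
        }
        return live;
    }
}
```

A backup would call snapshot(), copy the returned files at its own pace, then call release(). Commits made in between add files but never remove the ones being copied.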

Mike
--
Michael McCandless
mail [at] mikemccandless




typhoon_larry at hotmail

Apr 27, 2007, 11:36 AM

Post #3 of 6 (307 views)
Permalink
Re: lucene indexes back up strategies [In reply to]

Thanks Mike,

Wow, I did not know Lucene 2.1 can do all of this. The problem is that I'm
currently using 2.0. Is there something similar to what you just mentioned
in dealing with 2.0 indexes--backing up piecewise? Thanks again.

LH





mail at mikemccandless

Apr 27, 2007, 1:13 PM

Post #4 of 6 (310 views)
Permalink
Re: lucene indexes back up strategies [In reply to]

"larry hughes" <typhoon_larry [at] hotmail> wrote:

> Wow, I did not know Lucene 2.1 can do all of this. The problem is that I'm
> currently using 2.0. Is there something similar to what you just mentioned
> in dealing with 2.0 indexes--backing up piecewise? Thanks again.

Hmm, OK. Pre-2.1 Lucene will overwrite at least the file "segments",
*.del (per segment deletions) and *.sN (only if you set norms, which
is a rather advanced function). So it is probably best to use something
like "rsync", which I believe looks at timestamp and file size to
determine that a file has changed, and then copies it over.

Also make sure all writers are closed before running the backup and that
no writer opens until the backup completes (i.e., they are mutually
exclusive).
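
For illustration, a rough Java sketch of that size-plus-timestamp check, assuming no writer is open while it runs. QuickCheckBackup and its methods are hypothetical names, and this is only an approximation of rsync's default quick check, not a replacement for it. Timestamps are compared in whole seconds to avoid false mismatches from precision loss.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.TimeUnit;

/** Sketch: recopy a file when its size or modification time differs. */
final class QuickCheckBackup {

    /** Modification time truncated to whole seconds. */
    private static long mtimeSeconds(Path p) throws IOException {
        return Files.getLastModifiedTime(p).to(TimeUnit.SECONDS);
    }

    /** True if dst is missing or differs from src in size or mtime. */
    static boolean needsCopy(Path src, Path dst) throws IOException {
        if (!Files.exists(dst)) {
            return true;
        }
        return Files.size(src) != Files.size(dst)
                || mtimeSeconds(src) != mtimeSeconds(dst);
    }

    /**
     * Copies every new or changed regular file and returns how many were
     * copied. Assumes all index writers are closed for the duration.
     */
    static int sync(Path indexDir, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        int copied = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir)) {
            for (Path src : stream) {
                if (!Files.isRegularFile(src)) {
                    continue;
                }
                Path dst = backupDir.resolve(src.getFileName());
                if (needsCopy(src, dst)) {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                    // Carry the timestamp over so the next run sees a match.
                    Files.setLastModifiedTime(dst, Files.getLastModifiedTime(src));
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

Unlike a name-only comparison, this picks up files such as "segments" that are overwritten in place. It does not remove stale files from the backup directory.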

Mike
--
Michael McCandless
mail [at] mikemccandless




hossman_lucene at fucit

Apr 27, 2007, 5:20 PM

Post #5 of 6 (303 views)
Permalink
Re: lucene indexes back up strategies [In reply to]

: > Wow, I did not know Lucene 2.1 can do all of this. The problem is that I'm
: > currently using 2.0. Is there something similar to what you just mentioned
: > in dealing with 2.0 indexes--backing up piecewise? Thanks again.
:
: Hmm, OK. Pre-2.1 Lucene will overwrite at least the file "segments",
: *.del (per segment deletions) and *.sN (only if you set norms, which
: is a rather advanced function). So it is probably best to use something
: like "rsync", which I believe looks at timestamp and file size to
: determine that a file has changed, and then copies it over.

You should take a look at the snapshooter and backup scripts that come
with Solr. The concepts can be applied to any Lucene index, not just Solr,
but they rely on your filesystem supporting hard links...

http://svn.apache.org/viewvc/lucene/solr/trunk/src/scripts/
http://wiki.apache.org/solr/CollectionDistribution
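
As a rough illustration of the hard-link idea (not a port of the Solr scripts), the sketch below links every index file into a snapshot directory on the same filesystem. HardLinkSnapshot is a hypothetical name. It assumes index files are never modified in place once written, and that the filesystem supports hard links.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: take a near-instant snapshot of an index using hard links. */
final class HardLinkSnapshot {

    /**
     * Hard-links every regular file in indexDir into a new snapshotDir
     * and returns the number of links created. Both directories must be
     * on the same filesystem.
     */
    static int take(Path indexDir, Path snapshotDir) throws IOException {
        Files.createDirectory(snapshotDir); // fails if the snapshot already exists
        int linked = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir)) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    // createLink(link, existing): the new name shares the same data.
                    Files.createLink(snapshotDir.resolve(file.getFileName()), file);
                    linked++;
                }
            }
        }
        return linked;
    }
}
```

No data is copied, so even a large index snapshots quickly. When the index later deletes a file, the snapshot's link keeps the data alive until the snapshot itself is removed. The snapshot can then be copied to another machine at leisure.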


-Hoss




typhoon_larry at hotmail

May 1, 2007, 1:12 PM

Post #6 of 6 (280 views)
Permalink
Re: lucene indexes back up strategies [In reply to]

Hi Mike,

I decided to go ahead and upgrade to Lucene 2.1. My regression tests seem
fine. However, I still don't understand the files of the index you've
described.

> You can also remove all now-deleted filenames if you are only interested
> in the most recent snapshot.

I'm not sure which "now-deleted filenames" you are referring to. I'm
indexing a sample set of 10,000 records for my little unit test and I don't
see (or know) which files would be considered "deleted". I only see five
files for my 10k records. They are:

_cd.cfs
_d1.cfs
clearcache
segments.gen
segments_d3


> it's possible with the trunk version of Lucene to make a simple index
> deletion policy that would allow you to run a backup slowly in the
> background without pausing indexing.

Two questions here. 1) When you say trunk version, can I safely assume
that it has already been packaged in lucene-2.1.0.tar.gz, distributed on
17-Feb-2007? 2) I don't understand what you mean by "simple index deletion
policy". I think this is a new feature, since I can't find it anywhere in my
Lucene In Action book. Is there some documentation on this? It seems like
this deletion policy is the key to making live backups.

Thank you.

LH



