
Mailing List Archive: Lucene: Java-Dev

[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

 

 



jira at apache

Jul 4, 2012, 6:16 AM

Post #1 of 61 (627 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406497#comment-13406497 ]

Michael McCandless commented on LUCENE-4190:
--------------------------------------------

History repeats itself: LUCENE-385.

> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
>
> Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog post: http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
> IndexWriter will now (as of 4.0) delete all foreign files from the index directory. We made this change because Codecs are free to write to any files now, so the space of filenames is hard to "bound".
> But if the user accidentally uses the wrong directory (eg c:/) then we will in fact delete important stuff.
> I think we can at least use some simple criteria (must start with _, maybe must fit certain pattern eg _<base36>(_X).Y), so we are much less likely to delete a non-Lucene file....
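The two criteria floated in the issue description can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the regular expression is one possible reading of `_<base36>(_X).Y`, not necessarily what any patch on this issue uses.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: illustrates the two candidate criteria.
final class LuceneFileNameCheck {
    // One reading of _<base36>(_X).Y: an underscore, a base-36 segment name,
    // an optional _suffix, then a dot and an extension.
    private static final Pattern SEGMENT_FILE =
        Pattern.compile("_[0-9a-z]+(_[^.]*)?\\..+");

    private LuceneFileNameCheck() {}

    // Simple criterion: the name must start with an underscore.
    static boolean startsWithUnderscore(String name) {
        return name.startsWith("_");
    }

    // Stricter criterion: the whole name must fit the pattern above.
    static boolean matchesSegmentPattern(String name) {
        return SEGMENT_FILE.matcher(name).matches();
    }
}
```

Under these assumptions a name like `_0.fdt` passes both checks, `boot.ini` passes neither, and `_notes` passes only the simple underscore check.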

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe [at] lucene
For additional commands, e-mail: dev-help [at] lucene


jira at apache

Jul 4, 2012, 6:20 AM

Post #2 of 61 (621 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406500#comment-13406500 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

I don't think we should do anything beyond 'must start with _'.

Otherwise this file handling gets complicated again, and I don't want to see that!

In any case, if we start enforcing what the file names for Lucene files must be,
then when we call SegmentInfo.setFiles, we need an assert that they
all actually match this pattern.




jira at apache

Jul 4, 2012, 6:30 AM

Post #3 of 61 (620 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406510#comment-13406510 ]

Shai Erera commented on LUCENE-4190:
------------------------------------

I also raised an eyebrow when I read this comment. Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index. Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments.

I think it makes sense to change IW behavior to only delete files that start with _. It's a reasonable requirement IMO.

While I don't know the nature of this change, I assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names. If we had an instance that kept track of all files that were created, e.g. every Codec would register its files there (if it wants to protect them from deletion), would that make the decision of which files to delete easier?



jira at apache

Jul 4, 2012, 6:40 AM

Post #4 of 61 (615 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406517#comment-13406517 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

We track the files, but what if your computer crashes during flush? How would we know to erase the broken files?

Let's keep it simple with _.

> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
> Attachments: LUCENE-4190.patch


jira at apache

Jul 4, 2012, 6:42 AM

Post #5 of 61 (615 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406519#comment-13406519 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

The patch is against branch_4x.

> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4190.patch


jira at apache

Jul 4, 2012, 6:42 AM

Post #6 of 61 (618 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406520#comment-13406520 ]

Michael McCandless commented on LUCENE-4190:
--------------------------------------------

bq. Many of the lucene+facet deployments that I know of store the taxonomy index as a sub-directory of the search index

We won't delete directories, just files.

bq. Also, we've been storing other files in the index directory too ... this new feature will affect such existing deployments.

Yeah ... better to move them elsewhere or to a sub dir?

bq. I can assume it's related to IW not knowing which files to delete when a segment is no longer needed, because Codecs can pick their own file names

Right: it's easy to track the positive set (files referenced by current segments); what's harder is the negative set (files created in the past but no longer referenced).

bq. If we had an instance which kept track of all files that were created, e.g. every Codec would register the files there (if it wants to protect from their deletion), would make the decision of which files to delete easier?

In theory it would ... but this would add a fair amount of complexity (we'd have to save this list of files into segments_N). In fact long ago Lucene did this (it had a deletable file which stored the list of files previously created and now to-be-deleted).
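The positive-set approach combined with the underscore check can be sketched like this. It is only an illustration under stated assumptions: `UnreferencedFiles` and `deletable` are hypothetical names, the directory listing and the referenced set are passed in as plain sets, and `segments_N` handling is left out entirely.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: derive deletion candidates from
// the known positive set instead of persisting a list of created files.
final class UnreferencedFiles {
    private UnreferencedFiles() {}

    // A file is a deletion candidate if it is present in the directory,
    // starts with an underscore, and is not referenced by any live segment.
    static Set<String> deletable(Set<String> listing, Set<String> referenced) {
        Set<String> result = new TreeSet<>();
        for (String name : listing) {
            if (name.startsWith("_") && !referenced.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Because the candidates are recomputed from whatever is in the directory, leftovers from a crash are picked up without any saved list, while files that do not start with an underscore are never touched.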



jira at apache

Jul 4, 2012, 6:42 AM

Post #7 of 61 (618 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406521#comment-13406521 ]

Andi Vajda commented on LUCENE-4190:
------------------------------------

I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.



jira at apache

Jul 4, 2012, 6:46 AM

Post #8 of 61 (616 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406525#comment-13406525 ]

Shai Erera commented on LUCENE-4190:
------------------------------------

Let's keep it simple, I agree.



jira at apache

Jul 4, 2012, 6:48 AM

Post #9 of 61 (617 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406529#comment-13406529 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

We can maybe improve on the _ check in the future, but I think this is an easy improvement that will prevent most of the problems.

In this specific case, it's critical to balance the complexity of our code against any improvement here, because there can always be a conflict no matter what we do. _ is a nice 80/20 that is simple to understand :)



jira at apache

Jul 4, 2012, 7:13 AM

Post #10 of 61 (619 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406541#comment-13406541 ]

Uwe Schindler commented on LUCENE-4190:
---------------------------------------

Robert: assertSaneFiles is called directly, so the loop is always executed, even when assertions are disabled.

Should look like:
{code}assert assertSaneFiles();{code}
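The difference between the two call styles can be sketched as follows. The class `FileSetHolder`, its `checksRun` counter, and the two wrapper methods are hypothetical and exist only to make the effect observable; the validation rule shown is the simple underscore check discussed in this thread, not necessarily what the patch does.

```java
import java.util.Collection;

// Hypothetical stand-in for the class under review; only the assert idiom matters.
final class FileSetHolder {
    private final Collection<String> files;
    int checksRun = 0; // counts how often the validation loop actually ran

    FileSetHolder(Collection<String> files) {
        this.files = files;
    }

    // Returns true so it can serve as an assert condition; throws on a bad name.
    boolean assertSaneFiles() {
        checksRun++;
        for (String file : files) {
            if (!file.startsWith("_")) {
                throw new AssertionError("invalid file name: " + file);
            }
        }
        return true;
    }

    void checkAlways() {
        assertSaneFiles();        // plain call: the loop runs on every invocation
    }

    void checkOnlyWithAssertions() {
        assert assertSaneFiles(); // evaluated only when the JVM runs with -ea
    }
}
```

With assertions disabled (the JVM default), `checkOnlyWithAssertions()` never enters the loop, so production code pays nothing for the check, whereas `checkAlways()` pays for it on every call.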



jira at apache

Jul 4, 2012, 7:15 AM

Post #11 of 61 (614 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406542#comment-13406542 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

nice catch, thanks for reviewing Uwe!



jira at apache

Jul 4, 2012, 9:32 AM

Post #12 of 61 (617 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406613#comment-13406613 ]

Michael McCandless commented on LUCENE-4190:
--------------------------------------------

Patch looks good!

> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-4190.patch, LUCENE-4190.patch


jira at apache

Jul 4, 2012, 10:59 AM

Post #13 of 61 (614 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406640#comment-13406640 ]

Hoss Man commented on LUCENE-4190:
----------------------------------

bq. I think that the way to "bound" the namespace of files is to put everything in a subdirectory of the index directory chosen by the user and control the name of that subdirectory, making it clear that this is semi-private to Lucene and that all files in that subdirectory are fair game.

Isn't that in theory already the point of the index directory anyway? How far down the rabbit hole are we going to go?

bq. We won't delete directories, just files.

One sanity check: this may be an orthogonal issue, but is there anything stopping a codec from using subdirectories? What if I have a codec that creates "_mycodec/foo" and "_mycodec/bar" ... will those not get cleaned up?

> IndexWriter deletes non-Lucene files
> ------------------------------------
>
> Key: LUCENE-4190
> URL: https://issues.apache.org/jira/browse/LUCENE-4190
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Robert Muir
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4190.patch, LUCENE-4190.patch


jira at apache

Jul 4, 2012, 11:03 AM

Post #14 of 61 (615 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406642#comment-13406642 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

{quote}
One sanity check: this may be an orthogonal issue, but is there anything stopping a codec from using subdirectories
{quote}

Yes: the fact that the Directory abstraction doesn't have a way of dealing with subdirectories. You can only openInput(String) and createOutput(String).



jira at apache

Jul 4, 2012, 11:03 AM

Post #15 of 61 (618 views)
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406644#comment-13406644 ]

Yonik Seeley commented on LUCENE-4190:
--------------------------------------

bq. isn't that in theory already the point of the index directory anyway?

Solr has always supported certain files being in the index directory (elevation and external file field IIRC).



jira at apache

Jul 4, 2012, 11:07 AM

Post #16 of 61 (614 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406646#comment-13406646 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

Then it's probably a good time to put those somewhere else.

the index directory is for the index.

with flexible indexing, filenames are not set in stone.

despite this issue, those files could get deleted by IndexFileDeleter (especially if they start with the name "segments" or an underscore)



jira at apache

Jul 4, 2012, 11:09 AM

Post #17 of 61 (613 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406648#comment-13406648 ]

Yonik Seeley commented on LUCENE-4190:
--------------------------------------

How about we keep it simple and practical over pedantic and just go with the underscore.



jira at apache

Jul 4, 2012, 11:13 AM

Post #18 of 61 (618 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406652#comment-13406652 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

I'm not trying to be pedantic: I don't want the crazy files() handling we had before: it's too much.

That's why I took this issue (and already committed, see the commits list) to only delete files that start with underscores.

I'm just mentioning that in general it's still not really safe to put files in the index directory: this is just a simple solution that doesn't bring back the complicated files() handling we had that nobody really understood.
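
For readers following along, here is a minimal sketch of the underscore rule described above. `UnderscoreDeletePolicy` and `deletable` are hypothetical names made up for illustration; this is not the committed change and not Lucene's IndexFileDeleter. It assumes the caller already knows which files the index still references.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class UnderscoreDeletePolicy {

    /**
     * Returns the files that would be deleted under the rule
     * "only touch names that start with an underscore".
     * Anything the index still references is always kept.
     */
    static List<String> deletable(List<String> filesInDir, Set<String> referenced) {
        List<String> out = new ArrayList<>();
        for (String name : filesInDir) {
            // Files without the '_' prefix are never considered.
            if (name.startsWith("_") && !referenced.contains(name)) {
                out.add(name);
            }
        }
        return out;
    }
}
```

With a listing of `_0.cfs`, `_1.fdt`, `elevate.xml`, `segments_2` and `_notes.txt`, where only `_0.cfs` and `segments_2` are referenced, the sketch would pick `_1.fdt` and `_notes.txt`. The second one is a user file that merely happens to start with an underscore, which is the sense in which the directory is still not really safe.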



jira at apache

Jul 4, 2012, 12:15 PM

Post #19 of 61 (619 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406669#comment-13406669 ]

Andi Vajda commented on LUCENE-4190:
------------------------------------

If Joe user gives c:\ as their index directory, which is silly, sure, it's even worse to just delete all files in there.
Even if you just delete files there that are prefixed with _, we should know better than that. By putting the files we want to control into their own directory, a subdirectory of the Lucene index directory, there is very little room for mistakes.
_ is just not a namespace for files reserved to Lucene, but a sub-directory chosen by Lucene instead is.
If you persist in picking just _, why not pick _90439043_ to make it at least more unique?
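
A rough sketch of the sub-directory idea, assuming a fixed sub-directory name. `OwnedSubdirectory`, `OWNED_NAME` and the value `lucene-index` are hypothetical and chosen only for illustration; nothing in this thread says Lucene does this.

```java
import java.nio.file.Path;

/** Hypothetical sketch; the sub-directory name is an assumption. */
final class OwnedSubdirectory {

    // Assumed name of the sub-directory that only the library writes to.
    static final String OWNED_NAME = "lucene-index";

    /** The directory the library would treat as its own, under the user's path. */
    static Path ownedDir(Path userDir) {
        return userDir.resolve(OWNED_NAME).normalize();
    }

    /** A file may be deleted only if it lives inside the owned sub-directory. */
    static boolean mayDelete(Path userDir, Path candidate) {
        // normalize() collapses ".." segments so a path cannot escape the sub-directory.
        return candidate.normalize().startsWith(ownedDir(userDir));
    }
}
```

Under this scheme, pointing the index at c:\ by mistake would only ever put files inside that one sub-directory at risk.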



jira at apache

Jul 4, 2012, 12:30 PM

Post #20 of 61 (607 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406671#comment-13406671 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

now I'm ready to be pedantic: I refuse to let file handling get complicated for this stuff. it's not important, and we are trying to make a search engine, not a file manager.

tomorrow, I'll revert my commit and go back to deleting all files, as I was totally happy with the previous situation: just don't put stuff in the lucene index directory.



jira at apache

Jul 4, 2012, 12:52 PM

Post #21 of 61 (603 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406682#comment-13406682 ]

Yonik Seeley commented on LUCENE-4190:
--------------------------------------

bq. Thats why i took this issue (and already committed, see the commits list)

Hmmm, I don't see any commits under this issue. But searching the mailing list, I do see that you correctly used the issue name in the commit log - so I guess we can chalk it up to the recent ASF infra flakiness.

bq. tomorrow, ill revert my commit and go back to deleting all files

-1 to reverting, we've made progress!



jira at apache

Jul 4, 2012, 4:20 PM

Post #22 of 61 (598 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406784#comment-13406784 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

I still plan to revert tomorrow. I think the comments here are a bad sign... I sent us down an unfortunate slippery slope.

we are a search engine library!

don't put shit in the index directory! or we will delete your shit. dead simple.



jira at apache

Jul 4, 2012, 4:33 PM

Post #23 of 61 (595 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406786#comment-13406786 ]

Yonik Seeley commented on LUCENE-4190:
--------------------------------------

bq. I still plan to revert tomorrow.

-1

The now current behavior is more consistent with the legacy behavior, and everyone else seems to agree it's an improvement (although some feel we should go further). Reverting this bug fix without a better fix makes no sense.



jira at apache

Jul 4, 2012, 4:51 PM

Post #24 of 61 (598 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406789#comment-13406789 ]

Robert Muir commented on LUCENE-4190:
-------------------------------------

pretty sure I have the right to revert my own commit.

I can declare the licensing of asl2 as a mistake and instead full gpl if we want to press the point?



jira at apache

Jul 4, 2012, 11:39 PM

Post #25 of 61 (593 views)
Permalink
[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406874#comment-13406874 ]

Shai Erera commented on LUCENE-4190:
------------------------------------

What if we had an object called IndexFileNames with a method accept(String name), that returns true if the file is recognized, false otherwise - that could give applications a way to create a recognized-set of index files:
* Lucene would provide a DefaultIndexFileNames which recognizes all non-codec files
* Either the app would provide an extension to the default (or a wrapper) which recognizes its codec files as well
** Or, we make the Codec responsible for recognizing files too, and then the code would just query the Codec for non-default index files.

Either way, it seems like we can very easily recognize what are index files and what aren't.

When files need to be deleted, it seems simple as well:
* Lucene lists all files in the directory
* Any file that is referenced by the index (I assume we still know which files are needed right?) is kept
* Any other file is queried against IndexFileNames.accept and if it is accepted, it's deleted, otherwise it's left alone.

Since this looks too simple to me, I'm assuming that I'm missing something. If so, can someone please clarify the problem to me?
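
The proposal above could be sketched roughly as follows. All names here are hypothetical: `FileNameRecognizer` stands in for the proposed IndexFileNames object, `DEFAULT` for DefaultIndexFileNames, and the rule that non-codec files are those starting with "segments" is an assumption made only to keep the example small. None of this is an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the proposed IndexFileNames.accept(String). */
interface FileNameRecognizer {
    boolean accept(String name);
}

/** Hypothetical sketch of the three-step sweep described in the comment. */
final class SweepSketch {

    // Assumed default: recognizes only non-codec files, taken here to be "segments*".
    static final FileNameRecognizer DEFAULT = name -> name.startsWith("segments");

    /** Wraps two recognizers so an app (or a codec) can extend the default. */
    static FileNameRecognizer or(FileNameRecognizer a, FileNameRecognizer b) {
        return name -> a.accept(name) || b.accept(name);
    }

    /** List the directory, keep referenced files, delete only recognized leftovers. */
    static List<String> toDelete(List<String> listing, Set<String> referenced,
                                 FileNameRecognizer recognizer) {
        List<String> out = new ArrayList<>();
        for (String name : listing) {
            if (referenced.contains(name)) {
                continue; // still needed by the index
            }
            if (recognizer.accept(name)) {
                out.add(name); // recognized index file that is no longer referenced
            }
            // unrecognized files are left alone
        }
        return out;
    }
}
```

An application with its own codec would pass something like `SweepSketch.or(SweepSketch.DEFAULT, codecRecognizer)`, where `codecRecognizer` accepts that codec's file names; a file such as `elevate.xml` that no recognizer accepts is never touched.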

