
Mailing List Archive: Lucene: Java-User

Indexing problem

 

 



daryl at montagetech

Nov 1, 2001, 5:46 PM

Post #1 of 18 (951 views)
Permalink
Indexing problem

Hi

Since upgrading to 1.2 we've started getting the following error when
creating an index in a directory with a large number of files. Previous
versions of Lucene were quite happy to index this directory.

Any thoughts as to the cause?

-d

java.io.FileNotFoundException:
/private/Network/Servers/montage/Volumes/Disk2/Users/daryl/Library/Index/index5.
mtlibx/_n8.f3 (Too many open files)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:98)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:143)
at org.apache.lucene.store.FSInputStream$Descriptor.<init>(Unknown
Source)
at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
at org.apache.lucene.index.SegmentReader.openNorms(Unknown Source)
at org.apache.lucene.index.SegmentReader.<init>(Unknown Source)
at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
at org.apache.lucene.index.IndexWriter.maybeMergeSegments(Unknown
Source)
at org.apache.lucene.index.IndexWriter.addDocument(Unknown Source)
at IndexCreator.indexDocs(IndexCreator.java:75)
at IndexCreator.indexDocs(IndexCreator.java:67)
at IndexCreator.indexDocs(IndexCreator.java:67)
at IndexCreator.indexDocs(IndexCreator.java:67)
at IndexCreator.indexDocs(IndexCreator.java:67)
at IndexCreator.indexDocs(IndexCreator.java:67)
at IndexCreator.createIndex(IndexCreator.java:44)

------
Daryl Thachuk daryl[at]montagetech.com
Montage Technologies Inc.
http://www.montagetech.com


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe[at]jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help[at]jakarta.apache.org>


alex.paransky at individualnetwork

Nov 1, 2001, 6:05 PM

Post #2 of 18 (929 views)
Permalink
RE: Indexing problem [In reply to]

I was getting the same error under Windows 2000. It has something to do
with file handles being opened. When we were doing a few searches at the
same time, the file handle count went as high as 2000. At around 2100, we
started to get this error. I have searched some documentation, and all the
docs I have found say that the number of file handles in Windows is limited
by memory. Well, I have 512MB of memory in this machine plus 700MB of swap
space, so I cannot be running out of memory.

To make a long story short, I don't know what is causing this problem or
how to fix it.

-AP_



daryl at montagetech

Nov 1, 2001, 6:08 PM

Post #3 of 18 (934 views)
Permalink
Re: Indexing problem [In reply to]

Since pre-1.2 versions of Lucene do not demonstrate this problem (I've
confirmed this), I suspect something is not being closed properly in the
index creation process.

-d

On Thursday, November 1, 2001, at 06:05 PM, Alex Paransky wrote:

> I was getting the same error under Windows 2000. It has something to do
> with file handles being opened. When we were doing a few searches at the
> same time, the file handle count went as high as 2000. At around 2100, we
> started to get this error. I have searched some documentation, and all the
> docs I have found say that the number of file handles in Windows is limited
> by memory. Well, I have 512MB of memory in this machine plus 700MB of swap
> space, so I cannot be running out of memory.
>
> To make a long story short, I don't know what is causing this problem or
> how to fix it.
>
> -AP_
>
------
Daryl Thachuk daryl[at]montagetech.com
Montage Technologies Inc.
http://www.montagetech.com




teskridge at ai

Nov 1, 2001, 8:16 PM

Post #4 of 18 (933 views)
Permalink
RE: Indexing problem [In reply to]

I have seen this problem, too, since moving to 1.2 although I'm not sure
that 1.2 is the cause... but it is suspicious.

Tom Eskridge




scott.ganyo at eTapestry

Nov 2, 2001, 5:13 AM

Post #5 of 18 (923 views)
Permalink
RE: Indexing problem [In reply to]

Yes. You have too many open files. There are a few things you can try.

1) Increase the number of file handles your system has available. Yes, there
   is a setting for this in Windows.
2) Make sure that you have the IndexWriter.maxMergeDocs set to
   Integer.MAX_VALUE (the default).
3) Try smaller values for IndexWriter.mergeFactor (default is 10).
4) When all else fails, do all your indexing in memory and then write it out
   to disk when you're done. Doug posted an example of this just a couple
   days ago.

Scott



pfriedman at macromedia

Nov 2, 2001, 7:12 AM

Post #6 of 18 (928 views)
Permalink
RE: Indexing problem [In reply to]

Where can I get Doug's example of indexing in memory and then writing it out
to disk? I just recently subscribed to this list and I can't find it in the
archive.

Thanks.
Paul



scott.ganyo at eTapestry

Nov 2, 2001, 8:47 AM

Post #7 of 18 (928 views)
Permalink
RE: Indexing problem [In reply to]

Well, I don't know if there's an archive of the list, so this is what Doug
wrote:

"
A more efficient and slightly more complex approach would be to build large
indexes in RAM, and copy them to disk with IndexWriter.addIndexes:

  IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
  while (... more docs to index...) {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
    ... add 100,000 docs to ramWriter ...
    ramWriter.optimize();
    ramWriter.close();
    fsWriter.addIndexes(new Directory[] { ramDir });
  }
  fsWriter.optimize();
  fsWriter.close();
"

Scott



daryl at montagetech

Nov 2, 2001, 9:03 AM

Post #8 of 18 (921 views)
Permalink
Re: Indexing problem [In reply to]

A question I'd like answered is, why do I now have to be concerned about
having too many files open when before I didn't? What has changed to
cause this? This sounds like a bug to me.

-d

On Friday, November 2, 2001, at 05:13 AM, Scott Ganyo wrote:

> Yes. You have too many open files. There are a few things you can
> try. 1)
> Increase the number of file handles your system has available. Yes,
> there
> is a setting for this in Windows. 2) Make sure that you have the
> IndexWriter.maxMergeDocs set to Integer.MAX_VALUE (the default). 3) Try
> smaller values for IndexWriter.mergeFactor (default is 10). 4) When all
> else fails, do all your indexing in memory and then write it out to disk
> when you're done. Doug posted an example of this just a couple days
> ago.
>
------
Daryl Thachuk daryl[at]montagetech.com
Montage Technologies Inc.
http://www.montagetech.com




jaxtrx at xde

Nov 2, 2001, 9:12 AM

Post #9 of 18 (940 views)
Permalink
RE: Indexing problem [In reply to]

Don't know if this helps, but I indexed over 600,000 150 KB HTML files on W2K
and Linux, then 40,000 2 MB HTML files, and didn't have any issues. I
used the demo HTML indexer class.





daryl at montagetech

Nov 2, 2001, 9:15 AM

Post #10 of 18 (933 views)
Permalink
Re: Indexing problem [In reply to]

What version of Lucene did you use?

-d

On Friday, November 2, 2001, at 09:12 AM, jaxtrx wrote:

> Don't know if this helps, but I indexed over 600,000 150k html files on
> W2K
> and linux, then I did 40,000 2mb html files and didn't have any
> issues. I
> used the demo html indexer class.
>
------
Daryl Thachuk daryl[at]montagetech.com
Montage Technologies Inc.
http://www.montagetech.com




jaxtrx at xde

Nov 2, 2001, 9:23 AM

Post #11 of 18 (940 views)
Permalink
RE: Indexing problem [In reply to]

lucene-1.2-rc2



DCutting at grandcentral

Nov 2, 2001, 9:49 AM

Post #12 of 18 (930 views)
Permalink
RE: Indexing problem [In reply to]

> From: Daryl Thachuk [mailto:daryl[at]montagetech.com]
>
> A question I'd like answered is, why do I now have to be
> concerned about
> having too many files open when before I didn't? What has changed to
> cause this? This sounds like a bug to me.

Sigh.

IndexReader now keeps all files that are not read entirely into memory open
as long as the IndexReader is open. This was to fix the bug where another
thread or process, while updating the index, would delete files that an open
index reader might need. So there are now a few more files kept open per
segment, making it easier to run out of file handles. IndexWriter uses
IndexReader internally, so the number of open files while indexing has also
increased.

In particular, there are five files, plus one per field, kept open per
segment. While indexing, a maximum of IndexWriter.mergeFactor+1 segments
are ever open at once. So a million-document, three-field index with
IndexWriter.mergeFactor=10 would have a maximum of (5+3)*(10+1) = 88 files
open at a time while indexing.
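
As a rough check of that arithmetic, here is a minimal Java sketch. It is not Lucene code: the class and method names are hypothetical, and it only encodes the two figures given above (five files plus one per field per segment, and at most mergeFactor+1 segments open while indexing).

```java
// Hypothetical helper, not part of Lucene: estimates open files while indexing.
class IndexingFileEstimate {

    // Files kept open per segment: five fixed files plus one per field.
    static int filesPerSegment(int fieldCount) {
        return 5 + fieldCount;
    }

    // While indexing, at most mergeFactor + 1 segments are open at once.
    static int maxFilesWhileIndexing(int fieldCount, int mergeFactor) {
        return filesPerSegment(fieldCount) * (mergeFactor + 1);
    }

    public static void main(String[] args) {
        // Three fields, default mergeFactor of 10: (5 + 3) * (10 + 1)
        System.out.println(maxFilesWhileIndexing(3, 10)); // prints 88
    }
}
```

Plugging in a larger mergeFactor shows how quickly the ceiling grows; for example, a mergeFactor of 50 with the same three fields gives (5+3)*(50+1) = 408.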

Note, however, that an IndexReader must keep all segments open. The maximum
number of segments in an index is (k - 1) * (log_k(N) - 1), where k is the
IndexWriter.mergeFactor and N is the number of documents. So an index with
a million documents could have up to 9 * (6 - 1) = 45 segments (on average it
will have 22.5). With three fields, an unoptimized IndexReader would require a
maximum of 45 * 8 = 360 open files. Once optimized to a single segment, it
would require only 8 open files.
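
The segment bound can be sketched the same way. This is an illustrative Java snippet, not Lucene code; the names are hypothetical. Two choices here are assumptions of the sketch rather than anything stated above: log_k(N) is rounded down to an integer (computed by repeated division to avoid floating-point error), and the result is clamped to at least one segment for very small indexes.

```java
// Hypothetical helper, not part of Lucene: bounds the files an IndexReader holds open.
class ReaderFileEstimate {

    // Largest e with k^e <= n, computed with integer division to avoid
    // floating-point error in Math.log(n) / Math.log(k).
    static int floorLog(long n, int k) {
        if (k < 2 || n < 1) {
            throw new IllegalArgumentException("need k >= 2 and n >= 1");
        }
        int e = 0;
        while (n >= k) {
            n /= k;
            e++;
        }
        return e;
    }

    // (k - 1) * (log_k(N) - 1), clamped to at least one segment
    // (the clamp is an assumption of this sketch).
    static int maxSegments(long docCount, int mergeFactor) {
        int bound = (mergeFactor - 1) * (floorLog(docCount, mergeFactor) - 1);
        return Math.max(1, bound);
    }

    // Each open segment holds five files plus one per field.
    static int maxReaderFiles(long docCount, int mergeFactor, int fieldCount) {
        return maxSegments(docCount, mergeFactor) * (5 + fieldCount);
    }

    public static void main(String[] args) {
        System.out.println(maxSegments(1_000_000L, 10));       // 9 * (6 - 1) = 45
        System.out.println(maxReaderFiles(1_000_000L, 10, 3)); // 45 * 8 = 360
        System.out.println(5 + 3);                             // one optimized segment: 8
    }
}
```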

In practice, this should not be a problem. Have you raised
IndexWriter.mergeFactor? If so, try lowering it to the default, 10. Are
you also opening IndexReaders in the same process? If so, keep just one per
index, shared by all search threads, and, if possible, only open a new one
when the index has just been optimized. Ideally, document additions should
be batched, and finished by a call to optimize(). Not only do optimized
indexes have fewer files open, but they're much faster to search.

Strictly speaking, since there is only supposed to be a single writer for an
index at a time, IndexWriter does not need to keep files open except when it
is using them. So the number of file handles used while indexing could be
reduced if IndexWriter were permitted to open IndexReaders in a special
private mode, where files are opened on demand and closed promptly. That
said, this might permit you to more easily create an index that you cannot
read!

On the upside, at search time, each query used to open a file per term (two
files per phrase term) per segment. So big queries, or lots of concurrent
small ones, used to run out of file handles. This is no longer the case.
IndexReader now opens every file once and only once. Now it just keeps most
of them open...
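
To put a number on the older search-time behaviour described above, here is a small Java sketch. It is illustrative only, with hypothetical names, and it assumes the worst case where every concurrent query holds all of its files open at the same moment.

```java
// Hypothetical helper, not part of Lucene: file opens per query under the older behaviour.
class QueryFileEstimate {

    // One file per term, two per phrase term, for every segment in the index.
    static int filesOpenedPerQuery(int segments, int plainTerms, int phraseTerms) {
        return segments * (plainTerms + 2 * phraseTerms);
    }

    public static void main(String[] args) {
        // 45 segments, three plain terms and two phrase terms: 45 * (3 + 2 * 2)
        int perQuery = filesOpenedPerQuery(45, 3, 2);
        System.out.println(perQuery);      // prints 315
        // Ten such queries running concurrently (worst-case assumption)
        System.out.println(perQuery * 10); // prints 3150
    }
}
```

Against an index optimized to a single segment, the same query would have opened only 3 + 2 * 2 = 7 files under the older scheme.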

Doug




daryl at montagetech

Nov 2, 2001, 10:04 AM

Post #13 of 18 (922 views)
Permalink
Re: Indexing problem [In reply to]

That makes sense. Thanks for the explanation, Doug. I'll make the
appropriate changes to my code.

Thanks again for your help

-daryl

------
Daryl Thachuk daryl[at]montagetech.com
Montage Technologies Inc.
http://www.montagetech.com




wdavies at overture

Nov 2, 2001, 2:22 PM

Post #14 of 18 (929 views)
Permalink
RE: Indexing problem [In reply to]

Hi,

I implemented this, and I do have a question:
what should IndexWriter.mergeFactor and IndexWriter.maxMergeDocs be set to?


Cheers,
Winton


>Well, I don't know if there's an archive of the list, so this what Doug
>wrote:
>
>"
>A more efficient and slightly more complex approach would be to build
>large
>indexes in RAM, and copy them to disk with IndexWriter.addIndexes:
> IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
> while (... more docs to index...)
> RAMDirectory ramDir = new RAMDirectory();
> IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
> ... add 100,000 docs to ramWriter ...
> ramWriter.optimize();
> ramWriter.close();
> fsWriter.addIndexes(new Directory[] { ramDir });
> }
> fsWriter.optimize();
> fsWriter.close();
>"
>
>Scott
>
>> -----Original Message-----
>> From: Paul Friedman [mailto:pfriedman[at]macromedia.com]
>> Sent: Friday, November 02, 2001 9:13 AM
>> To: 'Lucene Users List'
>> Subject: RE: Indexing problem
>>
>>
>> Where can I get Doug's example of indexing in memory and then
>> writing it out
>> to disk? I just recently subscribed to this list and I can't
>> find it in the
>> archive.
>>
>> Thanks.
>> Paul
>>
>> -----Original Message-----
>> From: Scott Ganyo [mailto:scott.ganyo[at]eTapestry.com]
>> Sent: Friday, November 02, 2001 7:14 AM
>> To: 'Lucene Users List'
>> Subject: RE: Indexing problem
>>
>>
>> Yes. You have too many open files. There are a few things
>> you can try. 1)
>> Increase the number of file handles your system has
>> available. Yes, there
>> is a setting for this in Windows. 2) Make sure that you have the
>> IndexWriter.maxMergeDocs set to Integer.MAX_VALUE (the
>> default). 3) Try
>> smaller values for IndexWriter.mergeFactor (default is 10).
>> 4) When all
>> else fails, do all your indexing in memory and then write it
>> out to disk
>> when you're done. Doug posted an example of this just a
>> couple days ago.
>>
>> Scott




pfriedman at macromedia

Nov 9, 2001, 10:03 AM

Post #15 of 18 (927 views)
Permalink
RE: Indexing problem [In reply to]

When using the fsWriter, the actual file I/O doesn't occur until I close the writer, right? So wouldn't it be just as efficient to do the following:

IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, false);
while (... more docs to index...) {
  ... add 100,000 docs to fsWriter ...
}
fsWriter.optimize();
fsWriter.close();

-----Original Message-----
From: Scott Ganyo [mailto:scott.ganyo[at]eTapestry.com]
Sent: Friday, November 02, 2001 10:47 AM
To: 'Lucene Users List'
Subject: RE: Indexing problem


Well, I don't know if there's an archive of the list, so this is what Doug
wrote:

"
A more efficient and slightly more complex approach would be to build large
indexes in RAM, and copy them to disk with IndexWriter.addIndexes:

IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
while (... more docs to index...) {
  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
  ... add 100,000 docs to ramWriter ...
  ramWriter.optimize();
  ramWriter.close();
  fsWriter.addIndexes(new Directory[] { ramDir });
}
fsWriter.optimize();
fsWriter.close();
"

Scott
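
For anyone who wants to see the shape of that loop without a Lucene jar on hand, here is a rough, self-contained model of the batching pattern. It is not Lucene code: DiskIndex and addBatch are made-up stand-ins for the fsWriter and its addIndexes call, and the documents are plain strings. It only shows the control flow: build one batch in memory, hand it over in a single call, start a fresh batch, and don't lose the last partial batch.

```java
import java.util.ArrayList;
import java.util.List;

public class RamBatchSketch {

    /** Hypothetical stand-in for the on-disk index; it only records what was merged in. */
    static class DiskIndex {
        final List<String> docs = new ArrayList<>();
        int mergeCalls = 0;

        // Plays the role of fsWriter.addIndexes(...): one call per in-memory batch.
        void addBatch(List<String> ramBatch) {
            docs.addAll(ramBatch);
            mergeCalls++;
        }
    }

    /** Builds each batch in memory, then hands it to the disk index in a single call. */
    static DiskIndex indexInBatches(List<String> allDocs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        DiskIndex disk = new DiskIndex();
        for (int start = 0; start < allDocs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, allDocs.size());
            // A fresh in-memory batch per iteration, like the new RAMDirectory in the loop.
            List<String> ramBatch = new ArrayList<>(allDocs.subList(start, end));
            disk.addBatch(ramBatch);
        }
        return disk;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            docs.add("doc" + i);
        }
        DiskIndex disk = indexInBatches(docs, 100);
        System.out.println(disk.docs.size() + " docs in " + disk.mergeCalls + " batches");
    }
}
```

The batch size is the knob: a bigger batch means fewer hand-offs to disk but more memory held at once, so it has to fit in the heap you give the JVM.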

> -----Original Message-----
> From: Paul Friedman [mailto:pfriedman[at]macromedia.com]
> Sent: Friday, November 02, 2001 9:13 AM
> To: 'Lucene Users List'
> Subject: RE: Indexing problem
>
>
> Where can I get Doug's example of indexing in memory and then
> writing it out
> to disk? I just recently subscribed to this list and I can't
> find it in the
> archive.
>
> Thanks.
> Paul



anders at visator

Jun 13, 2002, 2:33 AM

Post #16 of 18 (926 views)
Permalink
RE: Indexing Problem [In reply to]

Or, it could be because you have to set a higher value on maxFieldLength in
the IndexWriter you use.

From the javadoc (IndexWriter):

"public int maxFieldLength

The maximum number of terms that will be indexed for a single field in a
document. This limits the amount of memory required for indexing, so that
collections with very large files will not crash the indexing process by
running out of memory.

By default, no more than 10,000 terms will be indexed for a field."
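
To see why a word near the end of a large file can go missing under such a limit, here is a small self-contained model. It is not Lucene code: the whitespace tokenizer and the indexedTerms helper are made up for illustration, and it assumes the limit simply counts tokens in order of occurrence and drops everything after that.

```java
import java.util.HashSet;
import java.util.Set;

public class FieldLengthSketch {

    /** Returns the distinct terms among the first maxFieldLength tokens of the text. */
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        Set<String> terms = new HashSet<>();
        int count = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for empty input
            }
            if (count >= maxFieldLength) {
                break; // everything past the limit is silently dropped
            }
            terms.add(token.toLowerCase());
            count++;
        }
        return terms;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 12000; i++) {
            sb.append("filler ");
        }
        sb.append("korfut"); // the marker word sits after 12,000 other terms

        // With a 10,000-term limit the marker is never seen; with 20,000 it is.
        System.out.println(indexedTerms(sb.toString(), 10000).contains("korfut"));
        System.out.println(indexedTerms(sb.toString(), 20000).contains("korfut"));
    }
}
```

Since the javadoc above lists maxFieldLength as a public int on IndexWriter, raising it would presumably be a plain assignment on the writer before any documents are added. No error is reported when the limit is hit, which would match the "no errors when indexing" symptom.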


Venlig hilsen / Best regards

Anders Nielsen
Adm. direktør
_____________

Visator ApS
Kroghsgade 1
2100 Kbh. Ø
Tlf: 3555 4702
Mobil: 2671 3663
_____________



-----Original Message-----
From: Nader S. Henein [mailto:nsh[at]bayt.net]
Sent: 13. juni 2002 11:31
To: Lucene Users List; korfut[at]lycos.com
Subject: RE: Indexing Problem


Attach some code or else all you'll get is speculation. I imagine it has
something to do with your method, as I have indexed 40 MB files.

-----Original Message-----
From: none none [mailto:korfut[at]lycos.com]
Sent: Thursday, June 13, 2002 7:10 AM
To: Lucene Users List
Subject: Indexing Problem


Hi,
I have a big problem, and I don't know if it is a bug or my fault.
Documents bigger than about 50-60 KB are not properly indexed.
For example, I added one document of 100 KB and one of 1 MB, each with
the word "korfut" (my nickname) near the end, and when I run a search
for that word I get no results.
I get no errors when indexing.
My document class is almost the same as the default FileDocument.java.
Can someone tell me what's going on?
Can someone run the same test to make sure?

thank you.





erik at ehatchersolutions

Feb 18, 2006, 4:12 AM

Post #17 of 18 (926 views)
Permalink
Re: Indexing problem [In reply to]

On Feb 18, 2006, at 6:11 AM, revati joshi wrote:
> I'm facing a problem while indexing files. Some of the files are
> not in plain ASCII; they contain Arabic or French text, and I
> don't want to index those files.
> Because of these files my indexing process halts partway
> through. Is there any class in Lucene to ignore such files?
> Please suggest a solution.

No, there is nothing in Lucene to help you here. Lucene does not
deal with files, only text. It is your application that needs to
deal with files appropriately. Perhaps you could simply catch and
ignore exceptions to deal with this situation?
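
A minimal sketch of that catch-and-skip idea, on the application side and using only the JDK. Everything here is an assumption for illustration: the class and method names are invented, "not ASCII" is taken to mean "fails strict US-ASCII decoding", and the point where a real application would hand the text to Lucene is only marked by a comment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipBadFilesSketch {

    /** Decodes bytes as strict US-ASCII; throws if any byte falls outside that charset. */
    static String decodeAscii(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Returns the names of inputs that decoded cleanly; the rest are logged and skipped. */
    static List<String> indexableNames(Map<String, byte[]> files) {
        List<String> accepted = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            try {
                // A real application would build a document from this text and index it.
                decodeAscii(entry.getValue());
                accepted.add(entry.getKey());
            } catch (CharacterCodingException ex) {
                // One bad file no longer halts the whole run.
                System.err.println("Skipping " + entry.getKey() + ": " + ex);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("plain.txt", "hello world".getBytes(StandardCharsets.US_ASCII));
        files.put("french.txt", "d\u00e9j\u00e0 vu".getBytes(StandardCharsets.UTF_8));
        System.out.println(indexableNames(files));
    }
}
```

The try/catch sits inside the loop on purpose, so a failure on one file is reported and the loop moves on to the next.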

Erik




vsirishreddy at yahoo

Nov 15, 2007, 1:15 PM

Post #18 of 18 (298 views)
Permalink
Re: Indexing Problem [In reply to]

Wow!! Thanks dude... that works... I have spent almost a day figuring out the
issue... I appreciate it!!

Erick Erickson wrote:
>
> Your problem is probably that, by default, Lucene stops indexing a field
> after 10,000 terms. See IndexWriter.setMaxFieldLength.
>
> Best
> Erick
>
> On Nov 15, 2007 1:42 PM, Sirish <vsirishreddy[at]yahoo.co.in> wrote:
>
>>
>> The following is my code snippet for indexing the text:
>>
>> document.add(Field.Text(IFIELD_TEXT, billMeasureDoc.getText()));
>>
>> Whenever the text is short, it works perfectly. But in a few cases,
>> when the text is very long (around 1000 lines or more), it causes a
>> problem.
>>
>> When the text is long, after indexing, the document can be found only
>> by terms from the first part of the text. For example, let's say we
>> have a text:
>>
>> Test Test Test Test Test Test Test Test Test
>> .......... .......... .......... .......... .......... .......... .....
>> Test1 Test1 Test1 Test1 Test1 Test1 Test1
>> .......... .......... .......... .......... .......... .......... .....
>> Test2 Test2 Test2 Test2 Test2 Test2 Test2
>>
>> Now while searching, it returns the document only if I search for either
>> Test or Test1, and it ignores any text that comes after Test1.
>>
>> Can someone let me know if there is any text or character size limitation
>> for Field.Text in my code above? Also, I would like to store this text, as
>> I need to implement highlighting for the search text.
>>
>>
>
>
