
Mailing List Archive: Lucene: Java-User

How to index a lot of fields (without FileNotFoundException: Too many open files)

 

 



pbm-rico at gmx

Apr 27, 2007, 4:03 PM

Post #1 of 8
How to index a lot of fields (without FileNotFoundException: Too many open files)

Hello,

What would be the best strategy to support an index with thousands or even hundreds of thousands of individual field names?

I have client applications that create a lot of key/value type data. I use the key as the document field name, so I end up with _a lot_ of .f<m> files, and eventually my application breaks with: FileNotFoundException: Too many open files.

I searched this list and it seems like others had this problem before, but I could not find a solution.

From what I read in the Lucene docs, these .f<m> files store the normalization factor for the corresponding field. What exactly is this used for and more importantly, can this be disabled so that the files are not created in the first place?

Any advice is appreciated.

Thanks,
Rico.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


hossman_lucene at fucit

Apr 27, 2007, 5:25 PM

Post #2 of 8
Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

: From what I read in the Lucene docs, these .f<m> files store the
: normalization factor for the corresponding field. What exactly is this
: used for and more importantly, can this be disabled so that the files
: are not created in the first place?

field norms are primarily used for length normalization, but they also
store field and document boosts ... more info can be found in the scoring
and Similarity docs.

if you don't care about these things, there is an omitNorms option on the
Field class that can be used to not only reduce the number of open
files you need, but also make your index smaller on disk, and reduce your
memory usage. (i believe it was added in 1.9)



-Hoss




DORONC at il

Apr 27, 2007, 10:00 PM

Post #3 of 8
Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

Just in case norms info cannot be spared, note that since Lucene 2.1 norms
are maintained in a single file, no matter how many fields there are.

However, due to a bug in 2.1, this did not prevent the too many open files
problem. The bug has already been fixed, but the fix is not yet released. For
more details on this fix see LUCENE-821.

Doron



pbm-rico at gmx

Apr 30, 2007, 8:50 AM

Post #4 of 8
Re: Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

Thanks for your reply.

We are still using Lucene v1.4.3 and I'm not sure if upgrading is an option. Is there another way of disabling length normalization/document boosts to get rid of those files?

Thanks,
Rico



mike.klaas at gmail

Apr 30, 2007, 3:08 PM

Post #5 of 8
Re: Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

On 4/30/07, pbm-rico [at] gmx <pbm-rico [at] gmx> wrote:
> Thanks for your reply.
>
> We are still using Lucene v1.4.3 and I'm not sure if upgrading is an option. Is there another way of disabling length normalization/document boosts to get rid of those files?

Why not raise the limit of open files on your system? (man ulimit)

-Mike



pbm-rico at gmx

Apr 30, 2007, 3:22 PM

Post #6 of 8
Re: Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

I thought about using ulimit, but it does not scale. In the scenario that the app has to support, client applications could create hundreds of thousands of unique properties, which would result in just as many indexable fields.

Based on previous answers, the way out of this problem while still being scalable would be to use 2.1 where there is only 1 norm file.

However, it does not look like upgrading is an option, so I wonder if my current approach of mapping a property that a client app creates to one field name is workable at all. Maybe I have to introduce some sort of mapping of client properties to a fixed number of indexable fields.

...or modify the 1.4.3 code myself to get rid of norm files.
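
One way the "fixed number of indexable fields" idea could look is sketched below. This is only an illustration under stated assumptions, not something taken from the thread or from Lucene: the class name `FieldBuckets`, the `prop_` prefix, and the choice of hashing are all hypothetical. It maps any client property name onto one of a fixed pool of field names, so the number of distinct fields in the index stays bounded no matter how many properties clients create.

```java
// Hypothetical helper (not part of Lucene): maps an unbounded set of client
// property names onto a fixed pool of index field names.
final class FieldBuckets {
    private final int bucketCount;

    FieldBuckets(int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        this.bucketCount = bucketCount;
    }

    // Returns the index field name for a client property, such as "prop_7".
    // Math.floorMod keeps the bucket non-negative even when hashCode() is negative.
    String fieldFor(String propertyName) {
        int bucket = Math.floorMod(propertyName.hashCode(), bucketCount);
        return "prop_" + bucket;
    }
}
```

Because several properties can land in the same bucket, the indexed value would still need to carry the property name in some form so that colliding properties can be told apart at query time.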

-Rico



hossman_lucene at fucit

Apr 30, 2007, 4:45 PM

Post #7 of 8
Re: Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

: However, it does not look like upgrading is an option, so I wonder if my
: current approach of mapping a property that a client app creates to one
: field name is workable at all. Maybe I have to introduce some sort of
: mapping of client properties to a fixed number of indexable fields.
:
: ...or modify the 1.4.3 code myself to get rid of norm files.

if you can't upgrade past 1.4.3, but you can modify 1.4.3 to change
things (like file formats), why not "modify" your 1.4.3 install to be a
2.1 install?




-Hoss




paul.elschot at xs4all

May 1, 2007, 12:31 AM

Post #8 of 8
Re: How to index a lot of fields (without FileNotFoundException: Too many open files)

On Tuesday 01 May 2007 00:22, pbm-rico [at] gmx wrote:
> I thought about using ulimit, but it does not scale. In the scenario that
> the app has to support, client applications could create hundreds of
> thousands of unique properties, which would result in this many indexable
> fields.
>
> Based on previous answers, the way out of this problem while still being
> scalable would be to use 2.1 where there is only 1 norm file.
>
> However, it does not look like upgrading is an option, so I wonder if my
> current approach of mapping a property that a client app creates to one field
> name is workable at all. Maybe I have to introduce some sort of mapping of
> client properties to a fixed number of indexable fields.
>
> ...or modify the 1.4.3 code myself to get rid of norm files.

Do you need Lucene's idf computations for each of these client fields
separately?
If not, another way is to move the names of the client fields into the
term value, and use a single special field for all of them.
If you do that, you'll also have to move the client field names
in the queries from the field name to the term. This can easily be done by
overriding one of the methods in QueryParser.
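
A minimal sketch of the single-field idea follows. It is an assumption-laden illustration, not the actual QueryParser override: the class name `ClientFieldTerms`, the field name `props`, and the `|` separator are all made up for the example. It also assumes the combined term is indexed as one untokenized term, so that an analyzer does not split it at the separator.

```java
// Hypothetical helper (not part of Lucene): folds the client field name into
// the term text so that every client field shares one real index field.
final class ClientFieldTerms {
    // Assumed name of the single index field that holds all client data.
    static final String INDEX_FIELD = "props";
    // Assumed separator; client field names must not contain it.
    static final char SEPARATOR = '|';

    // Builds the term text stored in INDEX_FIELD, such as "color|red".
    static String toTerm(String clientField, String value) {
        if (clientField.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("field name contains separator: " + clientField);
        }
        return clientField + SEPARATOR + value;
    }

    // Rewrites a query clause on (clientField, value) into the pair
    // { index field, term text } that should be searched instead.
    static String[] rewriteClause(String clientField, String value) {
        return new String[] { INDEX_FIELD, toTerm(clientField, value) };
    }

    // Recovers the client field name from a stored term.
    static String clientFieldOf(String term) {
        int i = term.indexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("not a client field term: " + term);
        }
        return term.substring(0, i);
    }
}
```

With this scheme a clause such as `color:red` would be searched as the term `color|red` in the `props` field, and a query parser override could delegate to something like `rewriteClause` before building the term query. The index then has a single field, and a single set of norms, regardless of how many client properties exist.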

Regards,
Paul Elschot

