
Mailing List Archive: Lucene: Java-User

Term Frequency vector consumes memory

 

 



emailgane at yahoo

Jun 30, 2009, 12:37 AM

Post #1 of 4 (435 views)
Term Frequency vector consumes memory

At the end of each day, I build statistics on the top indexed terms. I enabled term frequency vectors for a single field, and it works fine: I am able to get the top terms and their frequencies. However, it consumes a huge amount of RAM. My index is 5 GB and holds 8 million records. Without term vectors enabled, I could index up to 17 GB with 40 million records.

When an IndexReader / Searcher is opened, does it load all the term vector frequencies?

Suppose I have enabled this option and indexed, say, 5 GB. Now I don't want the Reader / Searcher to load term vectors. Can I switch this feature off without re-indexing?

Regards
Ganesh

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


gsingers at apache

Jun 30, 2009, 9:18 AM

Post #2 of 4 (399 views)
Re: Term Frequency vector consumes memory [In reply to]

In Lucene, a Term Vector is a specific thing that is stored on disk
when creating a Document and Field. It is optional and off by
default. It is separate from being able to get the term frequencies
for all the docs in a specific field. The former is decided at
indexing time and there is no way to remove it w/o reindexing.
Furthermore, it is not loaded into memory by the IndexReader. Term
Frequencies are accessed via the TermDocs.
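
To make the distinction concrete, here is a toy model in plain Java. It does not use Lucene at all: the class `ToyIndex` and its fields are hypothetical names made up for illustration, and the real on-disk formats are far more compact. It only sketches the idea that the inverted index (term to documents, with frequencies) always exists, while the per-document term vector is an extra, optional copy of the same counts chosen at indexing time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only; none of these names are Lucene classes. */
final class ToyIndex {
    // Inverted index: term -> (docId -> frequency of the term in that document).
    final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
    // Optional per-document view: docId -> (term -> frequency).
    final Map<Integer, Map<String, Integer>> termVectors = new HashMap<>();
    private final boolean storeTermVectors;

    ToyIndex(boolean storeTermVectors) {
        this.storeTermVectors = storeTermVectors;
    }

    /** Tokenizes on whitespace and records the counts for one document. */
    void add(int docId, String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                    .put(docId, e.getValue());
        }
        // The extra per-document copy is only kept when asked for at indexing time.
        if (storeTermVectors) {
            termVectors.put(docId, freqs);
        }
    }

    /** Total occurrences of a term across all documents, from postings alone. */
    int totalFrequency(String term) {
        Map<Integer, Integer> docs = postings.get(term);
        if (docs == null) {
            return 0;
        }
        int sum = 0;
        for (int f : docs.values()) {
            sum += f;
        }
        return sum;
    }
}
```

In this sketch, `totalFrequency` works whether or not `storeTermVectors` was set, which mirrors the point that term frequencies for a field are available without term vectors.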

Can you clarify a bit more what you are looking to do? Perhaps some
sample code will help demonstrate what you'd like to turn off, as I am
not clear on your question.

Cheers,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search




emailgane at yahoo

Jun 30, 2009, 10:39 PM

Post #3 of 4 (387 views)
Re: Term Frequency vector consumes memory [In reply to]

Thanks for your reply.

My requirement is to fetch the list of the most frequent terms indexed in a day. I used the logic described in the answer linked below:
http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index

I enabled term vectors for one field, indexed the content, and I am able to retrieve the list of top indexed terms for a day / date range.
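
As a rough illustration of the aggregation step only, the sketch below sums per-document term frequencies and keeps the top N with a small min-heap. It is plain Java and an assumption about how such a step could look, not the code from the linked answer. The class `TopTerms`, its methods, and the idea that each document's term vector has already been read into a `Map<String, Integer>` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper; not part of Lucene. */
final class TopTerms {
    /** Sums per-document term frequencies into one term -> total map. */
    static Map<String, Integer> aggregate(List<Map<String, Integer>> perDocFreqs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> doc : perDocFreqs) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    /** Returns the n most frequent terms, highest first; ties are broken alphabetically. */
    static List<String> topN(Map<String, Integer> totals, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap: the weakest candidate sits at the head and is evicted first.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byFreq = Integer.compare(a.getValue(), b.getValue());
            return byFreq != 0 ? byFreq : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // never hold more than n candidates
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // heap drains weakest first
        return result;
    }
}
```

The heap holds at most N entries, but the `totals` map still grows with the number of distinct terms in the date range, so that map is the part to watch for memory.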

When an IndexReader / Searcher is opened, does it load all the term vector frequencies?

Suppose I have enabled this option and indexed, say, 5 GB. Now I don't want the Reader / Searcher to load term vectors. Can I switch this feature off without re-indexing?

Regards
Ganesh



gsingers at apache

Jul 2, 2009, 5:45 AM

Post #4 of 4 (370 views)
Re: Term Frequency vector consumes memory [In reply to]

On Jul 1, 2009, at 1:39 AM, Ganesh wrote:

> Thanks for your reply.
>
> My requirement is to fetch the list of the most frequent terms indexed
> in a day. I used the logic described in the answer linked below:
> http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index
>
> I enabled term vectors for one field, indexed the content, and I am able
> to retrieve the list of top indexed terms for a day / date range.
>
> When an IndexReader / Searcher is opened, does it load all the term
> vector frequencies?

No, it won't. Term Vecs are stored on disk much like the stored fields.

>
> Suppose I have enabled this option and indexed, say, 5 GB. Now I
> don't want the Reader / Searcher to load term vectors. Can I switch
> this feature off without re-indexing?

I suppose, although the approach you are using seems to rely on a
custom Collector, so you would have to stop using that Collector.

Storing term vectors will indeed make your index much bigger, but it
shouldn't affect memory much unless you are caching, which probably
isn't a bad idea anyway.
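
If caching does turn out to be where the memory goes, one common way to keep it in check is to bound the cache. The sketch below is plain Java built on `LinkedHashMap` and is only an assumption about what such a cache could look like. `BoundedCache` is a hypothetical name, and nothing here is a Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical least-recently-used cache that never holds more than maxEntries. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

A cache like this could hold, for example, per-document term frequency maps keyed by document id, so that memory use is capped by `maxEntries` rather than growing with the index.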




