
nseco at dei
Nov 12, 2009, 11:04 AM
Post #7 of 7
Thanks again

> I am not really sure why it is enough for you to sort the first 50
> highest ranking hits, but if you only want to do this, sorting
> afterwards is quite straightforward.

Just to clarify, and I know it may seem strange: I'll mostly be
conducting "long" phrase (3 or 4 word) searches on the index, so I don't
expect many hits given that all I have are 5-grams. If I get more than 50
hits, the problem is solved and I need not do anything else. If I get
fewer than 50 hits, I want to look at the frequency counts of the hits I
got.

Anyway, I was just curious about the behavior. My fault. Thanks for all
the ideas.

> Another idea is to not index the count itself, but rather use the count
> as a boost factor for each document. The ranking algorithm of Lucene
> will then boost docs with higher counts to the top results. But notice:
> the boost is combined (multiplied) with the already available boosts,
> so the sorting will be a combination of boost and native Lucene rank.
> You can try to overcome that by scaling the boost very large (factor 10
> or so should be enough for 5-grams).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe [at] thetaphi
>
>> -----Original Message-----
>> From: Nuno Seco [mailto:nseco [at] dei]
>> Sent: Thursday, November 12, 2009 6:08 PM
>> To: java-user [at] lucene
>> Subject: Re: OutOfMemoryError when using Sort
>>
>> Ok. Thanks.
>>
>> The doc. says:
>> "Finds the top |n| hits for |query|, applying |filter| if non-null,
>> and sorting the hits by the criteria in |sort|."
>>
>> I understood that only the hits (50 in this case) for the current
>> search would be sorted...
>> I'll just do the ordering afterwards. Thank you for clarifying this
>> issue.
>>
>> --
>> Nuno Seco
>>
>> Jake Mannix wrote:
>>> Sorting utilizes a FieldCache: the forward lookup, i.e. the value a
>>> document has for a particular field (as opposed to the usual
>>> "inverted" way of looking at all documents which contain a given
>>> term), which lives in memory and takes up as much space as
>>> 4 bytes * numDocs.
>>>
>>> If you've indexed the entire 5Gram data set on one index, on one
>>> machine, you've got how many documents? Billions? This will take up
>>> billions * 4 bytes of RAM to do this sort.
>>>
>>> You need to shard your index (break it up onto multiple machines, do
>>> your sort distributed, and merge the results) if you want to do this
>>> sorting with any kind of performance.
>>>
>>> -jake
>>>
>>> On Thu, Nov 12, 2009 at 7:57 AM, Nuno Seco <nseco [at] dei> wrote:
>>>
>>>> Hello List.
>>>>
>>>> I'm having a problem when I add a Sort object to my searcher:
>>>> docs = searcher.search(parser.parse(search), null, 50, sort);
>>>>
>>>> Every time I execute a query I get an OutOfMemoryError exception.
>>>> But if I execute the query without the Sort object it works fine.
>>>>
>>>> Let me briefly explain how my index is structured.
>>>> I'm indexing the Google 5Grams
>>>> (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html).
>>>> The index just has two fields:
>>>> data = new Field("data", tokens[0], Field.Store.YES,
>>>>     Field.Index.ANALYZED, Field.TermVector.NO);
>>>> count = new Field("count", tokens[1], Field.Store.YES,
>>>>     Field.Index.NO, Field.TermVector.NO);
>>>>
>>>> The data corresponds to the 5-gram, e.g. "my business manager
>>>> informed me", and the count is simply an integer that represents the
>>>> frequency of the n-gram.
>>>>
>>>> The index size after optimization is 63G.
>>>>
>>>> If I do not store the data field using:
>>>> data = new Field("data", tokens[0], Field.Store.NO,
>>>>     Field.Index.ANALYZED, Field.TermVector.NO);
>>>> the total size drops to 32G.
>>>>
>>>> But using either index with the Sort object causes the exception.
>>>> I'm creating the Sort object like:
>>>> Sort sort = new Sort(new SortField("count", SortField.INT));
>>>>
>>>> Note: even without using the Sort object I still need to pump the
>>>> JVM to 2G (-Xmx2048m). But that's ok...
>>>>
>>>> So, basically what I want is to order those first 50 hits I get
>>>> according to their frequency counts (count field).
>>>>
>>>> I'm using:
>>>> java version "1.6.0_16" (64 bit)
>>>> lucene 2.9.1
>>>> linux ext3 FS
>>>> linux kernel 2.6.31-15
>>>>
>>>> Can anybody help me or point me in the right direction?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Nuno Seco
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>>>> For additional commands, e-mail: java-user-help [at] lucene
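
For the "sort afterwards" route discussed in this thread, here is a minimal sketch in plain Java. It assumes the (at most 50) hits from the unsorted search have already been fetched and that each hit's stored "data" and "count" values have been read into memory. The Hit class, the sortByCountDescending method and the sample counts are hypothetical, made up for illustration; none of it is Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortHitsByCount {

    /** Hypothetical in-memory view of one hit: the stored 5-gram and its stored count. */
    static final class Hit {
        final String data;
        final long count;

        Hit(String data, long count) {
            this.data = data;
            this.count = count;
        }
    }

    /** Returns a new list ordered by count, highest first; the input list is left untouched. */
    static List<Hit> sortByCountDescending(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        sorted.sort(Comparator.comparingLong((Hit h) -> h.count).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up hits and counts, standing in for the top results of an unsorted search.
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit("my business manager informed me", 120L));
        hits.add(new Hit("my business manager told me", 450L));
        hits.add(new Hit("my business manager called me", 75L));

        for (Hit h : sortByCountDescending(hits)) {
            System.out.println(h.count + "\t" + h.data);
        }
    }
}
```

Since only the handful of returned hits is sorted, this should not need the per-document FieldCache that Jake describes, so memory use stays proportional to the number of hits rather than to the size of the index.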