
nate at verse
Sep 15, 2007, 1:54 AM
Post #10 of 10
Re: KinoSearch::Docs::Cookbook::ReusingSearchers
On 9/14/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> > I think that coming up with a good way of returning the field value to
> > the requester is going to be a better final solution.
>
> I don't think we can get optimum performance until the whole term
> dictionary resides in RAM. Certainly to solve the sort cache
> problem, we need at least the whole sort field in RAM.

I think I agree with this, although I think it's true only to the same extent that the PostingLists need to be in RAM most of the time. The amount of data that one can transfer in from disk per request is very small, so anything else you want to use had better be cached.

Take a situation like Henry's: he wants average response time of less than a second. During this time he can read at most 100 MB from disk. If he is indexing unstopped English, the vector for the term 'the' is going to be something like 1/20th of the entire index size. Thus if his index is greater than 2 GB and he does not have 'the' cached, he's not going to be able to read it off disk in time.

So assume he does have it cached, along with as many terms as he needs to get the largest uncached term to be less than 100 MB. I'm not quite sure how the curve really goes, but at a glance (http://www.comp.lancs.ac.uk/ucrel/bncfreq/) it looks like the top 20 terms are going to be about 1/4th of the index size, and that the 20th term is about 1/200th on its own. So with 100 GB of index per node, you could cache 25 GB of top terms and any single term you had to read would be less than 100 MB.

I don't know how many of his documents will fill 100 GB of index, but he mentioned his total corpus was 30 million, so let's use that number as a high guess. 30 million documents * 30 bytes per doc for the sorted field's value is less than 1 GB. So while I agree with you that reading that from disk in less than a second is out of the question, I feel like it's small compared to the size of the other stuff that already needs to be in cache.
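
For anyone who wants to check the numbers, here is a rough sketch of the same back-of-envelope arithmetic. The fractions and sizes are the guesses quoted above; the function and variable names are made up for illustration, and I'm assuming decimal units (1 GB = 1000 MB). Note that with the 1/200th guess taken literally, the 20th term of a 100 GB index works out to roughly 500 MB, so to get every uncached term under 100 MB you would probably need to cache further down the frequency list than the top 20.

```python
# Back-of-envelope sizing, using the rough guesses from the post.
# All names here are hypothetical; units are decimal bytes.
MB = 1_000_000
GB = 1_000 * MB

READ_BUDGET = 100 * MB   # assumed max disk read within a one-second request
THE_DENOM = 20           # 'the' is guessed at ~1/20th of the index
TOP20_DENOM = 4          # top 20 terms are guessed at ~1/4th of the index
TERM20_DENOM = 200       # the 20th term is guessed at ~1/200th of the index


def posting_bytes(index_bytes, denom):
    """Size of the postings taking up 1/denom of the index."""
    return index_bytes / denom


def break_even_index_bytes(denom, budget=READ_BUDGET):
    """Index size at which a 1/denom term exactly fills the read budget."""
    return budget * denom


def sort_cache_bytes(num_docs, bytes_per_value):
    """RAM needed to hold one sort-field value per document."""
    return num_docs * bytes_per_value


index = 100 * GB
summary = {
    # Above this index size, an uncached 'the' can't be read in time.
    "the_break_even_gb": break_even_index_bytes(THE_DENOM) / GB,
    # Cache needed to hold the top 20 terms of a 100 GB index.
    "top20_cache_gb": posting_bytes(index, TOP20_DENOM) / GB,
    # Size of the 20th term in a 100 GB index, under the 1/200th guess.
    "term20_mb": posting_bytes(index, TERM20_DENOM) / MB,
    # 30 million docs * 30 bytes per sort value.
    "sort_cache_gb": sort_cache_bytes(30_000_000, 30) / GB,
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value:g}")
```

The point stands either way: the sort field's values (under 1 GB) are small next to the tens of GB of postings that already have to be cached.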

> Another avenue of attack might be to load the sort field's .lex files
> into RAM, but not decompress them. Then we'd use an InStream with an
> inner RAMFileDes (instead of an FSFileDes). No more disk seeks.

I think it's simpler than you are thinking: they are not that big, and if you read them often the system is going to cache them for you. Are you familiar with how the virtual file system works in Linux?

http://www.linux-security.cn/ebooks/ulk3-html/0596005652/understandlk-CHP-16.html

All otherwise unused memory in the system is used to cache the filesystem, so the second time a file is read the actual hard disk usually isn't touched at all. And since we are limited as to how much disk information we can read per search, anything in the cache is likely to stay there if we use it every few searches.

---

So while I think I understand the problem you are trying to solve, I'm not sure that the straightforward solution (do the sort locally and send along the value) is really that unwieldy.

Nathan Kurz
nate [at] verse

ps. There's a wonderful description of the VFS that made me think of your BoilerPlater implementation when I read it:
http://www.spinellis.gr/pubs/inbook/beautiful_code/html/Spi07g.html

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch