nate at verse
Sep 15, 2007, 1:54 AM
Post #10 of 10
On 9/14/07, Marvin Humphrey <marvin [at] rectangular> wrote:
> > I think that coming up with a good way of returning the field value to
> > the requester is going to be a better final solution.
> I don't think we can get optimum performance until the whole term
> dictionary resides in RAM. Certainly to solve the sort cache
> problem, we need at least the whole sort field in RAM.
I think I agree with this, although I think it's true to the same
extent that the PostingLists need to be in RAM most of the time. The
amount of data that one can transfer in from disk per request is very
small, so anything else you want to use had better be cached.
Take a situation like Henry's: he wants average response time less
than a second. During this time he can read at most 100 MB from disk.
If he is indexing unstopped English, the vector for the term 'the'
is going to be something like 1/20th the entire index size. Thus if
his index is greater than 2GB and he does not have 'the' cached, he's
not going to be able to read it off disk in time.
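
A quick check of that arithmetic, as a minimal sketch. The 100 MB/s read rate and the 1/20 share for 'the' are just the rough estimates from the paragraph above; the function and constant names are made up for illustration.

```python
# Rough estimates from the discussion above, not measured values.
DISK_MB_PER_SEC = 100      # assumed sustained read rate
TIME_BUDGET_SEC = 1        # target response time
THE_SHARE_DENOM = 20       # 'the' is taken to be ~1/20th of the index


def read_budget_mb(rate_mb_per_sec, seconds):
    """Most data that can come off disk within the time budget."""
    return rate_mb_per_sec * seconds


def max_index_mb(budget_mb, share_denom):
    """Largest index for which a term taking 1/share_denom of it
    still fits inside the read budget."""
    return budget_mb * share_denom


budget = read_budget_mb(DISK_MB_PER_SEC, TIME_BUDGET_SEC)
limit = max_index_mb(budget, THE_SHARE_DENOM)
print(f"read budget: {budget} MB, index limit: {limit} MB")
```

With these inputs the limit comes out at 2000 MB, which is the 2GB threshold mentioned above.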
So assume he does have it cached, along with as many terms as he
needs to get the largest uncached term to be less than 100 MB. I'm not
quite sure how the curve really goes, but at a glance
(http://www.comp.lancs.ac.uk/ucrel/bncfreq/) it looks like the top 20
terms are going to be about 1/4th the index size, and that the 20th
term is about 1/200th on its own. So with 100 GB of index per node,
you could cache 25 GB of top terms and any single term you had to read
would be less than 100 MB.
I don't know how many of his documents will fill 100 GB of index, but
he mentioned his total corpus was 30 million, so let's use that number
as a high guess. 30 million documents * 30 bytes per doc for the
sorted field's value is less than 1 GB. So while I agree with you
that reading that from disk in less than a second is out of the
question, I feel like it's small compared to the size of the other
stuff that already needs to be in cache.
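
The sort-field estimate can be checked the same way. This is only a sketch: 30 million documents and 30 bytes per value are the guesses from the paragraph above, and the names are hypothetical.

```python
# High-end guesses from the discussion above.
DOC_COUNT = 30_000_000     # total corpus size used as an upper bound
BYTES_PER_VALUE = 30       # assumed size of one sort-field value


def sort_field_bytes(doc_count, bytes_per_value):
    """Total RAM needed to hold one sort-field value per document."""
    return doc_count * bytes_per_value


total = sort_field_bytes(DOC_COUNT, BYTES_PER_VALUE)
print(f"{total} bytes = {total / 10**9} GB")
```

That is 900,000,000 bytes, so a bit under 1 GB as stated.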
> Another avenue of attack might be to load the sort field's .lex files
> into RAM, but not decompress them. Then we'd use an InStream with an
> inner RAMFileDes (instead of an FSFileDes). No more disk seeks.
I think it's simpler than you are thinking: they are not that big,
and if you read them often the system is going to cache them for you.
Are you familiar with how the virtual file system works in Linux?
All unused memory in the system is used to cache the filesystem, so
the second time a file is read the actual hard disk isn't touched at
all. And since we are limited as to how much disk information we can
read per search, anything in the cache is likely to stay there if we
use it every few searches.
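
One way to watch this happen is to time two consecutive reads of the same file. The sketch below is illustrative only: the helper name is made up, timings depend on the machine, and because the demo file has just been written it is probably already cached, so both reads are likely to be fast. To see a cold first read you would point it at a large file that has not been touched recently.

```python
import os
import tempfile
import time


def timed_read(path):
    """Read the whole file and return (contents, seconds taken)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    return data, time.perf_counter() - start


# Create an 8 MB scratch file purely for the demonstration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

try:
    first, t1 = timed_read(path)
    second, t2 = timed_read(path)   # expected to be served from the cache
    print(f"first read: {t1:.4f}s, second read: {t2:.4f}s")
finally:
    os.remove(path)
```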
So while I think I understand the problem you are trying to solve, I'm
not sure that the straightforward solution (do the sort locally and
send along the value) is really that unwieldy.
nate [at] verse
ps. There's a wonderful description of the VFS that made me think of
your BoilerPlater implementation when I read it:
KinoSearch mailing list
KinoSearch [at] rectangular