
marvin at rectangular
May 23, 2007, 8:06 AM
Post #4 of 4
On May 23, 2007, at 3:46 AM, Roger Dooley wrote:

>> Building the sort cache is a one-time cost if you reuse the
>> Searcher/Reader.
>
> Is that per new search? Any new search needs to come back quickly
> as well.

The first search will always be sluggish by comparison. If your application logic allows it, you can warm up a Searcher with a dummy query.

KS makes the engineering tradeoff of requiring significant caching up-front to reduce costs of later searches. It's another way that the behavior differs from that of the relational database systems many people are familiar with.

>>> 4.90 0.369 0.369 155 0.0024 0.0024
>>> KinoSearch::Index::DocVector::_extract_tv_cache
>>> 4.30 0.324 0.914 155 0.0021 0.0059
>>> KinoSearch::Highlight::Highlighter::_gen_excerpt
>>> 3.11 0.234 0.234 4673 0.0001 0.0001
>>> KinoSearch::Store::InStream::lu_read
>> Those are all fetch and highlight related. How large are your
>> documents on average? The numbers seem high. You might consider
>> not storing only a portion as a separate field.
>
> The documents are probably around 1k. We have about 1.5 million of
> them.

OK, that's not so big.

Interesting -- the time to retrieve and highlight dominates this particular search. It's hard to say whether this would hold true over most searches, though. We're still in need of benchmarkers for search-time.

Theoretically, your one-off benchmark numbers bode well for scalability. The cost of fetching/highlighting is roughly proportional to the number of documents retrieved per search rather than the size of the index, once you allow for slightly more hard disk seek time when retrieving docs and doc vectors that are more spaced out. The true limiting factor for most people, scalability-wise, is the time it takes to score hits, and that's not even registering. However, if this was a simple search for one comparatively rare term, that would mislead us. It's anecdotal evidence and we can't draw conclusions.
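To make the warm-up suggestion above concrete, here is a minimal sketch in Python rather than Perl. It does not use the KinoSearch API: ToySearcher, its sort_caches dict, and the cache_builds counter are hypothetical names invented for illustration. The only point is that a per-field sort cache built lazily on the first search can be paid for by a throwaway query, so long as the same searcher object is reused afterwards.

```python
class ToySearcher:
    """Hypothetical stand-in for a searcher that caches sort keys per field."""

    def __init__(self, docs):
        self.docs = docs          # list of dicts, one per document
        self.sort_caches = {}     # field name -> list of sort keys, by doc number
        self.cache_builds = 0     # how many times a cache had to be built

    def _sort_cache(self, field):
        # Build the per-field cache on first use only; later calls reuse it.
        if field not in self.sort_caches:
            self.sort_caches[field] = [doc[field] for doc in self.docs]
            self.cache_builds += 1
        return self.sort_caches[field]

    def search(self, term, sort_by):
        keys = self._sort_cache(sort_by)
        # Naive substring match; a real engine would consult an inverted index.
        hits = [i for i, doc in enumerate(self.docs) if term in doc["body"]]
        return sorted(hits, key=lambda i: keys[i])


# Toy corpus: the "date" value decreases as the doc number increases.
docs = [{"body": "doc %d kino" % i, "date": 1000 - i} for i in range(1000)]
searcher = ToySearcher(docs)

searcher.search("warmup-dummy", sort_by="date")     # warm-up pays the cache cost
first = searcher.search("kino", sort_by="date")     # reuses the cache
second = searcher.search("doc 7", sort_by="date")   # reuses the cache
```

In this toy the cache is built once for the "date" field no matter how many searches follow. Creating a new ToySearcher for every request would rebuild it each time, which is the situation the reuse advice is meant to avoid.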
In apps where the search *has* to be performed cold, though, scalability may be limited by cache-loading time. The more fields you enable sorting against, the higher this cost.

Looking to the future... if you're sorting by date, the addition of an epoch fixed-length field type to KS could cut down load-time, as it would be less costly to unpack than a text field. However, I don't expect to get to that task soon.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
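
As a rough illustration of why a fixed-length epoch field could be cheaper to unpack than a text field, the Python sketch below contrasts the two layouts. It is not KinoSearch's on-disk format: the 8-byte big-endian integer encoding, the 1-byte length prefix, and all four function names are assumptions made for the example.

```python
import struct


def pack_epochs(epochs):
    # Fixed-length layout: every value occupies exactly 8 bytes (big-endian, signed).
    return struct.pack(">%dq" % len(epochs), *epochs)


def unpack_epochs(buf):
    # A single call decodes the whole column; record i starts at byte 8 * i.
    return list(struct.unpack(">%dq" % (len(buf) // 8), buf))


def pack_text_dates(dates):
    # Variable-length layout: a 1-byte length followed by the UTF-8 bytes.
    out = bytearray()
    for d in dates:
        raw = d.encode("utf-8")
        out.append(len(raw))
        out += raw
    return bytes(out)


def unpack_text_dates(buf):
    # Must walk the buffer record by record and decode each string.
    dates, pos = [], 0
    while pos < len(buf):
        n = buf[pos]
        dates.append(buf[pos + 1:pos + 1 + n].decode("utf-8"))
        pos += 1 + n
    return dates


# Arbitrary sample values; the two lists are not meant to correspond.
epochs = [1179907560, 1179918360]
text_dates = ["2007-05-23T08:06:00", "2007-05-23T11:06:00"]

epoch_buf = pack_epochs(epochs)
text_buf = pack_text_dates(text_dates)
```

The fixed-length column decodes in one call and allows direct offsets, while the text column needs a per-record loop with a string decode on each step. Whether that difference matters in practice would have to be measured.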