
marvin at rectangular
Aug 29, 2006, 7:16 PM
Post #4 of 4
On Aug 28, 2006, at 4:04 PM, Simon Wistow wrote:

> We're currently looking at about 10 Tb of data

FYI, I don't know of anyone who's built a KS-based search engine on that scale. I'd really like to see somebody make the attempt, but I can't say what will happen. ;)

If the search-time performance turns out to be too slow, eventually KS will have a cluster system available. Multiple machines would need to communicate with each other prior to the main search processing: each term needs to get an IDF in order for scoring to proceed, and you have to know the total size of the collection and the number of docs the term appears in to calculate it. If a search requested 10 docs, each machine would produce 10, then the top 10 scorers out of the aggregate group would be your final search results. If you want to see how this can work, take a look at Lucene's MultiSearcher class.

Working up such a cluster system is not currently at the top of my to-do list. (Finding steady work is, followed by FieldCache, Lucy, etc...)

> to index with approx 16 new updates a second which we'd like to merge
> within about 40 seconds so any help would be gratefully appreciated.

The only way you're going to be able to manage this is if a new option gets added to finish(), and even then it will be challenging. This would also hold true for Lucene, because the problem is the same: inverted index data structures are optimized for fast searching, not instant updates.

Merging of segments takes an unpredictable amount of time. Usually it will be pretty quick, but every once in a while, when large segments get merged, it will take a while. Furthermore, when you update an index, you need to open a new Searcher and warm up its caches. That cost is much easier to predict, though, as the warm-up time tracks the size of the index closely.

Incremental indexing in KS/Lucene works something like this:

Add a book, generate an index. Add another chapter, generate a second index. Now you need to leaf through two indexes. Add another chapter, generate a third index. Now you've got three indexes to look through each time you search. ...

Those indexes get merged periodically, and the interleaving process takes time proportional to the size of the segments being merged. The more segments you have, the longer it takes to search, though the time is still much more dependent on absolute size of index than on number of segments.

Optimizing the index via...

    $invindexer->finish( optimize => 1 );

... reads all existing segments and dumps them into the one currently being written, which is costly at index-time but produces a single-segment index for fastest searching.

Ordinarily, KS tries to keep segments around in something approximating a Fibonacci series (as an aside, this is different from Lucene) -- the idea being to minimize the number of merges, while still providing good search-time performance.

What we would need to do is STOP KinoSearch from merging any existing segments into the segment currently in progress (that's the API that's missing). Then every once in a while, you'd force-optimize the index -- probably off-line to a copy which gets swapped in when the process completes. It's a reasonable strategy, but the number of updates in your spec is quite substantial.

Cheers,

Marvin Humphrey
--
I'm looking for a part time job.
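
For illustration, here is a rough Python sketch of the two-phase cluster search described in the message above: first gather collection-wide statistics so every machine scores with the same IDF, then have each machine return its own top 10 and keep the best 10 of the aggregate. This is not KinoSearch or Lucene code. The `Shard` class, the `cluster_search` function, and the particular IDF formula (1 + ln(N / (df + 1))) are assumptions made up for the example, and the tf * idf score is a deliberately simplified stand-in for real scoring.

```python
import heapq
import math
from collections import Counter


class Shard:
    """Hypothetical stand-in for one machine's slice of the collection."""

    def __init__(self, docs):
        # docs maps doc_id -> list of tokens; store per-doc term frequencies.
        self.docs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    def doc_count(self):
        return len(self.docs)

    def doc_freq(self, term):
        # Number of local docs that contain the term at least once.
        return sum(1 for tf in self.docs.values() if term in tf)

    def top_k(self, terms, idf, k):
        # Score local docs with the *global* IDF values, keep the k best.
        scored = []
        for doc_id, tf in self.docs.items():
            score = sum(tf[t] * idf[t] for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)


def cluster_search(shards, terms, k=10):
    # Phase 1: collect collection size and per-term doc counts from all shards.
    total_docs = sum(s.doc_count() for s in shards)
    if total_docs == 0:
        return []
    idf = {}
    for t in terms:
        df = sum(s.doc_freq(t) for s in shards)
        # Assumed IDF formula, chosen only for the sketch.
        idf[t] = 1.0 + math.log(total_docs / (df + 1))

    # Phase 2: each shard produces k hits; the top k of the aggregate win.
    candidates = []
    for s in shards:
        candidates.extend(s.top_k(terms, idf, k))
    return heapq.nlargest(k, candidates)


if __name__ == "__main__":
    a = Shard({"a1": ["cat", "dog"], "a2": ["cat", "cat"]})
    b = Shard({"b1": ["dog"], "b2": ["fish"]})
    for score, doc_id in cluster_search([a, b], ["cat"], k=10):
        print(doc_id, round(score, 3))
```

The point of phase 1 is that a term's weight is the same on every machine, so scores coming back from different shards are directly comparable when they are merged in phase 2.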