
nate at verse
May 21, 2008, 2:35 PM
Post #3 of 4
(912 views)
Permalink
|
On Mon, May 19, 2008 at 12:06 AM, Marvin Humphrey <marvin [at] rectangular> wrote: > > On May 17, 2008, at 11:19 PM, Nathan Kurz wrote: >> Disk access, at around 100 MB/s, is almost too slow >> to consider, although if the tail of infrequent words is long enough >> it might be possible to save RAM by keeping the smallest posting lists >> on disk. > > But since the virtual memory system is going to cache that stuff anyhow, > wouldn't you prefer to leave memory management to the OS? While doing so would provide good average response time, it would have a very poor worst-case time. If a search for a very common word is rare, you might need to read a couple GB per shard. Thus if one wants a short maximum response time for expensive queries, one needs to keep the large stuff pegged in memory with something like mlock(). Leaving the small stuff to be buffered by the system might work, but I'm not sure the savings is worth the complexity. > It's probably also best to divide up machines by tasks. The machines which > are responsible for scanning through posting lists should probably not also > be responsible for retrieving stored docs and applying highlighting. Yes, they definitely should at least be logically split. In practice, it might work well to do these on the same machine, since scanning should be Memory Bandwidth limited and retrieving will be Disk I/O limited. > The > search machines should merely return document identifiers to the boss > node(s), which will then request docs from the storage cluster. Yes, and I'd love to have a simple protocol (likely HTTP) that works between these pieces. > Or maybe it's best to just let solutions arise on their own. Distributed > indexing is easier than distributed searching. I think this can be thought about later. There are many solutions here that could work. >> To keep the latency down, this may require hosting multiple smaller >> shards per machine. > > That could work. How about one shard per processor? I think that might > actually be a good way to exploit multiple processors -- better than > ithreads (which KS doesn't do anyhow). It won't be more than that, but depending on how processor efficient we can make things, I think it's going to be less than that. It's going to be different for different setups. I'm fuzzy on the details, but I think it's going to depend more on the number of sockets than the number of cores. Probably best to assume it needs to be configurable. > First, QueryFilter objects can't be passed between nodes because they're > implemented as cached BitVectors, which are too big to send over the network > with each search. To address this, you'd have to cook up special > SearchServer subclasses that kept their own QueryFilter objects around and > knew how to interpret special requests from the SearchClient. I think this is where my vision starts to differ from yours. The concept of a BitVector implies index centralization, where I'd rather have a lightweight Boss only loosely connected to each Searcher. Anything that was previously done with a QueryFilter could instead be done with an extra AndQuery clause against an unanalyzed field. > Second, KinoSearch's present multi-machine sorting implementation isn't very > well-done IMO. Not a worry. In the short-run this is unlikely to be a problem, and in the long run it can be optimized. > Last, Searchable's collect() API, soon to go public... > > $searcher->collect( query => $query, collector => $hit_collector ); > > ... is single-machine as well, because the HitCollector would have to be > serialized and sent to each node. I'm not following why HitCollector would need to be serialized, but this might be a limitation of my procedural world view. I would want things to be more loosely coupled than this. Rather than using the same KinoSearch architecture for both the Boss and the Worker, I would have them interact via a limited and interface (probably via HTTP, although possibly FCGI or memcached protocol). The Boss would receive a text query, and pass it on --- as text --- to a pool of Searchers. The Searchers, using the current KinoSearch implementation wrapped by a thin persistence layer, would perform the search and return a list of Documents and Scores. The Boss would select the highest results from the responses (skipping Workers that take too long), and (if desired) make a second set of requests to another pool to get highlighted excerpts. The Boss would be completely ignorant of how the Searchers search --- for all it knows, they could be running Lucene, making a SQL query, or even be proxying to Google. The Searchers would be completely independent of one another --- even document numbers would not need to be presumed unique. Are there disadvantages to this approach? I like its transparency, and that it doesn't add complexity to the KinoSearch core. Nathan Kurz nate [at] verse _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|