nate at verse
May 21, 2008, 2:35 PM
Post #3 of 4
On Mon, May 19, 2008 at 12:06 AM, Marvin Humphrey
<marvin [at] rectangular> wrote:
> On May 17, 2008, at 11:19 PM, Nathan Kurz wrote:
>> Disk access, at around 100 MB/s, is almost too slow
>> to consider, although if the tail of infrequent words is long enough
>> it might be possible to save RAM by keeping the smallest posting lists
>> on disk.
> But since the virtual memory system is going to cache that stuff anyhow,
> wouldn't you prefer to leave memory management to the OS?
While doing so would provide good average response time, it would have
a very poor worst-case time. If searches for a very common word are
rare, its posting list will have fallen out of the OS cache, and you
might need to read a couple of GB per shard from disk. Thus if one
wants a short maximum response time for expensive queries, one needs
to keep the large stuff pegged in memory with something like mlock().
Leaving the small stuff to be buffered by the system might work, but
I'm not sure the savings are worth the complexity.
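To put a number on that worst case, here's a back-of-the-envelope
sketch using the round figures from this thread (~100 MB/s sequential
disk throughput, "a couple GB" per shard --- assumptions, not
measurements):

```python
# Worst-case time to page in an uncached posting list from disk,
# using the round figures from the discussion (not benchmarks).
disk_throughput_mb_per_s = 100   # sequential read speed
posting_list_mb = 2 * 1024       # "a couple GB" for a very common word

worst_case_seconds = posting_list_mb / disk_throughput_mb_per_s
print(worst_case_seconds)  # ~20 seconds before the query can complete
```

Twenty-odd seconds for a single cold query is exactly the long tail
that pinning the big lists in RAM avoids.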
> It's probably also best to divide up machines by tasks. The machines which
> are responsible for scanning through posting lists should probably not also
> be responsible for retrieving stored docs and applying highlighting.
Yes, they definitely should at least be logically split. In practice,
it might work well to do these on the same machine, since scanning
should be memory-bandwidth limited and retrieving will be disk-I/O
limited.
> search machines should merely return document identifiers to the boss
> node(s), which will then request docs from the storage cluster.
Yes, and I'd love to have a simple protocol (likely HTTP) that works
between these pieces.
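For illustration, such a protocol could be as small as a single GET
per query. Everything below --- the path, parameter names, and
response format --- is hypothetical, not an existing KinoSearch
interface:

```
GET /search?q=perl+indexing&num=10 HTTP/1.1
Host: searcher-04.internal

HTTP/1.1 200 OK
Content-Type: text/plain

doc-1842 0.91
doc-0317 0.87
doc-5520 0.64
```

A plain-text (or HTTP) interface like this is what lets the Boss stay
ignorant of how each Searcher is implemented.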
> Or maybe it's best to just let solutions arise on their own. Distributed
> indexing is easier than distributed searching.
I think this can be thought about later. There are many solutions here
that could work.
>> To keep the latency down, this may require hosting multiple smaller
>> shards per machine.
> That could work. How about one shard per processor? I think that might
> actually be a good way to exploit multiple processors -- better than
> ithreads (which KS doesn't do anyhow).
It won't be more than one shard per processor, but depending on how
processor-efficient we can make things, I think it's going to be less
than that. It's going to be different for different setups. I'm fuzzy
on the details, but I think it's going to depend more on the number of
sockets than the number of cores. Probably best to assume it needs to
be configurable.
> First, QueryFilter objects can't be passed between nodes because they're
> implemented as cached BitVectors, which are too big to send over the network
> with each search. To address this, you'd have to cook up special
> SearchServer subclasses that kept their own QueryFilter objects around and
> knew how to interpret special requests from the SearchClient.
I think this is where my vision starts to differ from yours. The
concept of a BitVector implies index centralization, where I'd rather
have a lightweight Boss only loosely connected to each Searcher.
Anything that was previously done with a QueryFilter could instead be
done with an extra AndQuery clause against an unanalyzed field.
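As a sketch of that idea (the index layout and field names here are
invented for illustration --- this is not KinoSearch's API), a filter
is just the posting list of a term in an unanalyzed field, so it can
be folded into the query as one more AND clause:

```python
# Hypothetical per-searcher inverted index: term -> doc ids on this shard.
# "site:example" is a term in an unanalyzed field, playing the role a
# cached QueryFilter BitVector would otherwise play.
index = {
    "body:search":  {1, 2, 5, 9},
    "body:engine":  {2, 5, 7},
    "site:example": {2, 3, 5, 8},
}

def and_query(terms):
    """Intersect posting lists -- a filter is just another term."""
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= index[term]
    return docs

# The old-style filter folded in as an extra clause:
print(sorted(and_query(["body:search", "body:engine", "site:example"])))
# -> [2, 5]
```

Because each Searcher evaluates the extra clause against its own
shard, nothing filter-shaped ever has to cross the network.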
> Second, KinoSearch's present multi-machine sorting implementation isn't very
> well-done IMO.
Not a worry. In the short run this is unlikely to be a problem, and
in the long run it can be optimized.
> Last, Searchable's collect() API, soon to go public...
> $searcher->collect( query => $query, collector => $hit_collector );
> ... is single-machine as well, because the HitCollector would have to be
> serialized and sent to each node.
I'm not following why HitCollector would need to be serialized, but
this might be a limitation of my procedural world view. I would want
things to be more loosely coupled than this. Rather than using the
same KinoSearch architecture for both the Boss and the Worker, I would
have them interact via a limited, well-defined interface (probably
HTTP, although possibly FCGI or the memcached protocol).
The Boss would receive a text query, and pass it on --- as text --- to
a pool of Searchers. The Searchers, using the current KinoSearch
implementation wrapped by a thin persistence layer, would perform the
search and return a list of Documents and Scores. The Boss would
select the highest-scoring results from the responses (skipping
Workers that
take too long), and (if desired) make a second set of requests to
another pool to get highlighted excerpts.
The Boss would be completely ignorant of how the Searchers search ---
for all it knows, they could be running Lucene, making a SQL query, or
even be proxying to Google. The Searchers would be completely
independent of one another --- even document numbers would not need to
be presumed unique.
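A minimal sketch of that Boss loop (in Python for brevity; the helper
names and shard representation are invented --- a real Boss would make
HTTP requests to the Searchers rather than call a local function):

```python
import heapq

def search_shard(query, shard):
    """Stand-in for an HTTP request to one Searcher.

    A real implementation would raise TimeoutError for a slow Worker.
    """
    return shard.get(query, [])

def boss_search(query, shards, top_n=3):
    """Fan a text query out to all Searchers and merge their hits."""
    hits = []
    for shard in shards:
        try:
            hits.extend(search_shard(query, shard))
        except TimeoutError:
            continue  # skip Workers that take too long
    # Highest-scoring (score, doc) pairs across all responding shards.
    return heapq.nlargest(top_n, hits)

# Two independent Searchers; doc ids need not be globally unique.
shards = [
    {"perl": [(0.9, "doc-a"), (0.4, "doc-b")]},
    {"perl": [(0.7, "doc-c")]},
]
print(boss_search("perl", shards))
# -> [(0.9, 'doc-a'), (0.7, 'doc-c'), (0.4, 'doc-b')]
```

Note that the merge step only looks at scores and opaque document
handles, which is what keeps the Boss ignorant of how each Searcher
produced its results.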
Are there disadvantages to this approach? I like its transparency,
and that it doesn't add complexity to the KinoSearch core.
nate [at] verse
KinoSearch mailing list
KinoSearch [at] rectangular