
nate at verse
Aug 1, 2008, 12:32 PM
Post #6 of 6
(4633 views)
Permalink
|
On Wed, Jul 30, 2008 at 1:38 PM, Marvin Humphrey <marvin [at] rectangular> wrote: > > On Jul 30, 2008, at 11:03 AM, Nathan Kurz wrote: > >> The bit vectors become >> unworkable once you have a large number of facets (think of using >> 'author' as a facet on a large dataset), > > If you anticipate needing a large number of facets with small data sets, > that's where you don't use BitVectors any more. Great, I think we agree here. The question might be what a 'large number' means in this context. My guess is that the 'large number' break point is somewhere between 10 and 100. My other guess (probably skewed by my own interests) is that most of the interesting applications are going to require 'large numbers' of facets, often 'very large numbers'. Since solutions that work with a the large number of facets can (IMO) work adequately for the situations with small numbers (but not vice versa), the large number solution strikes me as primary. > Instead you use a short > array of sorted integers, possibly compressed -- I think Solr uses a class > called SortedVIntList for this. If we added such a feature, we'd need to > create methods to and() and or() BitVectors against these integer lists > instead of against other BitVectors. Rather than adding a separate infrastructure for doing Boolean operations across BitVectors and SortedVIntList, I would strongly suggest shoehorning these structures into the existing index structures and use the existing scoring system. Just as as should be possible for an index to use P4Delta compression, it should be possible for an index to store information as a BitVector. This buys you a lot more flexibility with a cleaner infrastructure. I'm also suggesting that the current field specific inverted index would work fine for many applications. >> and the wrapped query >> approach doesn't spread well to a cluster (as one would be forced to >> round-trip the full list of matching docs over the net and not just >> the top-n). > > You never need to communicate the full list of matching docs across the > network. All you need to communicate is the top-n matching docs and the > facet counts, which isn't a problem. Great. This was my misunderstanding of the proposed solution. > How to pass those facet counts, I'm not exactly sure. There are lots of solutions to this, and it doesn't strike me as a worry at all. It can be solved (and changed) after the internals are sorted out. > Caching result seets for each facet into BitVectors or what-have-you is > important because matching a particular facet is likely to be an expensive > query. Here are the top facet counts for a search at Amazon for "moon": > > * Books (343,415) > * MP3 Downloads (35,160) > * Home & Garden (22,234) > * Music (7,086) > ... > > We don't want to execute 'category:books', 'category:mp3_downloads', etc. > with each search query. Therefore, we have to cache the result sets for > those facets. You are right that we don't want to go through the inverted index for each of the possible facets, but this is why I was suggesting that we could step through the forward indexed (doc->termlist) DocVectors to count facets. While this could be counted as expensive, I think it would be on a par with scoring a relatively common search term in the inverted index, say, the search for 'moon' itself. I also think it would be more efficient than merging 100 BitVectors, although less efficient than merging 10. > If I understand correctly, we'd get an even bigger benefit from index-time > preparation of facet result sets when there are several search processes, > since the data for the facet result sets would be cached in shared system > memory pages, instead of in per-process BitVector objects. Yes, a fine plan. There is definitely a benefit to having these cached in files rather than in per-process memory. Accessing these files via mmap gives some extra benefit, but since the file system effectively uses mmap internally, you'll get most of this cross-proces benefit using standard IO as well. > I think there's some confusion as to the role of DocVector. Each DocVector > is basically a self-contained single-document inverted index, and retrieving > a DocVector is approximately as expensive as retrieving a Doc. Using > DocVector while iterating through posting lists isn't feasible; retrieving > one DocVector for each of the top hits after you've finished searching is. Could it be made to be feasible? While I understand that random access to the DocVectors will be expensive, I was hoping that sequentially stepping through the TermVectors for a field could be done very efficiently, especially for unanalyzed fields with a limited vocabulary (ie, for a field used to store facets). Conceptually, stepping over a doc->termList index doesn't strike me as inherently any worse than stepping through the term->docList inverted index. >> Caching, if it is to occur, happens in a >> centralized way (memcached) at the top level for the entire query, >> rather than component by component within the cluster. > > I can't quite wrap my head around that. Can you please elaborate? I'm offering a philosophical conviction that it's best to do the caching at the highest level possible rather than complicating the lower levels. Rather than farming the request out to the cluster and having each member check a local cache for facet values, I'm suggesting that it would be better for the central dispatch to cache the finished client response. I think this is easier to scale, makes better use of system resources, and yields simpler and easier to debug code. >> ps. I can't tell at a glance how KinoSearch currently treats fields >> in its indexes. For example, does each field get its own lexicon? > > Yes, right now that is the case. [...] > Per-field lexicon files are not perfectly satisfactory, but explaining all > the reasoning would require an extended digression. Great, that's what I was hoping. I'm not sure what the downsides are, but having per field lexicons means that limited vocabulary fields can be stored much more efficiently than if they shared a lexicon with everyone else. Nathan Kurz nate [at] verse _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|