marvin at rectangular
May 9, 2007, 6:47 AM
Post #2 of 2
Coding and schema strategies for partial-match search?
On May 9, 2007, at 1:49 AM, Roger Burton West wrote:
> The document collection to which I intend to apply KinoSearch is set up
> as a file hierarchy. As it might be:
> I need to be able to restrict searches by file path, automatically
> including all subdirectories of the specified directory: for example,
> someone might want to search "only /foo/bar", or "only /foo" (which
> would include /foo/bar and /foo/baz) - or in more complex cases "both
> /foo/baz and /qux".
I'd lean towards handling things with a phrase match: '"foo bar"'.
You'll want to assign a custom analyzer to your filepath field.
Don't use the stock PolyAnalyzer, because you don't want stemming.
I'd recommend a Tokenizer, possibly augmented with an LCNormalizer
(wrapping the two in a custom PolyAnalyzer) if you want
case-insensitive matching. The default Tokenizer will break
'/foo/bar/1.html' into four tokens: qw( foo bar 1 html ). You might
want to supply your own token_re customized for splitting filepaths.
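To make the tokenization difference concrete, here's a rough, library-independent sketch in Python. It does not use the KinoSearch API. WORD_RE only approximates a word-character tokenizer, and PATH_RE is a hypothetical pattern of the kind you might pass as token_re to keep each path component whole.

```python
import re

# Approximation of a word-character tokenizer: each run of \w is a token.
WORD_RE = re.compile(r"\w+")
# Hypothetical filepath-oriented pattern: each path component is one token.
PATH_RE = re.compile(r"[^/]+")

def tokenize(path, pattern, lowercase=True):
    """Split a path into tokens, lowercasing first for case-insensitive matching."""
    text = path.lower() if lowercase else path
    return pattern.findall(text)

print(tokenize("/foo/bar/1.html", WORD_RE))  # ['foo', 'bar', '1', 'html']
print(tokenize("/Foo/Bar/1.html", PATH_RE))  # ['foo', 'bar', '1.html']
```

Splitting on "/" alone keeps "1.html" as a single token, which may or may not be what you want for the filename part. For directory restriction only the directory tokens matter.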
The only question then is how to anchor it so that you don't match
/home/foo/bar/. The easiest way to pull that off is to prepend a
symbolic "root" to each directory name, e.g.
"/rootdirectory/foo/bar", at both index-time and search-time.
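Here's a small Python sketch of why the symbolic root anchors the match. It only simulates phrase matching over token lists and is not KinoSearch code. The helper names path_tokens and phrase_match are made up for illustration, and the sketch assumes one token per path component.

```python
import re

ROOT = "rootdirectory"  # symbolic root token; the name itself is arbitrary

def path_tokens(path, anchored):
    """Tokenize a path by component, optionally prepending the symbolic root."""
    tokens = re.findall(r"[^/]+", path.lower())
    return [ROOT] + tokens if anchored else tokens

def phrase_match(doc_tokens, phrase):
    """True if `phrase` occurs as a contiguous run of tokens in `doc_tokens`."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))

# Without the root, the restriction "/foo/bar" also matches /home/foo/bar/.
print(phrase_match(path_tokens("/home/foo/bar/1.html", False),
                   path_tokens("/foo/bar", False)))  # True

# With the root on both sides, only paths starting at /foo/bar match.
print(phrase_match(path_tokens("/home/foo/bar/1.html", True),
                   path_tokens("/foo/bar", True)))   # False
print(phrase_match(path_tokens("/foo/bar/1.html", True),
                   path_tokens("/foo/bar", True)))   # True
```

The root token should be something that never occurs as a real directory name in the collection. Otherwise a path like /x/rootdirectory/foo/bar would still match.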
You'd add the restriction to the search by creating a main
BooleanQuery with two required clauses: the user query, probably
generated by taking the output from a QueryParser, and a query
representing the filepath restriction, which would probably be a
BooleanQuery itself with multiple optional sub PhraseQueries.
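The shape of that combined query can be sketched in plain Python over an in-memory list of documents. This is only an illustration of the boolean logic (two required clauses, the second being an OR over directory phrases), not the KinoSearch API. The search function, the sample documents, and the single-term "user query" are all assumptions made for the example.

```python
import re

ROOT = "rootdirectory"  # hypothetical symbolic root token

def path_tokens(path):
    """One token per path component, with the symbolic root prepended."""
    return [ROOT] + re.findall(r"[^/]+", path.lower())

def phrase_match(doc_tokens, phrase):
    """True if `phrase` occurs as a contiguous run of tokens in `doc_tokens`."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))

def search(docs, user_term, allowed_dirs):
    """Required clause 1: the content contains user_term.
    Required clause 2: the filepath matches at least one directory phrase."""
    phrases = [path_tokens(d) for d in allowed_dirs]
    hits = []
    for doc in docs:
        matches_user_query = user_term.lower() in doc["content"].lower().split()
        matches_path = any(phrase_match(path_tokens(doc["filepath"]), p)
                           for p in phrases)
        if matches_user_query and matches_path:
            hits.append(doc["filepath"])
    return hits

docs = [
    {"filepath": "/foo/bar/1.html", "content": "kino search notes"},
    {"filepath": "/foo/baz/2.html", "content": "search tips"},
    {"filepath": "/qux/3.html", "content": "more search tips"},
    {"filepath": "/home/foo/baz/4.html", "content": "search elsewhere"},
    {"filepath": "/qux/5.html", "content": "unrelated"},
]

# "both /foo/baz and /qux", restricted to documents mentioning "search"
print(search(docs, "search", ["/foo/baz", "/qux"]))
# ['/foo/baz/2.html', '/qux/3.html']
```

Restricting to "only /foo" works the same way with a single phrase, and picks up everything under /foo/bar and /foo/baz because the phrase is a prefix of their token sequences.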
There's an example of how to do something similar (limiting search by
the contents of a 'category' field) in the BooleanQuery docs.