
marvin at rectangular
May 9, 2007, 6:47 AM
Post #2 of 2
Coding and schema strategies for partial-match search?
On May 9, 2007, at 1:49 AM, Roger Burton West wrote:

> The document collection to which I intend to apply KinoSearch is set up
> as a file hierarchy. As it might be:
>
>   /foo/bar/1.html
>   /foo/bar/2.html
>   /foo/baz/1.html
>   /qux/1.html
>
> I need to be able to restrict searches by file path, automatically
> including all subdirectories of the specified directory: for example,
> someone might want to search "only /foo/bar", or "only /foo" (which
> would include /foo/bar and /foo/baz) - or in more complex cases "both
> /foo/baz and /qux".

I'd lean towards handling things with a phrase match: '"foo bar"'.

You'll want to assign a custom analyzer to your filepath field. Don't use the stock PolyAnalyzer, because you don't want stemming. I'd recommend a Tokenizer, possibly augmented with an LCNormalizer (wrapping the two in a custom PolyAnalyzer) if you want case-insensitive matching.

The default Tokenizer will break '/foo/bar/1.html' into four tokens: qw( foo bar 1 html ). You might want to supply your own token_re customized for splitting filepaths.

The only question then is how to anchor it so that you don't match /home/foo/bar/. The easiest way to pull that off is to prepend a symbolic "root" to each directory name, e.g. "/rootdirectory/foo/bar", at both index-time and search-time.

You'd add the restriction to the search by creating a main BooleanQuery with two required clauses: the user query, probably generated by taking the output from a QueryParser, and a query representing the filepath restriction, which would probably be a BooleanQuery itself with multiple optional sub PhraseQueries.

There's an example of how to do something similar (limiting search by the contents of a 'category' field) in the BooleanQuery docs.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
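
To make the anchoring idea concrete, here is a minimal sketch of the logic in plain Python. It deliberately does not use KinoSearch or any search library: the function names (tokenize_path, phrase_match, in_any_directory), the ROOT token value, and the '[^/]+' token pattern are all assumptions made up for illustration. It only shows why prepending a root token at both index-time and search-time keeps a phrase for /foo/bar from matching /home/foo/bar, and how several optional directory phrases combine.

```python
import re

# Hypothetical sentinel token; assumed never to occur as a real path component.
ROOT = "rootdirectory"

# Assumed token pattern for file paths: split on '/' only, so that
# '1.html' stays one token instead of becoming '1' and 'html'.
TOKEN_RE = re.compile(r"[^/]+")


def tokenize_path(path, lowercase=False):
    """Turn a path into tokens, prefixed with the ROOT sentinel.

    The same function is applied to indexed paths and to the
    directories used as search restrictions.
    """
    tokens = TOKEN_RE.findall(path)
    if lowercase:
        # Optional case-insensitive matching.
        tokens = [t.lower() for t in tokens]
    return [ROOT] + tokens


def phrase_match(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occur contiguously in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def in_any_directory(path, directories):
    """Return True if path lies under at least one of the directories.

    This plays the role of the filepath restriction: several optional
    phrase matches, of which at least one has to succeed.
    """
    doc_tokens = tokenize_path(path)
    return any(phrase_match(doc_tokens, tokenize_path(d)) for d in directories)


paths = [
    "/foo/bar/1.html",
    "/foo/bar/2.html",
    "/foo/baz/1.html",
    "/qux/1.html",
    "/home/foo/bar/3.html",
]

# "only /foo/bar": the ROOT token keeps /home/foo/bar/3.html out.
only_foo_bar = [p for p in paths if in_any_directory(p, ["/foo/bar"])]

# "both /foo/baz and /qux"
baz_or_qux = [p for p in paths if in_any_directory(p, ["/foo/baz", "/qux"])]
```

In a real index, this restriction would be the second required clause next to the user's query. The sketch covers only the path side.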