
marvin at rectangular
Jan 22, 2008, 2:35 PM
Post #1 of 6
Re: KinoSearch feature suggestions
Hi,

On Jan 21, 2008, at 2:16 PM, Father Chrysostomos wrote:

> I’d like to request that a few features be added to KinoSearch. I
> need these features myself, so I’m willing to contribute patches.
> Please let me know what you think.

I'm going to take the liberty of cc'ing this to the KinoSearch mailing
list, since it was filed as a public rt.cpan.org issue.

> 1. Wildcards in search queries

I am in favor of wildcards being available via a separate distribution,
and I would very much like to hammer out an elegant low-level API to
support such a distro. A lot of the work I have been doing lately is
intended to facilitate such endeavors.

Wildcards should not be in core KS, because they are by their nature
vastly more expensive than whole-word queries. I have observed that
their comparative cost often comes as an unpleasant shock. However,
providing a separate distro will prompt people to assess the costs with
open eyes.

> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create
> non-contiguous excerpts (which I’m calling ‘summaries’; the
> contiguous sub-parts of each summary I’m calling excerpts):
>
>     $highlighter->add_spec( excerpt_length => 50, summary_length => 200, ...);
>
> The highlighter would find the most important word to highlight (as
> it currently does), and create a 50-char excerpt. Then it would
> create an excerpt for the second most important word and add that
> (removing overlap if necessary), repeating this process until the
> summary is the right length.

I think this should be implemented by abstracting out the excerpt
selection engine, analogous to the way that
KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter
abstract out other functionality used by the Highlighter.

How about if we outsource excerpting to subclasses of a new class,
KinoSearch::Highlight::Excerpter? Then you could release your own
distro, e.g. KSx::Highlight::SummaryExcerpter.

> 3. Custom ellipsis marks:
>
>     $highlighter->add_spec( ellipsis_mark => "\x{2026}", ... )

I understand the problem, but adding such a specific param to
Highlighter->add_spec seems brittle. I think this should be something
which is set via a custom excerpting engine.

Incidentally, Highlighter's treatment of the ellipsis also prompted part
of <http://rt.cpan.org/Public/Bug/Display.html?id=25400>.

> 4. Pagination (another highlighter feature): An index field could
> be designated as the ‘page offset’ field, containing byte offsets
> of page breaks.
>
>     $highlighter->add_spec(
>         page_offset_field     => 'pageoffsets',
>         page_offset_formatter => $object,
>     );
>
> And $object would have to have a page_label method:
>
>     sub page_label { my ($self, $fields_hashref, $page_no) = @_; ... }

This feature also seems like it should belong to a particular Excerpter
implementation.

> Though it might be more complicated, maybe we could have page
> breaks (chr 12) recorded automatically when the index is created.
> Then ‘page_offset_field’ won’t be necessary.

That would work well. It's trivial to implement efficiently using C/XS,
because you can just zip along the string counting page breaks.

    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* Count form feeds (chr 12) in the string value of input_sv. */
    long
    count_breaks(SV *input_sv)
    {
        STRLEN len;
        char *ptr = SvPV(input_sv, len);
        /* Derive the end from the pointer and length SvPV returned, so
         * the bounds hold even when the SV had to be stringified. */
        char *end = ptr + len;
        long count = 0;
        while (ptr < end) {
            if (*ptr++ == 12)
                count++;
        }
        return count;
    }

With Perl, tr// works for efficient character counting, IIRC.

> For examples of 2 and 4 in use, see
> <http://synodinresistance.org/cgi-bin/anazetesis?all=1&and-glossa=&and-morphe=&g=en&q=thing>
> (which I’d like to switch to using KinoSearch, because it’s
> currently too slow).

I admire the sophistication of the excerpting provided. Kudos.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
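
The page-break scan discussed in the message can also be sketched in plain C, without the Perl API, extended to record the byte offsets that request 4 talks about. This is only an illustration under stated assumptions: the function names (count_page_breaks, find_page_breaks, page_for_offset) are hypothetical and are not part of KinoSearch, and the text is assumed to be ASCII or UTF-8, where the byte 0x0C can only ever be a form feed.

```c
#include <stddef.h>

/* Form feed (chr 12), the page-break character mentioned in the thread. */
#define PAGE_BREAK '\f'

/* Count the page breaks in buf[0..len). */
size_t count_page_breaks(const char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == PAGE_BREAK)
            count++;
    }
    return count;
}

/* Store the byte offset of each page break in offsets[], writing at most
 * max entries. Returns the total number of breaks found, which may be
 * larger than max. */
size_t find_page_breaks(const char *buf, size_t len,
                        size_t *offsets, size_t max)
{
    size_t found = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == PAGE_BREAK) {
            if (found < max)
                offsets[found] = i;
            found++;
        }
    }
    return found;
}

/* Zero-based page index for byte position pos: the number of recorded
 * breaks that lie strictly before pos. offsets[] must be ascending. */
size_t page_for_offset(const size_t *offsets, size_t n, size_t pos)
{
    size_t page = 0;
    while (page < n && offsets[page] < pos)
        page++;
    return page;
}
```

An excerpt starting at byte position pos could then be labelled by passing page_for_offset(offsets, n, pos) to something like the page_label method proposed in the quoted request. How such offsets would actually be stored in an index is left open here.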