marvin at rectangular
Sep 29, 2007, 9:01 AM
Post #1 of 11
On Sep 7, 2007, at 1:24 PM, Nathan Kurz wrote:
> On 9/7/07, Marvin Humphrey <marvin [at] rectangular> wrote:
>> My main goal with serializing Schema is to make the invindex file
>> format self-describing, so that it becomes possible to read one
>> without the need for any auxiliary information.
> Thanks for the explanation. I understand better now.
> I think I agree with all of that, with the small exception that I
> don't think you gain much by procedurally specifying the tokenizer. I
> think specifying it as "tokenizer: whitespace" and letting the reader
> handle the implementation is wiser than specifying a split on "\S+".
There are a couple of ways to do that.
We could have the Tokenizer constructor accept a "type" parameter,
which would then map to a particular implementation. But that
offers no advantage over a second, clearer option...
We could create a suite of officially sanctioned Tokenizer classes.
WhitespaceTokenizer, WordCharTokenizer, etc. However... few of these
are truly useful, and just about all the useful ones can be
implemented using a regex-based Tokenizer. Indeed, that is why KS
has only one Tokenizer class, while Lucene has several. A single
regex-based Tokenizer like the one we have now offers the greatest
combination of flexibility, power, and simplicity of implementation.
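As a rough illustration of why one regex-based class can stand in for a suite of specialized ones, here is a minimal sketch in Python rather than Perl. The class name RegexTokenizer, its tokenize method, and the two patterns are assumptions made for the example; this is not the KinoSearch API.

```python
import re


class RegexTokenizer:
    """Hypothetical single tokenizer class, configured only by a pattern."""

    def __init__(self, token_re=r"\w+"):
        # token_re describes what a token looks like, not what separates tokens.
        self.token_re = re.compile(token_re)

    def tokenize(self, text):
        # Every non-overlapping match of token_re becomes one token.
        return [m.group(0) for m in self.token_re.finditer(text)]


# Two "classes" from a hypothetical sanctioned suite, expressed as patterns.
whitespace_tokenizer = RegexTokenizer(r"\S+")  # runs of non-whitespace
word_char_tokenizer = RegexTokenizer(r"\w+")   # runs of word characters

sample = "it's a self-describing format"
whitespace_tokens = whitespace_tokenizer.tokenize(sample)
word_char_tokens = word_char_tokenizer.tokenize(sample)
```

Under these assumptions, what would otherwise be separate WhitespaceTokenizer and WordCharTokenizer classes differ only in the pattern handed to the constructor.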
The problem we face now, though, is how to specify the token_re.
By the way, I suspect it was only a brain-hiccup on your part, but
specifying a token_re of "\S+" is not the same as a split -- it's
actually the inverse. The regex is used to match the tokens
themselves rather than the separators between tokens.
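The difference between matching and splitting can be sketched in a few lines. This uses Python's re module purely for illustration (an assumption of the example, not how KS does it); the sample text and variable names are made up.

```python
import re

text = "foo  bar baz"

# Matching with \S+ returns the tokens themselves.
tokens_by_match = re.findall(r"\S+", text)

# Splitting on \S+ throws the tokens away and keeps what lies between them.
pieces_by_split = re.split(r"\S+", text)

# To get the same tokens from a split, the pattern must be inverted to \s+.
tokens_by_inverse_split = re.split(r"\s+", text)
```

Here tokens_by_match and tokens_by_inverse_split both hold the three words, while pieces_by_split holds only the whitespace runs plus empty strings at either end.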
> If you are trying to be language-agnostic, requiring the reader to be
> able to handle what could be arbitrary expressions in a particular
> regexp language seems onerous, even if it is a pretty standard one.
I thought about this for a while. Perl-compatible regular expression
syntax is very widespread. There isn't an official standard which
freezes the syntax a la POSIX, which is somewhat dissatisfying
because it means you can't guarantee compliance. But for practical
purposes, common regexes ought to be portable.
> In particular, I can see wanting a straight C implementation using
> flex rather than a regexp library.
> I don't feel strongly about this, though, since if one really wants to
> do this one could just do it non-portably.
Yes. A flex-based tokenizer for a C implementation would be cool; it
just wouldn't be part of the official list of blessed Analyzers.
Someone could release a KSx version to CPAN if so motivated -- but
back-compat issues would be handled as an independent project outside
of the KS core.
KinoSearch mailing list
KinoSearch [at] rectangular