
marvin at rectangular
Feb 27, 2007, 1:20 PM
Post #4 of 10
On Feb 27, 2007, at 4:12 AM, Marc Elser wrote:

>> package MySchema::$field_name;
>> use base qw( KinoSearch::Schema::Field );

(whoops, "Field" should have been "FieldSpec" -- the class has been renamed since I last did work on KS::Simple.)

> Yes there are multiple specs because I have multiple indexes.

OK. Would it be feasible to create static Schemas that know everything except for the field names? Here's how the KS 0.20 API could change to accommodate your needs.

    # MySchema.pm
    package UnAnalyzedFieldSpec;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub analyzed {0}

    package MySchema;
    use base qw( KinoSearch::Schema );
    sub analyzer {
        KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    }

    # invindexer.plx
    MySchema->init_field( title   => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( content => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( url     => 'UnAnalyzedFieldSpec' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => MySchema->clobber('/path/to/invindex'),
    );

That's closer to the earlier KS design, and I think ought to work for you. Yes? Most people would put the calls to init_field() in MySchema.pm, but in your case you would defer them until your scripts.

FWIW, this design breaks with the ORM model which Schema was loosely derived from, since there's no longer a one-to-one mapping between fields in the index and classes directly below the Schema subclass's package. Not that that's a problem.

>> Do they ever change?

> Well, they do change occasionally and then the index with the
> changed field is beeing rebuilt.

The only problem that would arise is if you change up the FieldSpec subclass after data has been indexed using it, then try to search or modify the index. The Schema architecture will always have that vulnerability. But we already had the same problem with Analyzers: with KS 0.15, if you switch up an analyzer between index-time and search-time, you get garbage.

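To make that analyzer failure mode concrete, here is a toy sketch. It is plain Perl and does not use KinoSearch at all: analyze_lowercase(), analyze_as_is(), build_index() and search() are stand-ins invented for illustration. The index is built with a lowercasing analyzer, so a search that analyzes the query differently silently finds nothing.

```perl
use strict;
use warnings;

# Toy analyzers (hypothetical, not KinoSearch classes): text in, terms out.
sub analyze_lowercase {
    my ($text) = @_;
    return map { lc $_ } grep { length $_ } split /\W+/, $text;
}

sub analyze_as_is {
    my ($text) = @_;
    return grep { length $_ } split /\W+/, $text;
}

# Build a toy inverted index: term => { doc id => 1 }.
sub build_index {
    my ( $analyzer, %docs ) = @_;
    my %index;
    for my $id ( keys %docs ) {
        $index{$_}{$id} = 1 for $analyzer->( $docs{$id} );
    }
    return \%index;
}

# Look up every term the given analyzer produces for the query.
sub search {
    my ( $index, $analyzer, $query ) = @_;
    my %hits;
    for my $term ( $analyzer->($query) ) {
        $hits{$_} = 1 for keys %{ $index->{$term} || {} };
    }
    return sort keys %hits;
}

my $index = build_index(
    \&analyze_lowercase,
    doc1 => 'KinoSearch is written in Perl',
);

# Same analyzer at index-time and search-time: the doc is found.
my @same = search( $index, \&analyze_lowercase, 'Perl' );

# Different analyzer at search-time: "Perl" never matches "perl".
my @different = search( $index, \&analyze_as_is, 'Perl' );

print scalar(@same),      " hit(s) with the index-time analyzer\n";
print scalar(@different), " hit(s) with a mismatched analyzer\n";
```

Nothing in the second search reports an error; the mismatch just produces an empty result, which is the same kind of trouble a changed FieldSpec subclass would cause.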
>> Do you ever need to add fields in the middle of an indexing
>> session or do you know them all up front?

> I know them upfront because they're defined in an xml which is
> parsed, but they never change in the middle of indexing.

OK. I'd like to accommodate people who want to add new fields in the middle of an indexing session, too. That's hard, but maybe it's possible.

The big tradeoff is that if fields aren't limited to a known, finite set, a lot of validation has to be turned off. For instance, InvIndexer->delete_by_term verifies that the field in question 1) is known, and 2) is spec'd as indexed. If it isn't known, you probably misspelled it; if it wasn't indexed, no docs will be found and no deletions will occur -- and that's something you probably want to know about. But if the fact that a Schema doesn't know about a field doesn't mean anything, then we have to accept silent failure in both those cases.

> This still leaves me with the problem you can not only specify the
> fields you want to index in our config-xml but also the indexes you
> want to create. So I would also have to define the
> KinoSearch::Schema classes through an eval, but it would at least
> save me another eval for setting up the fields.

Hmm. Technically, you don't need an eval -- you can manipulate @ISA directly if you turn off strict refs.

    {
        no strict 'refs';
        @{ $class . '::ISA' } = ('KinoSearch::Schema');
    }

I'd consider either that or an eval acceptable if we can't figure out a better way to handle things, but it's still somewhat inelegant.

If Schemas were objects rather than classes -- which is something I considered -- we wouldn't have this problem.

    my $schema = KinoSearch::Schema->new(
        analyzer => KinoSearch::Analysis::Tokenizer->new,
    );
    $schema->spec_field( name => 'title' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );

However, I rejected that design because I know that if we did that, less experienced users would copy and paste the schema code between index and search scripts, violating DRY and leading to a bunch of nasty errors when the copies get out of sync and conflict. Shunting everyone into module use is less error-prone and encourages good programming practice.

However, your particular use case is less well served. In Perl, which allows you to create classes on the fly, we can still pull it off. A less dynamic language might not be able to...

> But maybe you also know of a better solution for the subclassing
> problem for every index.

OK... new thought.... how about allowing instances of your Schema subclass to add fields?

    # invindexer.plx
    my $schema = MySchema->new;
    $schema->add_field( title   => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( content => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( url     => 'UnAnalyzedFieldSpec' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );

init_field() would be a class method only. Fields so registered would serve as the starter set for each instance.

add_field() would be an instance method only. Fields so registered would only be known to the object it was called upon.

Hmm... add_field() actually solves another problem. It allows us to record a mapping of field name to FieldSpec class name in segments_XXX.yaml, then either validate the mapping against the schema used to open the file, or register new mappings on an instance without polluting class variables.

You're lucky in that you know all the field names at search time, so you can create the Schema on the fly then, too. But say you didn't... having $schema->open('/path/to/invindex') call add_field() and build your field list for you solves that problem.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
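
P.S. Here is a rough, self-contained sketch of the init_field()/add_field() split, combined with the @ISA trick for creating one Schema subclass per index without an eval. ToySchema, make_schema_class(), field_names() and the class name MySchema::Products are stand-ins made up for illustration. None of this is KinoSearch code, and a real implementation may well differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a schema base class; not the KinoSearch API.
package ToySchema;

my %class_fields;    # class name => { field name => FieldSpec class name }

# Class method only: registers a field in the starter set for the class.
sub init_field {
    my ( $class, $name, $spec ) = @_;
    die "init_field() is a class method\n" if ref $class;
    $class_fields{$class}{$name} = $spec;
    return;
}

# Each instance starts with a private copy of its class's starter set.
sub new {
    my $class   = shift;
    my %starter = %{ $class_fields{$class} || {} };
    return bless { fields => \%starter }, $class;
}

# Instance method only: the field is known to this object alone.
sub add_field {
    my ( $self, $name, $spec ) = @_;
    die "add_field() is an instance method\n" unless ref $self;
    $self->{fields}{$name} = $spec;
    return;
}

sub field_names {
    my $self = shift;
    return sort keys %{ $self->{fields} };
}

package main;

# Create a subclass on the fly by assigning to its @ISA, no string eval.
sub make_schema_class {
    my $class = shift;
    no strict 'refs';
    @{ $class . '::ISA' } = ('ToySchema');
    return $class;
}

my $class = make_schema_class('MySchema::Products');
$class->init_field( title => 'FieldSpec' );    # starter set for the class

my $first = $class->new;
$first->add_field( url => 'UnAnalyzedFieldSpec' );    # this object only

my $second = $class->new;

print join( ',', $first->field_names ),  "\n";    # title,url
print join( ',', $second->field_names ), "\n";    # title
```

The second object never sees "url", which is the point: instance-level registration leaves the class-level starter set untouched.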