
Mailing List Archive: kinosearch: discuss

Dynamic schemas - How?


melser at gmx

Feb 26, 2007, 11:06 PM

Post #1 of 10 (302 views)
Permalink
Dynamic schemas - How?

Hi Marvin,

I just took a look at KinoSearch 0.20_01 because I've been longing for
this new release.

To my great surprise, I saw that the index structure is now based on
subclassing KinoSearch::Schema::FieldSpec. That is a big problem
for users like me who dynamically create indexes based on columns in
our SQL tables which can be flagged to be indexed. Of course, statically
defined subclasses of KinoSearch::Schema are not possible with this setup.

So my question is: how can I define dynamic Schemas in KS 0.20???

Thanks for letting me know, because not being able to create dynamic
schemas would be a no-go for 0.20 and leave me stuck with 0.14 (with
UTF-8 patches), and I really hope I'm not on a one-way street now.

Best regards,

Marc


marvin at rectangular

Feb 27, 2007, 1:38 AM

Post #2 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

On Feb 26, 2007, at 10:30 PM, Marc Elser wrote:

> I just took a look at KinoSearch 0.20_01 because I've been longing
> for this new release.
>
> To my great surprise, I saw that the index structure is now based on
> subclassing KinoSearch::Schema::FieldSpec. That is a big
> problem for users like me who dynamically create indexes based on
> columns in our SQL tables which can be flagged to be indexed. Of
> course, statically defined subclasses of KinoSearch::Schema are not
> possible with this setup.

Maybe not, but you can simulate them, because Perl is dynamic.

> how can I define dynamic Schemas in KS 0.20???

At index time, it's possible, though kludgy.

    for my $field_name (@field_names) {
        eval qq|
            package MySchema::$field_name;
            use base qw( KinoSearch::Schema::Field );
        |;
        die $@ if $@;
    }
    MySchema->init_fields(@field_names);

That's essentially what I'm doing in my provisional implementation of
KinoSearch::Simple.

The bigger problem in your case is what to do at search time. KS no
longer stores information about what fields are indexed, analyzed,
stored, anything -- all that information is communicated via the
Schema. All that gets stored as far as field defs go is a
per-segment field-name-to-field-num mapping.

To kludge up a search-time Schema, you could maybe write a file with
the field names in it to the index directory, then read that file and
generate your Schema subclass on the fly at search-time, too. Not
the most elegant solution, but should be usable, no?
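
A minimal sketch of that sidecar-file idea, for illustration only. The file name field_names.txt, the helper names save_field_names() and load_field_classes(), and the stand-in base class My::FieldSpecStandIn are all assumptions; with KinoSearch installed the generated packages would inherit from the real FieldSpec class instead.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec;

# Stand-in for the real FieldSpec base class, so the sketch runs without KS.
package My::FieldSpecStandIn;
sub new { return bless {}, shift }

package main;

# Index time: write one field name per line to a file in the index directory.
sub save_field_names {
    my ( $index_dir, @field_names ) = @_;
    my $path = File::Spec->catfile( $index_dir, 'field_names.txt' );
    open( my $fh, '>', $path ) or die "Can't open '$path': $!";
    print $fh "$_\n" for @field_names;
    close $fh or die "Can't close '$path': $!";
    return $path;
}

# Search time: read the names back and generate one package per field.
sub load_field_classes {
    my ( $index_dir, $schema_class ) = @_;
    my $path = File::Spec->catfile( $index_dir, 'field_names.txt' );
    open( my $fh, '<', $path ) or die "Can't open '$path': $!";
    chomp( my @field_names = <$fh> );
    close $fh;
    for my $field_name (@field_names) {
        # Only plain identifiers may reach the string eval.
        die "Bad field name '$field_name'"
            unless $field_name =~ /\A[A-Za-z_]\w*\z/;
        eval qq|
            package ${schema_class}::$field_name;
            our \@ISA = ('My::FieldSpecStandIn');
            1;
        | or die $@;
    }
    return @field_names;
}

my $index_dir = tempdir( CLEANUP => 1 );
save_field_names( $index_dir, qw( title content url ) );
my @fields = load_field_classes( $index_dir, 'MySchema' );
```

Checking each name against an identifier pattern before the eval matters here, because the names come from a file on disk rather than from the source code.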

The eventual plan is to improve the situation over what exists in KS
0.15. Right now I have to dedicate most of my devel time to certain
large-scale performance optimizations, but here's some of what I have
in mind...

[ ... ]

OK, the rationale behind Schema got too long so I offloaded it to a
separate email.

[ ... ]

The next feature I'd planned to add to KinoSearch's Schema API is
something called DeepFieldSpec. It would allow KS to fake
one-to-many relationships by applying a common FieldSpec to class names
which share a common prefix.

Maybe we can bend that concept into something that fits your needs.

You don't know the field names in advance at index-time, but you must
know exactly how you're going to define the fields -- otherwise, you
couldn't make this work with KS 0.1x. So we have a field spec. We
just need to associate it with field names.

Are there multiple specs?

Do they ever change?

Do you ever need to add fields in the middle of an indexing session
or do you know them all up front?

What we probably need is a new KinoSearch::Schema class method, akin
to init_fields() but with one more layer of indirection. Instead of
telling your Schema about a field, you tell it about a FieldSpec
subclass and one or more field names. Are you with me? Could that
work for you?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


melser at gmx

Feb 27, 2007, 4:45 AM

Post #3 of 10 (301 views)
Permalink
Dynamic schemas - How? [In reply to]

Hi Marvin,

> At index time, it's possible, though kludgy.
>
> for my $field_name (@field_names) {
> eval qq|
> package MySchema::$field_name;
> use base qw( KinoSearch::Schema::Field );
> |;
> die $@ if $@;
> }
> MySchema->init_fields(@field_names);
>
> That's essentially what I'm doing in my provisional implementation of
> KinoSearch::Simple.
Well, that was also the best I could come up with, and as I do have the
field_names to be indexed/searched also in the searcher I could do the
same evals and also create the KinoSearch::Schema::Field packages with eval.

But I agree that this solution is not very nice, it's just some kind of
workaround.

> You don't know the field names in advance at index-time, but you must
> know exactly how you're going to define the fields -- otherwise, you
> couldn't make this work with KS 0.1x. So we have a field spec. We just
> need to associate it with field names.
>
> Are there multiple specs?
Yes there are multiple specs because I have multiple indexes.
>
> Do they ever change?
Well, they do change occasionally, and then the index with the changed
field is rebuilt.
>
> Do you ever need to add fields in the middle of an indexing session or
> do you know them all up front?
I know them upfront because they're defined in an xml which is parsed,
but they never change in the middle of indexing. You have to re-start
our application which means the config files get parsed and
Apache/mod_perl is restarted.
>
> What we probably need is a new KinoSearch::Schema class method, akin to
> init_fields() but with one more layer of indirection. Instead of
> telling your Schema about a field, you tell it about a FieldSpec
> subclass and one or more field names. Are you with me? Could that
> work for you?
If I understand this right, you wouldn't use any
KinoSearch::Schema::FieldSpec classes anymore but instead you set it up
with a KinoSearch::Schema subclass through a class method which defines
the fields.

This still leaves me with the problem that our config XML lets you
specify not only the fields you want to index but also the indexes you
want to create. So I would also have to define the KinoSearch::Schema
classes through an eval, but it would at least save me another eval for
setting up the fields. But maybe you also know of a better solution to
the problem of subclassing for every index.

Best regards,

Marc


marvin at rectangular

Feb 27, 2007, 1:20 PM

Post #4 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

On Feb 27, 2007, at 4:12 AM, Marc Elser wrote:
>> package MySchema::$field_name;
>> use base qw( KinoSearch::Schema::Field );

(whoops, "Field" should have been "FieldSpec" -- the class has been
renamed since I last did work on KS::Simple.)

> Yes there are multiple specs because I have multiple indexes.

OK. Would it be feasible to create static Schemas that know
everything except for the field names?

Here's how the KS 0.20 API could change to accommodate your needs.

    # MySchema.pm
    package UnAnalyzedFieldSpec;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub analyzed {0}

    package MySchema;
    use base qw( KinoSearch::Schema );
    sub analyzer {
        KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
    }

    # invindexer.plx
    MySchema->init_field( title   => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( content => 'KinoSearch::Schema::FieldSpec' );
    MySchema->init_field( url     => 'UnAnalyzedFieldSpec' );

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => MySchema->clobber('/path/to/invindex'),
    );

That's closer to the earlier KS design, and I think ought to work for
you. Yes?

Most people would put the calls to init_field() in MySchema.pm, but
in your case you would defer them until your scripts.

FWIW, this design breaks with the ORM model which Schema was loosely
derived from, since there's no longer a one-to-one mapping between
fields in the index and classes directly below the Schema subclass's
package. Not that that's a problem.

>> Do they ever change?
> Well, they do change occasionally, and then the index with the
> changed field is rebuilt.

The only problem that would arise is if you change up the FieldSpec
subclass after data has been indexed using it, then try to search or
modify the index. The Schema architecture will always have that
vulnerability.

But we already had the same problem with Analyzers: with KS 0.15, if
you switch up an analyzer between index-time and search-time, you get
garbage.

>> Do you ever need to add fields in the middle of an indexing
>> session or do you know them all up front?
> I know them upfront because they're defined in an xml which is
> parsed, but they never change in the middle of indexing.

OK. I'd like to accommodate people who want to add new fields in the
middle of an indexing session, too. That's hard, but maybe it's
possible.

The big tradeoff is that if fields aren't limited to a known, finite
set, a lot of validation has to be turned off.

For instance, InvIndexer->delete_by_term verifies that the field in
question 1) is known, and 2) is spec'd as indexed. If it isn't
known, you probably misspelled it; if it wasn't indexed, no docs will
be found and no deletions will occur -- and that's something you
probably want to know about.

But if the fact that a Schema doesn't know about a field doesn't mean
anything, then we have to accept silent failure in both those cases.
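
To make the tradeoff concrete, here is a rough sketch of the two checks. This is not KinoSearch code: the %field_is_indexed table, the check_delete_field() helper and its $closed_set flag are hypothetical names used only for illustration.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Hypothetical field table: field name => whether the field is indexed.
my %field_is_indexed = ( title => 1, content => 1, url => 0 );

# Mimics the two checks described above. $closed_set says whether the
# schema's field list is a known, finite set (unknown names are errors)
# or open-ended (unknown names cannot be told apart from typos).
sub check_delete_field {
    my ( $field, $closed_set ) = @_;
    if ( !exists $field_is_indexed{$field} ) {
        croak("Unknown field '$field'") if $closed_set;
        return 0;    # open field set: nothing to verify, so fail silently
    }
    croak("Field '$field' is not indexed") unless $field_is_indexed{$field};
    return 1;
}

# A misspelled name is caught only when the field set is closed.
eval { check_delete_field( 'titel', 1 ) };
print "closed set: $@" if $@;
print "open set: no error\n" if check_delete_field( 'titel', 0 ) == 0;
```

With an open field set the misspelled name simply passes through, which is the silent failure described above.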

> This still leaves me with the problem you can not only specify the
> fields you want to index in our config-xml but also the indexes you
> want to create. So I would also have to define the
> KinoSearch::Schema classes through an eval, but it would at least
> save me another eval for setting up the fields.

Hmm. Technically, you don't need an eval -- you can manipulate @ISA
directly if you turn off strict refs.

    {
        no strict 'refs';
        @{ $class . '::ISA' } = ('KinoSearch::Schema');
    }

I'd consider either that or an eval acceptable if we can't figure out
a better way to handle things, but it's still somewhat inelegant.
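
As a rough, self-contained illustration of the @ISA approach applied to several indexes at once: the index names, the MyApp::Schema:: prefix and the stand-in base class My::SchemaStandIn below are assumptions; a real setup would name KinoSearch::Schema as the parent.

```perl
use strict;
use warnings;

# Stand-in base class; with KinoSearch installed this would be
# KinoSearch::Schema.
package My::SchemaStandIn;
sub new { my $class = shift; return bless {@_}, $class }

package main;

# Hypothetical index names, e.g. as parsed from a config file.
my @index_names = qw( Products Customers );

my %schema_class_for;
for my $index_name (@index_names) {
    die "Bad index name '$index_name'"
        unless $index_name =~ /\A[A-Za-z_]\w*\z/;
    my $class = "MyApp::Schema::$index_name";
    {
        # Symbolic reference to the package's @ISA; no string eval needed.
        no strict 'refs';
        @{ $class . '::ISA' } = ('My::SchemaStandIn');
    }
    $schema_class_for{$index_name} = $class;
}

my $schema = $schema_class_for{Products}->new;
print ref($schema), "\n";    # prints MyApp::Schema::Products
```

Keeping the "no strict 'refs'" inside its own small block limits the relaxed checking to the single assignment that needs it.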

If Schemas were objects rather than classes -- which is something I
considered -- we wouldn't have this problem.

    my $schema = KinoSearch::Schema->new(
        analyzer => KinoSearch::Analysis::Tokenizer->new,
    );
    $schema->spec_field( name => 'title' );
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );

However, I rejected that design because I know if we did that, less
experienced users would copy and paste the schema code between index
and search scripts, violating DRY and leading to a bunch of nasty
errors when conflicts arise because copies get out of sync. Shunting
everyone into module use is less error prone and encourages good
programming practice. However, your particular use case is less well
served.

In Perl, which allows you to create classes on the fly, we can still
pull it off. A less dynamic language might not be able to...

> But maybe you also know of a better solution for the subclassing
> problem for every index.

OK... new thought.... how about allowing instances of your Schema
subclass to add fields?

    # invindexer.plx
    my $schema = MySchema->new;
    $schema->add_field( title   => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( content => 'KinoSearch::Schema::FieldSpec' );
    $schema->add_field( url     => 'UnAnalyzedFieldSpec' );

    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->clobber('/path/to/invindex'),
    );

init_field() would be a class method only. Fields so registered
would serve as the starter set for each instance.

add_field() would be an instance method only. Fields so registered
would only be known to the object it was called upon.

Hmm... add_field() actually solves another problem. It allows us to
record a mapping of field name to FieldSpec class name in
segments_XXX.yaml, then either validate the mapping against the
schema used to open the file, or register new mappings on an instance
without polluting class variables.

You're lucky in that you know all the field names at search time, so
you can create the Schema on the fly then, too. But say you
didn't... having $schema->open('/path/to/invindex') call add_field()
and build your field list for you solves that problem.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


melser at gmx

Feb 27, 2007, 11:00 PM

Post #5 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

Hi Marvin,
> For instance, InvIndexer->delete_by_term verifies that the field in
> question 1) is known, and 2) is spec'd as indexed. If it isn't known,
> you probably misspelled it; if it wasn't indexed, no docs will be found
> and no deletions will occur -- and that's something you probably want to
> know about.
Of course it would be nice to know, but if you used the add_field
method described at the end of your posting, this would be solved too,
since then the index knows about its fields, right? Well, you have to
open the index first, but you have to do this anyway if you want to add
or delete anything.

> Hmm. Technically, you don't need an eval -- you can manipulate @ISA
> directly if you turn off strict refs.
>
> {
> no strict 'refs';
> @{ $class . '::ISA' } = ('KinoSearch::Schema');
> }
>
> I'd consider either that or an eval acceptable if we can't figure out a
> better way to handle things, but it's still somewhat inelegant.
Yeah, for the moment I also see no better solution than eval or
manipulating @ISA directly.

> However, I rejected that design because I know if we did that, less
> experienced users would copy and paste the schema code between index and
> search scripts, violating DRY and leading to a bunch of nasty errors
> when conflicts arise because copies get out of sync. Shunting everyone
> into module use is less error prone and encourages good programming
> practice. However, your particular use case is less well served.
Well, I agree that objects rather than classes would serve my case
better but could be difficult to manage if you have multiple search and
indexer objects around.

> OK... new thought.... how about allowing instances of your Schema
> subclass to add fields?
>
> # invindexer.plx
> my $schema = MySchema->new;
> $schema->add_field( title => 'KinoSearch::Schema::FieldSpec');
> $schema->add_field( content => 'KinoSearch::Schema::FieldSpec');
> $schema->add_field( url => 'UnAnalyzedFieldSpec');
>
> my $invindexer = KinoSearch::InvIndexer->new(
> invindex => $schema->clobber('/path/to/invindex'),
> );
>
> init_field() would be a class method only. Fields so registered would
> serve as the starter set for each instance.
>
> add_field() would be an instance method only. Fields so registered
> would only be known to the object it was called upon.
>
> Hmm... add_field() actually solves another problem. It allows us to
> record a mapping of field name to FieldSpec class name in
> segments_XXX.yaml, then either validate the mapping against the schema
> used to open the file, or register new mappings on an instance without
> polluting class variables.
This design seems very elegant to me and as you say, it solves a lot of
problems.
>
> You're lucky in that you know all the field names at search time, so you
> can create the Schema on the fly then, too. But say you didn't...
> having $schema->open('/path/to/invindex') call add_field() and build
> your field list for you solves that problem.
Of course, if I have the possibility to just open the index and know
about the fields used, I could dump my code which builds the schema
on the fly, which would be great because it would simplify the searcher code.

Best regards,

Marc


marvin at rectangular

Mar 1, 2007, 9:01 AM

Post #6 of 10 (301 views)
Permalink
Dynamic schemas - How? [In reply to]



marvin at rectangular

Mar 1, 2007, 10:53 AM

Post #7 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

(Apologies for the previous empty message, courtesy of an errant
mouse click).

Marc,

Thanks very much for speaking up. It's possible to revise Schema now
with no real penalty beyond the breakage 0.20 introduces.

If anyone else has any suggestions (: or gripes :) about KinoSearch's
API, NOW is the time to make them known.

After your feedback about Schema, I've decided to give it a minor
overhaul.

As currently implemented, init_fields() doesn't do much except
generate a hash and perform some verification. If we move the
verification routines to the constructor, then we can just replace
init_fields() with a required variable, %FIELDS.

    our %FIELDS = (
        title   => 'KinoSearch::Schema::FieldSpec',
        content => 'KinoSearch::Schema::FieldSpec',
        url     => 'UnAnalyzedFieldSpec',
    );

add_field() will work as described earlier -- it will be an instance
method only.

my $schema = MySchema->new;
$schema->add_field( $_ => 'CustomSpec' ) for @dynamic_fields;

I think that's a better API. add_field() and init_field() were
confusingly similar. Now there will be no confusion as to what's
class data and what's instance data.

I've also decided to try to make it possible to call add_field() at
any time during indexing.

    my $schema = MySchema->new;
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex => $schema->open('/path/to/invindex'),
    );
    while ( my $doc = get_doc_hashref_from_somewhere() ) {
        $schema->add_field( $_ => 'CustomSpec' ) for keys %$doc;
        $invindexer->add_doc($doc);
    }

Adding the same field name over and over again with add_field() won't
be an error, unless you try to switch up the FieldSpec subclass it's
associated with -- once a field is associated with a given classname,
it's forever as far as that Schema object and that invindex are concerned.
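
The add_field() semantics described here could look roughly like the following. This is a toy class, not the KinoSearch implementation: My::Schema, its hash layout and the spec class names are made up for the sketch.

```perl
use strict;
use warnings;

package My::Schema;
use Carp qw( croak );

# Class data: the starter set of fields shared by every instance.
our %FIELDS = (
    title => 'My::FieldSpec',
    url   => 'My::UnAnalyzedFieldSpec',
);

sub new {
    my $class = shift;
    # Copy the starter set so that instances never modify class data.
    return bless { fields => {%FIELDS} }, $class;
}

# Instance method: re-adding a field with the same spec class is a no-op,
# while binding it to a different spec class is an error.
sub add_field {
    my ( $self, $name, $spec_class ) = @_;
    my $existing = $self->{fields}{$name};
    if ( defined $existing && $existing ne $spec_class ) {
        croak("Field '$name' is already bound to $existing");
    }
    $self->{fields}{$name} = $spec_class;
    return $self;
}

package main;

my $schema = My::Schema->new;
$schema->add_field( content => 'My::FieldSpec' ) for 1 .. 3;    # repeats are fine
eval { $schema->add_field( content => 'My::OtherSpec' ) };
print "conflict rejected\n" if $@;
```

Because new() copies %FIELDS, fields added to one instance stay out of the class-level hash and out of every other instance.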

This is a substantial change in how KinoSearch thinks about the index
structure, and we'll have to sacrifice some validation here and
there. But the nice thing is that it won't be necessary to add a
DeepFieldSpec class -- people who need to fake one-to-many
relationships will be able to hack that up on their own.

>> For instance, InvIndexer->delete_by_term verifies that the field
>> in question 1) is known, and 2) is spec'd as indexed. If it isn't
>> known, you probably misspelled it; if it wasn't indexed, no docs
>> will be found and no deletions will occur -- and that's something
>> you probably want to know about.
> Of course it would be nice to know, but if you would use the
> add_field method described at the end of the posting, this would be
> solved too since then the index knows about its fields, right?
> Well you have to open the index first, but you have to do this
> anyway if you want to add or delete anything.

The thing is, you might not know whether or not you've added a given
field to your schema. I could add a has_field() instance method to
Schema, but I doubt people will go to the trouble of using it. :) So
we just have to live with the mildly reduced default level of safety.

> Of course, if I have the possibility to just open the index and know
> about the fields used, I could dump my code which builds the schema
> on-the-fly which would be great because it would simplify the
> searcher code.

I'll take care of it. :) New documentation for Schema->open...

=head2 open

    my $invindex = MySchema->open('/path/to/invindex');

Open an existing invindex for either reading or updating. All fields which
have ever been defined for this invindex will be loaded/verified via
add_field().

=cut

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


mcrawfor at u

Mar 2, 2007, 3:25 PM

Post #8 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

I too have been looking into the schemas issue with some dismay.

The idea of class-based schemas seems to me to violate many of the ideals of
OO-perl programming, and all of the solutions presented here have struck me as
closer to hacks than to a successful application of an API.


> Hmm. Technically, you don't need an eval -- you can manipulate @ISA directly
> if you turn off strict refs.
>
> {
> no strict 'refs';
> @{ $class . '::ISA' } = ('KinoSearch::Schema');
> }
> I'd consider either that or an eval acceptable if we can't figure out a
> better way to handle things, but it's still somewhat inelegant.

My feeling is that many large, extensible applications will be working with
dynamic objects, and any API that requires the use of eval'ed code or "no
strict refs" is probably not ideal.


> If Schemas were objects rather than classes -- which is something I
> considered -- we wouldn't have this problem.
>
> my $schema = KinoSearch::Schema->new(
> analyzer => KinoSearch::Analysis::Tokenizer->new,
> );
> $schema->spec_field( name => 'title' );
> my $invindexer = KinoSearch::InvIndexer->new(
> invindex => $schema->clobber('/path/to/invindex'),
> );

This solution seems very sane and useful to me! It's clean, easy to document
and easy to use either in a static or dynamic context.

> However, I rejected that design because I know if we did that, less
> experienced users would copy and paste the schema code between index and
> search scripts, violating DRY and leading to a bunch of nasty errors when
> conflicts arise because copies get out of sync. Shunting everyone into
> module use is less error prone and encourages good programming practice.
> However, your particular use case is less well served.

This does not strike me as a particularly reasonable objection. An
inexperienced user is just as likely to copy-paste some complex looped-over
eval or a class with a %FIELDS hash as they are an object oriented approach.
Perl package black magic sets the bar even higher for inexperienced folk.

In short, I love the Schema concept, but please, do not tie it up in classes -
90% of the customization that people want to do will be easy to add via
instance methods on objects, and for the other 10% subclassing could certainly
still be optional.

Thanks for letting us take a look at .20_01 and all your work on KinoSearch.

-Miles
__________________________________
Miles Crawford, Software Developer
Catalyst Research & Development
Office of Learning & Scholarly Technologies
University of Washington
206.616.3406

http://catalyst.washington.edu
http://solstice.eplt.washington.edu


marvin at rectangular

Mar 2, 2007, 4:50 PM

Post #9 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

On Mar 2, 2007, at 2:49 PM, Miles Crawford wrote:

> My feeling is that many large, extensible applications will be
> working with dynamic objects,

Well, I've just accommodated them. :)

As of repository revision 2107, you can now add fields to a schema
instance on the fly in the middle of indexing.

> and any API that requires the use of eval'ed code or "no strict
> refs" is probably not ideal.

Presto! Gone. No longer necessary.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


mcrawfor at u

Mar 2, 2007, 5:09 PM

Post #10 of 10 (300 views)
Permalink
Dynamic schemas - How? [In reply to]

Amazing! It seemed that the thread was leading in that direction, and I'm
glad I could help it keep rolling along.

Thank you again for KinoSearch,

-Miles



On Fri, 2 Mar 2007, Marvin Humphrey wrote:

>
> On Mar 2, 2007, at 2:49 PM, Miles Crawford wrote:
>
>> My feeling is that many large, extensible applications will be working with
>> dynamic objects,
>
> Well, I've just accommodated them. :)
>
> As of repository revision 2107, you can now add fields to a schema instance
> on the fly in the middle of indexing.
>
>> and any API that requires the use of eval'ed code or "no strict refs" is
>> probably not ideal.
>
> Presto! Gone. No longer necessary.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
