Gossamer Forum

Internal indexing / _tokenize

Hi,

While looking into how to index our site I came across the _tokenize function.
As far as I can see, words like white-wine are not split. That means a search for wine cannot find them, because wine is never indexed on its own.
I cannot see a reason for this at the moment and would be interested in feedback. It would also be interesting to know whether the internal indexing process can be changed or overridden with a plugin.

Code:
sub _tokenize {
#--------------------------------------------------------------------------------
# takes a string and chops it up into little bits
    my $self = shift;
    my $text = shift;
    my ( @words, $i, %rejected, $word, $code );

    # split on any non-word (includes accents) characters
    @words = split /[^\w\x80-\xFF\-]+/, lc $text;
    $self->debug_dumper( "Words: ", \@words ) if ($self->{_debug});

    # drop all words that are too small, etc.
    $i = 0;
    while ( $i <= $#words ) {
        $word = $words[ $i ];
        if ((exists $self->{stopwords}{$word} and ($code = 'STOPWORD')) or
            (length($word) < $self->{min_word_size} and $code = 'TOOSMALL' ) or
            (length($word) > $self->{max_word_size} and $code = 'TOOBIG')) {
            splice( @words, $i, 1 );
            $rejected{$word} = $self->{'rejections'}->{$code};
        }
        else {
            $i++; # Words ok.
        }
    }
    $self->debug_dumper( "Accepted Words: ", \@words ) if ($self->{_debug});
    $self->debug_dumper( "Rejected Words: ", \%rejected ) if ($self->{_debug});

    return ( \@words, \%rejected );
}
from /cgi-bin/admin/GT/SQL/Search/Base/Common.pm
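The hyphen survives because the character class in the split pattern ends with \-, so a hyphen counts as part of a word rather than as a separator. A minimal standalone sketch of that effect follows. The sample sentence and the variable names are made up for illustration, and the second pattern (without \-) is only there for comparison, not something found in Common.pm.

```perl
use strict;
use warnings;

# Hypothetical sample text, lower-cased the same way _tokenize does it.
my $text = lc 'A dry white-wine, please';

# Pattern copied from _tokenize: the trailing \- keeps hyphens inside a token.
my @with_hyphen = grep { length } split /[^\w\x80-\xFF\-]+/, $text;

# Comparison only: the same class without \- treats the hyphen as a separator.
my @without_hyphen = grep { length } split /[^\w\x80-\xFF]+/, $text;

print join( '|', @with_hyphen ),    "\n";   # prints: a|dry|white-wine|please
print join( '|', @without_hyphen ), "\n";   # prints: a|dry|white|wine|please
```

Simply dropping \- from the class would index white and wine but no longer white-wine, which is a different trade-off from keeping all three.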

Thanks

Niko

Last edited by el noe: Jul 30, 2009, 2:28 AM
Re: [el noe] Internal indexing / _tokenize
Hi,

Although I am not really happy with this solution, it might be useful to someone. I wanted white-wine, white and wine all to be searchable with INTERNAL indexing, so I modified /cgi-bin/admin/GT/SQL/Search/Base/Common.pm

after:
Code:
# drop all words that are too small, etc.
I added:
Code:
# add the parts of hyphenated words as additional words
$i = 0;
while ( $i <= $#words ) {
    $word = $words[ $i ];
    $i++;
    next unless $word =~ /-/;
    push @words, split /-/, $word;
}
# /add the parts of hyphenated words
I don't know whether this has unwanted side effects in routines other than search, so use at your own risk.
Maybe someday somebody from GT can make something out of it.
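
For anyone who wants to try the idea outside of Common.pm first, here is a minimal standalone sketch of the same expansion. The helper name expand_hyphenated is hypothetical and not part of GT::SQL::Search. Unlike the patch above, it puts the parts directly after the hyphenated word and drops duplicates and empty strings. Whether the indexer depends on word order or on repeated words is an assumption that has not been checked here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GT::SQL::Search): returns every word,
# followed by the parts of each hyphenated word, skipping empty strings
# and anything that was already returned.
sub expand_hyphenated {
    my @words = @_;
    my ( @result, %seen );
    for my $word (@words) {
        my @candidates = ($word);
        push @candidates, split /-/, $word if $word =~ /-/;
        for my $candidate (@candidates) {
            next if $candidate eq '' or $seen{$candidate}++;
            push @result, $candidate;
        }
    }
    return @result;
}

my @expanded = expand_hyphenated( qw(dry white-wine wine) );
print join( ' ', @expanded ), "\n";   # prints: dry white-wine white wine
```

In _tokenize itself the expansion would still have to run before the stopword and length checks, so that the added parts are filtered like any other word.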

Regards

Niko