
marvin at rectangular
Jul 14, 2008, 10:07 PM
Post #2 of 3
On Jul 14, 2008, at 6:18 AM, Riyaad Miller wrote:

> I'm using KS 0.162. When using the following code, the error below
> is produced:
>
> My Definitions
> my $stemmer = KinoSearch::Analysis::Stemmer->new( language =>
> 'en' );
> my $stopalizer = KinoSearch::Analysis::Stopalizer->new(language =>
> 'en');
> my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(analyzers
> => [$stemmer, $stopalizer]);
>
> The Error
> Maximum token length is 65535; got 107462

You have a PolyAnalyzer which contains a Stemmer and a Stopalizer, but
not a Tokenizer. Thus, the entire field value, all 107462 characters of
it, is the only token.

Theoretically, if KS had completed indexing successfully rather than
choked on that value, and at search-time someone were to type in the
appropriate 100,000+ character search string, you might get a hit.

Whatever those 107462 characters are, I can guarantee you that nothing
that long exists in the English stop list. Similarly, I doubt the
Stemmer has anything useful to say about the last few characters of
that field.

You really need a Tokenizer. You probably also want an LCNormalizer in
there unless you really want searches to be case sensitive.

    my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new;
    my $tokenizer     = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer       = KinoSearch::Analysis::Stemmer->new(
        language => 'en',
    );
    my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
        language => 'en',
    );
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
    );

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
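
To make the point about the missing Tokenizer concrete, here is a minimal sketch in plain Perl. It does not use KinoSearch at all: the subs `lc_normalize`, `tokenize`, `remove_stops`, and `analyze`, along with the four-word stop list, are hypothetical stand-ins invented for illustration, and stemming is left out. It only shows why a chain with no tokenizing stage hands the whole field value to the stop filter as a single token.

```perl
use strict;
use warnings;

# Toy stand-ins for analysis stages. These are NOT the KinoSearch classes;
# the names and the tiny stop list are assumptions made for this sketch.
my %stoplist = map { $_ => 1 } qw(the a of and);

sub lc_normalize { return map { lc $_ } @_ }                 # lowercase each token
sub tokenize     { return map { /(\w+)/g } @_ }              # split into word tokens
sub remove_stops { return grep { !$stoplist{$_} } @_ }       # drop stop words

# Run the text through each stage in order, like a chain of analyzers.
sub analyze {
    my ( $stages, $text ) = @_;
    my @tokens = ($text);    # the whole field value starts out as one token
    for my $stage (@$stages) {
        @tokens = $stage->(@tokens);
    }
    return @tokens;
}

my $text = "The Tokenizer splits the field";

# No tokenizing stage: the stop filter sees one long token and matches nothing.
my @without = analyze( [ \&remove_stops ], $text );

# Lowercase, tokenize, then filter: the stop filter sees individual words.
my @with = analyze( [ \&lc_normalize, \&tokenize, \&remove_stops ], $text );

print scalar(@without), " token(s) without a tokenizer: @without\n";
print scalar(@with),    " token(s) with a tokenizer: @with\n";
```

Without the tokenizing stage the sentence comes back untouched as one token; with it, the stop filter removes both occurrences of "the" and three tokens remain.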