
marvin at rectangular
Mar 15, 2007, 3:45 PM
Post #3 of 21
On Mar 15, 2007, at 9:21 AM, Roger Dooley wrote:

> I've just started working with the devel release and have modified
> my indexer for 0.15 to the new model. The document set is rather
> large (+.5 million) and indexing this took many hours with the 0.15
> release. However, with 0.20, I haven't been able to index the files
> as the indexing seems to be taking days and I end up killing the
> process and looking at the code again.

At least some of the slowdown is a side effect of UTF-8 compatibility in 0.20. Tokenizer is a major offender, and the bottleneck is Perl's UTF-8 character class regex implementation.

I'm a little surprised by the scale, though. According to my benchmarking tests, we'd taken about a 35% hit, going from around 3.1 seconds under 0.15 to around 4.2 seconds for 0.20. We actually lost a lot more than that with the transition to UTF-8, but I've continued to make strides optimizing the engine -- if you take Tokenizer out of the loop, and use a purpose-built C tokenizer instead (the ASCIIWhiteSpaceTokenizer in devel/benchmarks/BenchMarkingIndexer.pm), 0.20 is actually 30% *faster* than 0.15, at 1.82 secs vs 2.62 secs.

However, my benchmarker script only uses a Tokenizer. If your analyzer incorporates a Stemmer or a Stopalizer, there may be additional drags I hadn't been measuring. Stemmer seems like a more likely culprit, since that's changed to UTF-8 and I don't know how UTF-8 Snowball performs in comparison to Latin-1 Snowball. Stopalizer is also a possibility, but I'm not sure that hash lookups are slower under UTF-8 -- I wouldn't think so. LCNormalizer is almost certainly slower, but I wouldn't guess it would affect things too much since it only hits the string once.

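To make "purpose-built C tokenizer" concrete, here is a minimal sketch of a scanner that splits on ASCII whitespace and records byte offsets. It is an illustration only, not the ASCIIWhiteSpaceTokenizer from devel/benchmarks/BenchMarkingIndexer.pm; the names TokenSpan, is_ascii_space, and tokenize_ascii_ws are hypothetical, and the approach (a hand-rolled byte scan with no regex engine involved) is an assumption about what such a tokenizer might look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token record: byte offsets into the source string. */
typedef struct {
    size_t start; /* offset of the first byte of the token */
    size_t end;   /* offset one past the last byte of the token */
} TokenSpan;

/* ASCII-only whitespace test; deliberately not locale- or UTF-8-aware. */
int is_ascii_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}

/* Scan `text` (`len` bytes) and store up to `max_spans` token spans.
 * Returns the number of spans stored. */
size_t tokenize_ascii_ws(const char *text, size_t len,
                         TokenSpan *spans, size_t max_spans) {
    size_t count = 0;
    size_t i = 0;

    assert(text != NULL && spans != NULL);

    while (i < len && count < max_spans) {
        /* Skip the whitespace run before the next token. */
        while (i < len && is_ascii_space((unsigned char)text[i])) {
            i++;
        }
        if (i >= len) {
            break;
        }
        /* Consume the token itself. */
        spans[count].start = i;
        while (i < len && !is_ascii_space((unsigned char)text[i])) {
            i++;
        }
        spans[count].end = i;
        count++;
    }
    return count;
}
```

Because every offset here is a plain byte index, nothing needs to be converted to or from character counts, which is the main reason a scanner like this can be cheap on ASCII input.
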
Here are some stats originally compiled for a post I made to the Perl 5 Porters list: <http://www.nntp.perl.org/group/perl.perl5.porters/2007/02/msg121014.html>

==================================================================
Mean time to index 1000 ASCII news articles
------------------------------------------------------------------
tokenizer          5.8.6 (thr)    5.8.8 (no thr)    blead (no thr)
------------------------------------------------------------------
UTF-8 regex        4.18 secs      3.72 secs         3.80 secs
Latin-1 regex      2.84 secs      2.50 secs         2.60 secs
Purpose-built C    1.82 secs      1.60 secs         1.64 secs

It turns out that Perl's current UTF-8 char-class implementation is sub-optimal. Yves Orton (a.k.a. demerphq) and I have had some preliminary discussions about how to go about improving it. Yves has actually made the regex engine pluggable in blead; what may happen eventually is that after 5.10 comes out I'll hack up a slightly tweaked version of the regex engine which (only) Tokenizer will use.

I'd actually love to go in and hack on Perl's regex engine right now, and the work to implement char classes in terms of "inversion lists" probably isn't insane (bwa ha ha). However, I haven't done so because 1) I'd have to invest some time to come up to speed on the gory details of the regex engine, and 2) KinoSearch's indexing performance has been good enough up till now that it's been more important to work on other features.

It may be time to make another stab at moving the Tokenizer loop to C.

    while (/$token_re/g) {
        push @starts, $-[0];
        push @ends,   $+[0];
    }

The first time I tried that preceded Yves' exposing and documenting of the regex engine API: <http://search.cpan.org/~rgarcia/perl-5.9.4/pod/perlreguts.pod>. With the aid of the new docs, I can probably figure things out for blead, then backport for 5.8.x.

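For anyone unfamiliar with the "inversion lists" mentioned above: an inversion list stores a character class as a sorted array of boundaries at which membership flips, so a lookup is a binary search followed by a parity check. The sketch below is an illustration of that general idea only, not code from Perl's regex engine; the names invlist_contains and ascii_word are made up for the example, and the sample class covers just ASCII [0-9A-Za-z_].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if code point `cp` belongs to the set described by `inv`,
 * an ascending array of `n` boundaries, else 0.
 * Ranges are half-open: [inv[0], inv[1]) is in the set,
 * [inv[1], inv[2]) is out, [inv[2], inv[3]) is in, and so on. */
int invlist_contains(const uint32_t *inv, size_t n, uint32_t cp) {
    size_t lo = 0;
    size_t hi = n;

    assert(inv != NULL || n == 0);

    /* Binary search for the number of boundaries <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (inv[mid] <= cp) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    /* An odd count means cp sits inside an "in" range. */
    return (int)(lo & 1);
}

/* Illustrative class: '0'-'9', 'A'-'Z', '_', 'a'-'z'. */
const uint32_t ascii_word[] = {
    0x30, 0x3A, /* '0' .. '9' */
    0x41, 0x5B, /* 'A' .. 'Z' */
    0x5F, 0x60, /* '_'        */
    0x61, 0x7B  /* 'a' .. 'z' */
};
```

The appeal for large character sets is that lookup cost grows with the logarithm of the number of ranges, not with the size of the code point space, and the same array layout works whether the class has eight boundaries or several hundred.
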
There are significant inefficiencies in how @- and @+ are retrieved under UTF-8 -- they calculate UTF-8 length every time -- and that's damned inefficient if you're doing it for every token. (This happens in the function Perl_magic_regdatum_get() in mg.c.) If I can run the loop in C, I can get at the original numbers from the regex engine struct and avoid that.

If you don't want to wait for me to complete this work and you have Inline C skillz, you might try carving up your own Tokenizer based on ASCIIWhiteSpaceTokenizer.

Otherwise, if you (or anybody else) wants to help me out, I could use some benchmarking numbers with various configs. Time I spend doing the benchmarking (which other people can do) is time I don't spend rooting around in the scariest crags of Perl and KinoSearch C code (which not many other people are going to be able to do). Different Analyzers would be very helpful. So would long vs. short source strings.

Hope this long-winded reply helps you -- composing it helped me.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/