
bugzilla-daemon at bugzilla
Oct 29, 2009, 12:41 AM
[Bug 6229] New: TextCat is too case sensitive
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6229

           Summary: TextCat is too case sensitive
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Plugins
        AssignedTo: dev [at] spamassassin
        ReportedBy: hege [at] hege

Created an attachment (id=4562)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4562)
TextCat problem sample

It seems the languages database is case sensitive. For example, all-uppercase
English spam gets very wonky results.

I have no idea what the best way to fix this would be. I'm using a quick fix
like this to get better results:

--- TextCat.pm.orig 2009-10-29 09:23:46.985152046 +0200
+++ TextCat.pm 2009-10-29 09:24:38.339651987 +0200
@@ -440,6 +440,7 @@
   # my $non_word_characters = qr/[0-9\s]/;
   for my $word (split(/[0-9\s]+/, ${$_[0]})) {
+    $word =~ tr/A-ZÖÄÅ/a-zöäå/ if $word =~ /[a-zA-ZöäåÖÄÅ]{4}/;
     $word = "\000" . $word . "\000";
     my $len = length($word);
     my $flen = $len;

Attached is a sample message. Running it with textcat_max_languages 20 gives
us:

ja.iso-2022-jp de zh.big5 sk.windows-1250 id sk.us-ascii cs.iso-8859-2 ca da vi sw ms tl ne pl

Running it with my fix gives the expected single "en".

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
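
For readers who want to see why case matters here, the following is a minimal Python sketch of the effect described in the report. It is not the TextCat.pm code: the function names `ngrams` and `overlap`, the tiny lowercase "profile" text, and the sample spam string are all made up for illustration. It assumes only what the patch context shows, namely that words are split on whitespace and padded with NUL bytes before character n-grams are counted. A profile built from lowercase text shares almost no n-grams with an all-uppercase message unless the message is case-folded first.

```python
from collections import Counter


def ngrams(text, n_max=3):
    """Count character n-grams (lengths 1..n_max) per whitespace-separated word.

    Each word is padded with NUL characters, in the spirit of the
    "\\000" . $word . "\\000" line visible in the patch context.
    """
    counts = Counter()
    for word in text.split():
        padded = "\0" + word + "\0"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def overlap(profile, text, fold_case=False):
    """Fraction of the text's distinct n-grams that also occur in the profile."""
    if fold_case:
        text = text.lower()  # str.lower() is Unicode-aware, so it also maps ÖÄÅ to öäå
    grams = set(ngrams(text))
    return len(grams & set(profile)) / len(grams)


# Hypothetical stand-in for a language profile trained on lowercase English text.
profile = ngrams("the quick brown fox jumps over the lazy dog and buys cheap pills online")

# Hypothetical all-uppercase message.
spam = "BUY CHEAP PILLS ONLINE"

raw = overlap(profile, spam)                      # compared as-is
folded = overlap(profile, spam, fold_case=True)   # case-folded before comparison

print(f"overlap without case folding: {raw:.2f}")
print(f"overlap with case folding:    {folded:.2f}")
```

Without folding, the only n-gram the uppercase message shares with the lowercase profile is the NUL padding character itself, which would be consistent with the long list of unrelated languages reported above. Whether lowercasing belongs in the plugin's word loop or in the profile data is a separate design question that this sketch does not address.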