rwmaillists at googlemail
Jul 24, 2012, 6:05 PM
Re: spamassassin bayesian training on foreign characters
On Tue, 24 Jul 2012 08:36:53 -0400
David F. Skoll wrote:
> On Tue, 24 Jul 2012 09:41:19 +0200
> Simon Loewenthal <simon [at] klunky> wrote:
> > I have Bayes correctly scoring BAYES_99 on Dutch and French
> > straight out of the box. No problems.
Dutch, French, etc. are very similar to English, with most characters
being compatible with ASCII.
> It does work, but with a caveat: SpamAssassin does not normalize the
> character set. So if you train it on Chinese in the GB2312 character
> set, that will do nothing for you if you receive UTF-8 Chinese spam.
> Furthermore, if some random character set A and another random
> character set B share byte sequences, your Bayes training may confuse
> Also, I don't believe SpamAssassin has any type of logic for
> recognizing word boundaries in ideographic character sets vs.
> alphabetic ones.
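To make the character set point concrete, here is a minimal Python sketch.
It is not SpamAssassin code; byte_tokens is a made-up helper that assumes a
tokenizer which splits on ASCII whitespace and treats everything else as
opaque bytes. The same two Chinese words share no tokens once one copy is
GB2312 and the other is UTF-8:

```python
def byte_tokens(raw: bytes) -> set:
    """Hypothetical tokenizer: split on ASCII whitespace, keep raw bytes."""
    return set(raw.split())

# Two common spam words ("free", "invoice"), chosen only for illustration.
message = "免费 发票"

gb_tokens = byte_tokens(message.encode("gb2312"))   # 2 bytes per character
utf8_tokens = byte_tokens(message.encode("utf-8"))  # 3 bytes per character

# Tokens learned from one encoding never match the other.
shared = gb_tokens & utf8_tokens
print(sorted(gb_tokens))
print(sorted(utf8_tokens))
print("shared tokens:", shared)
```

Since every token differs in length between the two encodings, the
intersection is empty and training on one does nothing for the other.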
There's also a problem with non-Roman alphabets represented with
multibyte characters, whereby the maximum token length (15 bytes) is
hit on relatively short words. There is some attempt to work around
this by converting such tokens into byte pairs.
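A rough Python sketch of the effect, not SpamAssassin's actual Perl code:
split_long_token is a hypothetical helper, and the only figure taken from
above is the limit of 15, assumed here to be counted in bytes. An
eight-letter Russian word is already over the limit in UTF-8, while its
English equivalent is nowhere near it:

```python
MAX_TOKEN_LENGTH = 15  # assumed to be counted in bytes, not characters

def split_long_token(token: bytes) -> list:
    """Hypothetical workaround: keep short tokens, cut long ones into byte pairs."""
    if len(token) <= MAX_TOKEN_LENGTH:
        return [token]
    return [token[i:i + 2] for i in range(0, len(token), 2)]

english = "document".encode("utf-8")   # 8 letters, 1 byte each
russian = "документ".encode("utf-8")   # 8 letters, 2 bytes each

english_tokens = split_long_token(english)
russian_tokens = split_long_token(russian)

print(len(english), english_tokens)
print(len(russian), len(russian_tokens))
```

The Russian word ends up as eight two-byte fragments, each of which is one
letter here, so the word itself is lost as a token.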
> Bayes is pretty robust, so it "works" in the face of a lot of noise,
> but SA's implementation still leaves quite a bit to be desired.
In most spams aimed at English speakers, spammers avoid leaving any
useful tokens in the text and Bayes still works with headers and