Mailing List Archive: SpamAssassin: users

spamassassin bayesian training on foreign characters

 

 



david.kentwood at gmail

Jul 23, 2012, 3:39 PM

Post #1 of 6
spamassassin bayesian training on foreign characters

Hello,

I get a lot of foreign-language spam (e.g. Chinese, Russian, etc.) and am thinking of
training SpamAssassin to identify it. My questions are:

1) Can a stock install of SpamAssassin recognize foreign characters without
special configuration?

2) How well does Bayesian training work on foreign-language spam?

Thanks for any advice on this matter.

Dave


jhardin at impsec

Jul 23, 2012, 5:31 PM

Post #2 of 6
Re: spamassassin bayesian training on foreign characters

On Mon, 23 Jul 2012, David Kentwood wrote:

> Hello,
>
> I get a lot of foreign spams (eg. chinese, russian, etc) and am thinking of
> training spamassassin to identify such spams. My questions are:
>
> 1) can a stock install of spamassassin recognize foreign characters without
> special configurations?

Yes.

> 2) how well does Bayesian training work on foreign spams?

Quite well here. I have trained it on Chinese, Portuguese and Spanish, and
it always hits BAYES_99 on such mail.

> Thanks for any advice on this matter.

There shouldn't be anything special about the language w/r/t bayes.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Gun Control laws cannot reduce violent crime, because gun control
laws focus obsessively on a tool a criminal might use to commit a
crime rather than the criminal himself and his act of violence.
-----------------------------------------------------------------------
13 days until the rover Curiosity lands on Mars


simon at klunky

Jul 24, 2012, 12:41 AM

Post #3 of 6
Re: spamassassin bayesian training on foreign characters

Hi

I have Bayes correctly scoring BAYES_99 on Dutch and French straight out of the box. No problems.
--
Dogs are tough.
I've been interrogating this one for hours and he still won't tell me who's a good boy.
simon [at] klunk / .co.uk / .org



david.kentwood at gmail

Jul 24, 2012, 1:05 AM

Post #4 of 6
Re: spamassassin bayesian training on foreign characters

Thanks for the replies. It's good to have some confirmation!



dfs at roaringpenguin

Jul 24, 2012, 5:36 AM

Post #5 of 6
Re: spamassassin bayesian training on foreign characters

On Tue, 24 Jul 2012 09:41:19 +0200
Simon Loewenthal <simon [at] klunky> wrote:

> I have Bayes correctly scoring BAYES_99 on Dutch and French
> straight out of the box. No problems.

It does work, but with a caveat: SpamAssassin does not normalize the
character set. So if you train it on Chinese in the GB2312 character
set, that will do nothing for you if you receive UTF-8 Chinese spam.
Furthermore, if some random character set A and another random
character set B share byte sequences, your Bayes training may confuse
them.
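
To illustrate the character-set point, here is a minimal Python sketch. It is not SpamAssassin code; the sample phrase and the `normalize` helper are assumptions made up for the example. It shows that the same Chinese text produces different bytes in GB2312 and UTF-8, that decoding to Unicode first makes the two agree, and that one byte sequence can mean different things under two character sets.

```python
# Illustrative only: not SpamAssassin code. The sample phrase is an assumption.
phrase = "免费发票"

gb_bytes = phrase.encode("gb2312")   # 2 bytes per character here
utf8_bytes = phrase.encode("utf-8")  # 3 bytes per character here

# The raw byte strings differ, so byte-level tokens learned from one
# encoding never match tokens taken from the other.
print(gb_bytes == utf8_bytes)  # False


def normalize(raw, charset):
    """Hypothetical helper: decode raw bytes to Unicode using the declared charset."""
    return raw.decode(charset, errors="replace")


# After decoding, both messages yield identical text.
same_after_decoding = normalize(gb_bytes, "gb2312") == normalize(utf8_bytes, "utf-8")

# The same two bytes read differently under two character sets.
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")      # one character: e with acute accent
as_latin1 = ambiguous.decode("latin-1")  # two unrelated characters
```

A filter that tokenizes the decoded text rather than the raw bytes would see one set of tokens for both messages.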

Also, I don't believe SpamAssassin has any type of logic for recognizing
word boundaries in ideographic character sets vs. alphabetic ones.
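
A small Python sketch of why word boundaries matter. Splitting on whitespace works for English but returns a whole Chinese sentence as a single token. Overlapping character bigrams are one common workaround for text without spaces; that technique is swapped in here purely for illustration and is not something the thread says SpamAssassin does. Function names and sample strings are assumptions.

```python
# Illustrative only: not SpamAssassin code.

def whitespace_tokens(text):
    """Split on whitespace, as a simple alphabetic-language tokenizer would."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "cheap watches for sale"
chinese = "便宜手表出售"  # sample phrase, written without spaces

print(whitespace_tokens(english))  # four separate words
print(whitespace_tokens(chinese))  # the whole phrase comes back as one token
print(char_bigrams(chinese))       # five overlapping two-character tokens
```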

Bayes is pretty robust, so it "works" in the face of a lot of noise, but
SA's implementation still leaves quite a bit to be desired.

Regards,

David.


rwmaillists at googlemail

Jul 24, 2012, 6:05 PM

Post #6 of 6
Re: spamassassin bayesian training on foreign characters

On Tue, 24 Jul 2012 08:36:53 -0400
David F. Skoll wrote:

> On Tue, 24 Jul 2012 09:41:19 +0200
> Simon Loewenthal <simon [at] klunky> wrote:
>
> > I have Bayes correctly scoring BAYES_99 on Dutch and French
> > straight out of the box. No problems.

Dutch, French, etc. are very similar to English, with most characters being
compatible with ASCII.

> It does work, but with a caveat: SpamAssassin does not normalize the
> character set. So if you train it on Chinese in the GB2312 character
> set, that will do nothing for you if you receive UTF-8 Chinese spam.
> Furthermore, if some random character set A and another random
> character set B share byte sequences, your Bayes training may confuse
> them.
>
> Also, I don't believe SpamAssassin has any type of logic for
> recognizing word boundaries in ideographic character sets vs.
> alphabetic ones.

There's also a problem with non-Roman alphabets represented with
multibyte characters: the maximum token length (15) is hit on
relatively short words. There is some attempt to work around this by
converting such tokens into byte pairs.
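
A rough Python sketch of the length problem, assuming the limit of 15 is counted in bytes and the text is UTF-8. The constant, the `byte_pairs` helper and the sample word are illustrative assumptions, not SpamAssassin's actual code.

```python
# Illustrative only: not SpamAssassin code.
MAX_TOKEN_LENGTH = 15  # the limit quoted above, assumed here to be in bytes


def byte_pairs(token):
    """Split a byte string into consecutive two-byte chunks."""
    return [token[i:i + 2] for i in range(0, len(token), 2)]


word = "бесплатно"          # a nine-letter Russian word (sample, an assumption)
raw = word.encode("utf-8")  # each Cyrillic letter takes two bytes in UTF-8

print(len(word), len(raw))  # 9 characters, 18 bytes
too_long = len(raw) > MAX_TOKEN_LENGTH

# Falling back to byte pairs yields one chunk per Cyrillic letter here,
# because every letter in this word is exactly two bytes wide.
pairs = byte_pairs(raw) if too_long else [raw]

ascii_raw = "free".encode("utf-8")  # 4 bytes, well under the limit
```

A nine-letter word is ordinary in Russian, yet it already exceeds the limit, whereas an ASCII word would need sixteen letters to do so.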


> Bayes is pretty robust, so it "works" in the face of a lot of noise,
> but SA's implementation still leaves quite a bit to be desired.

In most spam aimed at English speakers, spammers avoid leaving any
useful tokens in the text, yet Bayes still works from headers and
mark-up.
