
Mailing List Archive: SpamAssassin: users

Understanding SpamAssassin

 

 



abhinav.pathak at gmail

Sep 21, 2009, 12:05 PM

Post #1 of 11 (762 views)
Permalink
Understanding SpamAssassin

I am trying to understand the inner workings of SpamAssassin, and it would be
great if someone could answer my questions. I have read the online
documentation, but some questions remain unanswered or I am unsure about them.

As far as I understand, the default configuration of SpamAssassin processes
emails in this order:

DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL

[Is the sequence right? I know for sure AWL comes last; what about the order
of Bayesian learning and the raw body tests? Did I miss any module?]

Why do we need Bayesian learning when we already have raw body tests?

Mails with a very high or very low score are fed to Bayesian learning.
Since we are already confident that they are ham or spam, what do we want to
learn from them? If the regex filters have already identified a mail as spam
(say), what more does Bayesian learning achieve? Does it learn the other words
in the spam mail (say, the words surrounding an obfuscated term) in the hope
of matching them in future emails? Or am I misunderstanding it completely?

Thanks for the help.


Bowie_Bailey at BUC

Sep 21, 2009, 1:08 PM

Post #2 of 11 (717 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

poifgh wrote:
> I am trying to understand the inner workings of SpamAssassin, and it would be
> great if someone could answer my questions. I have read the online
> documentation, but some questions remain unanswered or I am unsure about them.
>

I'm not an expert, just a long-time user, but I can give you some basic
answers.

> As far as I understand, the default configuration of spamassassin processes
> emails in this fashion
>
> DNSBL Tests ---> RAW Body Tests ---> Bayesian Learning --> AWL
>
> [Is the sequence right? I know for sure AWL comes last; what about the order
> of Bayesian learning and the raw body tests? Did I miss any module?]
>

As I understand it, quite a bit of this is done in parallel. In
particular, the DNS based tests are fired off first and then other tests
are run while waiting for the response.

In any case, unless you are playing with the shortcut features, all
rules are run for every message, so does it really matter what order
they are in?
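
The "fire off the DNS tests first, then run other tests while waiting" idea can be sketched roughly as below. This is only an illustration in Python, not SpamAssassin's code: the DNS lookup is simulated with a sleep, and `fake_dnsbl_lookup`, `run_local_rules` and all rule names are made up for the example.

```python
import asyncio
import re


async def fake_dnsbl_lookup(name: str, delay: float, listed: bool):
    """Stand-in for a DNSBL query: it only sleeps, no real DNS is used."""
    await asyncio.sleep(delay)
    return name, listed


def run_local_rules(body: str) -> list:
    """Hypothetical local regex rules that need no network access."""
    rules = {"LOCAL_CHEAP": r"\bcheap\b", "LOCAL_BUY": r"\bbuy\b"}
    return [name for name, pat in rules.items() if re.search(pat, body, re.I)]


async def scan(body: str) -> list:
    # 1. Start the slow lookups first.
    pending = [
        asyncio.create_task(fake_dnsbl_lookup("DNSBL_A", 0.05, True)),
        asyncio.create_task(fake_dnsbl_lookup("DNSBL_B", 0.02, False)),
    ]
    await asyncio.sleep(0)  # yield once so the lookups actually start
    # 2. Run the local tests while the lookups are in flight.
    hits = run_local_rules(body)
    # 3. Collect the DNS answers at the end.
    for name, listed in await asyncio.gather(*pending):
        if listed:
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(asyncio.run(scan("Please buy medicine for cheap")))
```

Every rule still contributes to the result; only the waiting time overlaps, which is why the order matters little for the final score.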

> Why do we need Bayesian learning when we already have raw body tests?
>
> Mails with a very high or very low score are fed to Bayesian learning.
> Since we are already confident that they are ham or spam, what do we want to
> learn from them? If the regex filters have already identified a mail as spam
> (say), what more does Bayesian learning achieve? Does it learn the other words
> in the spam mail (say, the words surrounding an obfuscated term) in the hope
> of matching them in future emails? Or am I misunderstanding it completely?
>

For auto-learning, the high and low scoring messages are fed to Bayes.
However, for an optimal setup, you should manually train Bayes on as
much of your (verified) ham and spam as possible. The more of your mail
stream Bayes sees, the better the results will be.

Your description of Bayes is pretty close. It breaks down the message
into "tokens" (words and character sequences) and then keeps track of
how likely each of those tokens is to appear in either a ham or spam
message. When a new message comes in, Bayes breaks it into tokens and
then scores it depending on which tokens were found in the message.
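
As a rough illustration of the token bookkeeping described above, here is a toy sketch in Python. It is not SpamAssassin's Bayes implementation: the `TokenStats` class, its tokenizer and the probability formula are simplified assumptions made for this example.

```python
import re
from collections import Counter


class TokenStats:
    """Toy per-token counts; not SpamAssassin's actual Bayes store."""

    def __init__(self) -> None:
        self.spam_tokens = Counter()  # token -> number of spam mails containing it
        self.ham_tokens = Counter()   # token -> number of ham mails containing it
        self.n_spam = 0
        self.n_ham = 0

    @staticmethod
    def tokenize(text: str) -> set:
        # Lower-cased word-like runs; a real tokenizer is far richer.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def learn(self, text: str, is_spam: bool) -> None:
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def spam_probability(self, token: str) -> float:
        """Share of the token's class-normalised occurrences that were spam."""
        s = self.spam_tokens[token] / max(self.n_spam, 1)
        h = self.ham_tokens[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen: neutral
        return s / (s + h)


if __name__ == "__main__":
    stats = TokenStats()
    stats.learn("buy cheap pills now", is_spam=True)
    stats.learn("cheap watches buy now", is_spam=True)
    stats.learn("meeting notes for the project", is_spam=False)
    stats.learn("buy milk on the way home", is_spam=False)
    for tok in ("cheap", "buy", "meeting", "unseen"):
        print(tok, stats.spam_probability(tok))
```

In this toy data, "cheap" only ever appeared in spam, "buy" appeared in both classes and so is a weaker indicator, and an unseen token stays neutral.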

--
Bowie.


kremels at kreme

Sep 22, 2009, 12:43 AM

Post #4 of 11 (717 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

On 21-Sep-2009, at 13:05, poifgh wrote:
> Mails which have very high or very low score are fed to bayesian learning.
> Since we are confident about them being HAM or SPAM what do we want to learn
> from them - The regex filters have identified that the mail is a spam (say),
> what additional does bayesian learning achieve? Does it learn other words in
> the spam mail (say words surrounding obfuscated term) in hope of matching
> them in future emails? Or am I understanding it completely different?

Bayes learning from spam helps score messages as spam that would not
otherwise score as spam. Similarly, Bayes learning from ham helps score
messages as ham that might otherwise be tagged as ham.

--
Heisenberg's only uncertainty was what pub to vomit in next and
Jung fancied Freud's mother too. -- Jared Earle


me at junc

Sep 22, 2009, 1:56 AM

Post #5 of 11 (709 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

On Tue 22 Sep 2009 09:43:23 CEST, LuKreme wrote
> bayes learning from ham helps score messages as
> ham that might otherwise be tagged as ham.

oops :)

--
xpoint


abhinav.pathak at gmail

Sep 24, 2009, 6:44 PM

Post #6 of 11 (687 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

Bowie Bailey wrote:
>
> For auto-learning, the high and low scoring messages are fed to Bayes.
> However, for an optimal setup, you should manually train Bayes on as
> much of your (verified) ham and spam as possible. The more of your mail
> stream Bayes sees, the better the results will be.
>
> Your description of Bayes is pretty close. It breaks down the message
> into "tokens" (words and character sequences) and then keeps track of
> how likely each of those tokens is to appear in either a ham or spam
> message. When a new message comes in, Bayes breaks it into tokens and
> then scores it depending on which tokens were found in the message.
>

Suppose we do not have manual Bayesian training. We only do online training,
in which high- and low-scoring mails are fed to the learner. [Is this a usual
thing to do? How many people manually train their Bayesian filter?]
A high-scoring spam is fed to the learner. The spam scores high because a few
rules [regex] matched. The Bayesian learner then learns all the tokens from
this mail. The next time a mail [say M] with similar tokens is seen, it is
flagged as spam [using Bayes' rule]. Why would Bayesian learning be needed for
us to say M is spam? Since M contains much the same words as the earlier
high-scoring mails we learnt from, shouldn't we expect the regex rules to
match M as well?

Here is how I think Bayes is helpful [which could be entirely my
misunderstanding]. Suppose a set of spam mails looks like

"Please buy M3d1C1NE X at store Y for cheap".

Here the spammers have obfuscated the word "medicine". Suppose they send a
thousand spams, each spelling "medicine" a different way, while the other
words around it stay nearly the same. Only some of the first 100 of these
mails would hit a MEDICINE rule [regex], if one exists. Those mails would
have high spam scores, and hence the Bayesian filter would learn that mails
containing the words "Please", "buy", "at", "store", "for", "cheap" have a
high spam probability.

For the 101st mail, if the MEDICINE regex fails to match the obfuscated
text, the mail would get a low score from the rules, but the Bayesian learner
would say, from the words surrounding the obfuscated text, that this mail is
spam.

Does it work this way? Does it work only this way [if not manually trained]?







kremels at kreme

Sep 25, 2009, 12:58 AM

Post #7 of 11 (703 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

On Sep 24, 2009, at 7:44 PM, poifgh wrote:
> For 101st mail, if the regex MEDICINE is unable to match the obfuscated
> text, then the mail would have a low score, but bayesian learner would say,
> seeing the words surrounding obfuscated text, that this mail is spam.

Essentially this is how it works. Bayes breaks messages into tokens and
learns them as spam or ham based on one of two things: the message's overall
score (auto-learning) or an explicit command-line flag (manual training). If
the score is high enough, the message is learned as spam, which means all its
tokens are counted as spam. If the score is low enough, the message is
learned as ham and its tokens are likewise counted as ham. Tokens that appear
in both classes cancel out. New messages are then examined for tokens, and
the final Bayes 'score' depends on how many tokens of each type are found and
(this is the clever bit) how strong an indicator of spamminess or hamminess
each one is.
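
To make the "how strong each token is" part concrete, here is a small Python sketch that combines per-token spam probabilities by summing log-odds (plain naive Bayes). This is a swapped-in, simpler technique for illustration; SpamAssassin's own combining method is different and more involved, and the `combine` function and the 0.01/0.99 clamp are assumptions of this example.

```python
import math


def combine(token_probs: list) -> float:
    """Combine per-token spam probabilities via summed log-odds."""
    log_odds = 0.0
    for p in token_probs:
        p = min(max(p, 0.01), 0.99)  # clamp so no single token is absolute
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    print(combine([0.5, 0.5, 0.5]))    # neutral tokens contribute nothing
    print(combine([0.9, 0.1]))         # equally strong opposite tokens cancel
    print(combine([0.99, 0.6, 0.5]))   # one strong spam token dominates
```

A token seen equally often in ham and spam has probability 0.5 and adds zero log-odds, which is the "cancel out" behaviour; a token near 0.99 or 0.01 moves the result much further than one near 0.6 or 0.4.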

The reason manual training is useful is that there is a wide range of
scores between auto-learn ham and auto-learn spam.
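
The gap can be sketched as follows. The thresholds 0.1 and 12.0 are, as far as I know, the defaults for `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam`; treat them, the exact boundary comparisons and the `autolearn_decision` helper as assumptions. Real auto-learning applies further conditions that this sketch ignores.

```python
def autolearn_decision(score: float,
                       ham_threshold: float = 0.1,
                       spam_threshold: float = 12.0) -> str:
    """Simplified auto-learn decision based on the message score only."""
    if score >= spam_threshold:
        return "learn as spam"
    if score < ham_threshold:
        return "learn as ham"
    return "not learned"  # the wide middle range that manual training covers


if __name__ == "__main__":
    for score in (-2.0, 0.0, 3.0, 6.5, 11.9, 15.0):
        print(score, autolearn_decision(score))
```

With these numbers, a spam scoring 6.5 is tagged as spam (assuming a required score of 5) but is never auto-learned, so Bayes only sees it if someone trains on it by hand.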

A BAYES_50 is a neutral result, and it is generally given a score of 0.
However, in my experience quite a lot of emails hitting BAYES_50 are
actually spam. Ham messages tend to score lower, assuming your Bayes
database is sufficiently large.

score BAYES_99 5.0
score BAYES_95 4.5
score BAYES_80 2
score BAYES_60 1.00
score BAYES_50 0.25
score BAYES_40 -0.50
score BAYES_20 -2.50
score BAYES_05 -3.50
score BAYES_00 -5.00

So yes, for me Bayes_99 is a poison pill, and 95 is close enough. I
have very little hitting _80 or _60 or _40, so these scores are
basically WAGs.
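
For readers unfamiliar with the BAYES_nn rules: each rule fires when the Bayes probability falls in a given range, and the `score` lines above assign points to each range. A Python sketch of that mapping follows; the bucket boundaries are what I believe the stock rules use and should be treated as an assumption, the scores are the custom ones listed above, and `bayes_rule` is a made-up helper.

```python
# (upper bound of the probability bucket, rule name, score from the list above)
BUCKETS = [
    (0.01, "BAYES_00", -5.00),
    (0.05, "BAYES_05", -3.50),
    (0.20, "BAYES_20", -2.50),
    (0.40, "BAYES_40", -0.50),
    (0.60, "BAYES_50", 0.25),
    (0.80, "BAYES_60", 1.00),
    (0.95, "BAYES_80", 2.00),
    (0.99, "BAYES_95", 4.50),
    (1.00, "BAYES_99", 5.00),
]


def bayes_rule(probability: float):
    """Return (rule name, score) for a Bayes spam probability in [0, 1]."""
    for upper, name, score in BUCKETS:
        if probability < upper:
            return name, score
    return BUCKETS[-1][1], BUCKETS[-1][2]  # probability == 1.0


if __name__ == "__main__":
    for p in (0.001, 0.5, 0.97, 0.995, 1.0):
        print(p, bayes_rule(p))
```

With a required score of 5, a BAYES_99 hit worth 5.0 is enough on its own to tag a message, which is what "poison pill" means here.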

TOP SPAM RULES FIRED
RANK  RULE NAME        %OFMAIL  %OFSPAM  %OFHAM
   1  BAYES_99           57.12    92.66    1.84
   2  HTML_MESSAGE       78.17    79.89   75.51
   3  URIBL_BLACK        43.66    70.76    1.49
   4  RCVD_IN_JMF_BL     36.20    57.45    3.14
   5  SPF_PASS           37.14    50.73   15.99
   6  URIBL_JP_SURBL     28.99    47.56    0.10
   7  URIBL_OB_SURBL     21.01    34.44    0.13
   8  DKIM_SIGNED        31.58    31.10   32.33

TOP HAM RULES FIRED
RANK  RULE NAME        %OFMAIL  %OFSPAM  %OFHAM
   1  AWL                45.92    19.29   87.37
   2  HTML_MESSAGE       78.17    79.89   75.51
   3  BAYES_00           21.30     0.08   54.31
   4  RCVD_IN_JMF_W      16.63     0.78   41.29
   5  DKIM_SIGNED        31.58    31.10   32.33
   6  DKIM_VERIFIED      25.13    23.44   27.77
   7  BAYES_50           11.88     1.94   27.36
   8  SPF_PASS           37.14    50.73   15.99

Now, this is misleading because it is looking at the spamd log, and when
it gets right down to searching, a large number of BAYES_50 messages will
end up being classified as spam.

Other surprises are that DKIM is pretty useless and SPF_PASS is
actually a slight spam indicator.
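
One way to read the two tables is to compare, per rule, the share of spam it hits with the share of ham it hits. The Python sketch below does that with figures copied from the tables above; the `spam_to_ham_ratio` helper is made up for this example, and the ratio is only a crude indicator, not how SpamAssassin assigns scores.

```python
# rule name -> (%OFSPAM, %OFHAM), copied from the tables above
RULE_STATS = {
    "BAYES_99": (92.66, 1.84),
    "HTML_MESSAGE": (79.89, 75.51),
    "URIBL_BLACK": (70.76, 1.49),
    "SPF_PASS": (50.73, 15.99),
    "DKIM_SIGNED": (31.10, 32.33),
    "AWL": (19.29, 87.37),
    "BAYES_00": (0.08, 54.31),
}


def spam_to_ham_ratio(rule: str) -> float:
    """Above 1: the rule hits spam more often than ham; below 1: the reverse."""
    pct_spam, pct_ham = RULE_STATS[rule]
    return pct_spam / pct_ham


if __name__ == "__main__":
    ranked = sorted(RULE_STATS, key=spam_to_ham_ratio, reverse=True)
    for rule in ranked:
        print(f"{rule:14s} {spam_to_ham_ratio(rule):8.2f}")
```

DKIM_SIGNED comes out close to 1 (it hits spam and ham about equally), while SPF_PASS comes out above 1 in this mail stream, which matches the two observations above.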

--
if you ever get that chimp of your back, if you ever find the thing
you lack, ah but you know you're only having a laugh. Oh, oh
here we go again -- until the end.


me at junc

Sep 25, 2009, 2:23 AM

Post #8 of 11 (686 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

On Fri 25 Sep 2009 09:58:41 CEST, LuKreme wrote
> Other surprises are that DKIM is pretty useless and SPF_PASS is
> actually a slight spam indicator.

You miss the point: there is no USER_IN_* rule.

So without some whitelist_from_* entries, DKIM and SPF will not be helpful.

If they were helpful on their own, you would be giving spammers a free ride.
Is that what you wanted?

--
xpoint


Mark.Martinec+sa at ijs

Sep 25, 2009, 2:56 AM

Post #9 of 11 (687 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

LuKreme wrote:
> Other surprises are that DKIM is pretty useless and SPF_PASS is
> actually a slight spam indicator.

Benny Pedersen wrote:
> so without some whitelist_from_* DKIM and SPF will not be helpful

Indeed. Score points should be kept close to zero for rules
DKIM_SIGNED, DKIM_VALID and DKIM_VALID_AU (or DKIM_VERIFIED in pre-3.3).

The value of DKIM verification does not come from score points of these
informational rules directly, but from derived rules: from DKIM-based
whitelisting and from fraud protection (DKIM_ADSP_* rules with their
associated 'adsp_override' in 3.3.0, or hand written rules in pre-3.3).

Mark


kremels at kreme

Sep 25, 2009, 4:54 AM

Post #10 of 11 (684 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

On 25-Sep-2009, at 03:56, Mark Martinec wrote:
> LuKreme wrote:
>> Other surprises are that DKIM is pretty useless and SPF_PASS is
>> actually a slight spam indicator.
>
> Benny Pedersen wrote:
>> so without some whitelist_from_* DKIM and SPF will not be helpful
>
> Indeed. Score points should be kept close to zero for rules
> DKIM_SIGNED, DKIM_VALID and DKIM_VALID_AU (or DKIM_VERIFIED in
> pre-3.3).

As they are, and I never said anything different. I don't know where
Benny got the idea I was giving spammers a 'free ride.'

I meant to say "pretty useless on its own".




--
I think it's the duty of the comedian to find out where the line is
drawn and cross it deliberately.


Bowie_Bailey at BUC

Sep 25, 2009, 7:09 AM

Post #11 of 11 (678 views)
Permalink
Re: Understanding SpamAssassin [In reply to]

poifgh wrote:
> Bowie Bailey wrote:
>
>> For auto-learning, the high and low scoring messages are fed to Bayes.
>> However, for an optimal setup, you should manually train Bayes on as
>> much of your (verified) ham and spam as possible. The more of your mail
>> stream Bayes sees, the better the results will be.
>>
>> Your description of Bayes is pretty close. It breaks down the message
>> into "tokens" (words and character sequences) and then keeps track of
>> how likely each of those tokens is to appear in either a ham or spam
>> message. When a new message comes in, Bayes breaks it into tokens and
>> then scores it depending on which tokens were found in the message.
>>
>>
>
> Suppose we do not have manual Bayesian training. We only do online training,
> in which high- and low-scoring mails are fed to the learner. [Is this a usual
> thing to do? How many people manually train their Bayesian filter?]
> A high-scoring spam is fed to the learner. The spam scores high because a few
> rules [regex] matched. The Bayesian learner then learns all the tokens from
> this mail. The next time a mail [say M] with similar tokens is seen, it is
> flagged as spam [using Bayes' rule]. Why would Bayesian learning be needed for
> us to say M is spam? Since M contains much the same words as the earlier
> high-scoring mails we learnt from, shouldn't we expect the regex rules to
> match M as well?
>

Look at it this way -- Bayes is learning what your spam looks like and
what your ham looks like. Most of your spam will be caught by other
rules, but there are times when an email will come in that the main
rules do not catch. Bayes is frequently able to catch these because it
is looking at the message as a whole rather than looking for particular
words or phrases as the main regex rules do.

Manual training is not strictly required for Bayes, but the more manual
training you do, the higher the accuracy and the more useful it
becomes. At the least, you should manually train Bayes on all of your
false positives and false negatives. This can be scripted to happen
automatically based on folders which are expected to contain hand-sorted
spam and ham.
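
A minimal sketch of such a script in Python is shown below. It relies on `sa-learn --spam` and `sa-learn --ham`, which accept a file or directory of messages; the folder paths and the `build_commands` and `train` helpers are hypothetical and need adapting to your own mail layout. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path


def build_commands(spam_dir: Path, ham_dir: Path) -> list:
    """Build the sa-learn invocations for hand-sorted spam and ham folders."""
    return [
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]


def train(spam_dir: Path, ham_dir: Path, dry_run: bool = True) -> None:
    for cmd in build_commands(spam_dir, ham_dir):
        if dry_run:
            print(" ".join(cmd))             # show what would be run
        else:
            subprocess.run(cmd, check=True)  # raises if sa-learn fails


if __name__ == "__main__":
    # Hypothetical folder locations; adjust before use.
    train(Path("/home/user/Maildir/.Junk/cur"),
          Path("/home/user/Maildir/.Verified-Ham/cur"))
```

Run from cron with `dry_run=False`, something like this would keep Bayes trained on whatever users sort into those folders.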

> Here is how I think Bayes is helpful [which could be entirely my
> misunderstanding]. Suppose a set of spam mails looks like
>
> "Please buy M3d1C1NE X at store Y for cheap".
>
> Here the spammers have obfuscated the word "medicine". Suppose they send a
> thousand spams, each spelling "medicine" a different way, while the other
> words around it stay nearly the same. Only some of the first 100 of these
> mails would hit a MEDICINE rule [regex], if one exists. Those mails would
> have high spam scores, and hence the Bayesian filter would learn that mails
> containing the words "Please", "buy", "at", "store", "for", "cheap" have a
> high spam probability.
>
> For the 101st mail, if the MEDICINE regex fails to match the obfuscated
> text, the mail would get a low score from the rules, but the Bayesian learner
> would say, from the words surrounding the obfuscated text, that this mail is
> spam.
>
> Does it work this way? Does it work only this way [if not manually trained]?
>

That is a pretty fair description of how it works regardless of how you
train it. The advantage of manual training is that you allow it to
learn from the lower scoring spam (and higher scoring ham), which are
the kinds of messages that can most use the extra points from the Bayes
rules.

--
Bowie
