
kremels at kreme
Sep 25, 2009, 12:58 AM
Post #7 of 11
On Sep 24, 2009, at 7:44 PM, poifgh wrote:
> For 101st mail, if the regex MEDICINE is unable to match the obfuscated
> text, then the mail would have a low score, but bayesian learner would say,
> seeing the words surrounding obfuscated text, that this mail is spam.

Essentially this is how it works. Bayes looks for tokens in the messages and categorizes them as spam or ham based on one of two things: the overall score of the message (auto-learning) or the specific command line flag you pass when training by hand (sa-learn --spam or --ham). If the score is high enough, the message is learned as spam, which means all its tokens are classified as spam. If the score is low enough, the message is learned as ham and its tokens are likewise classified as ham.

Tokens that appear in both classes cancel out. New messages are then examined for known tokens, and the final bayes 'score' is weighted by how many tokens of each type there are and (this is the clever bit) by how strong an indicator of spamishness or hamishness each one is.

The reason manual training is useful is that there is a wide range of scores in between auto-learn ham and auto-learn spam, and messages in that range are never learned automatically. BAYES_50 is a neutral result and is generally given a weight of 0. However, in my experience quite a lot of emails that hit BAYES_50 are actually spam. Ham messages tend to score lower, assuming your data is sufficiently large.

Here are my scores:

score BAYES_99  5.0
score BAYES_95  4.5
score BAYES_80  2
score BAYES_60  1.00
score BAYES_50  0.25
score BAYES_40 -0.50
score BAYES_20 -2.50
score BAYES_05 -3.50
score BAYES_00 -5.00

So yes, for me BAYES_99 is a poison pill, and BAYES_95 is close enough. I have very little hitting _80, _60 or _40, so those scores are basically WAGs.
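To make the token weighting described above a bit more concrete, here is a rough Python sketch. It is a toy, not SpamAssassin's code: the class name TinyBayes, the whitespace tokenizer, the add-one smoothing and the simple log-odds combination are all my own assumptions (SpamAssassin combines token probabilities differently). The bucket boundaries in BOUNDS are also assumed for illustration. Only the numbers in POSTER_SCORES come from the score lines in this post.

```python
import math
from bisect import bisect_right
from collections import Counter

# Scores taken from the post; everything else here is a hypothetical sketch.
POSTER_SCORES = {
    "BAYES_00": -5.00, "BAYES_05": -3.50, "BAYES_20": -2.50,
    "BAYES_40": -0.50, "BAYES_50": 0.25, "BAYES_60": 1.00,
    "BAYES_80": 2.0, "BAYES_95": 4.5, "BAYES_99": 5.0,
}
RULES = sorted(POSTER_SCORES)  # BAYES_00 ... BAYES_99, lowest bucket first
# Assumed upper bounds of each bucket except the last one.
BOUNDS = [0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99]


class TinyBayes:
    """Toy token learner: counts in how many spam/ham messages a token appears."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())  # naive tokenizer (assumption)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        # Add-one smoothing keeps the result strictly between 0 and 1.
        s = (self.spam_tokens[token] + 1) / (self.n_spam + 2)
        h = (self.ham_tokens[token] + 1) / (self.n_ham + 2)
        return s / (s + h)

    def score(self, text, top_n=15):
        probs = [self.token_prob(t) for t in set(text.lower().split())]
        if not probs:
            return 0.5  # nothing to go on: neutral
        # Keep the strongest indicators, i.e. those farthest from 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        # A token seen equally in both classes has p = 0.5 and adds 0 here,
        # which is the "cancel out" effect.
        log_odds = sum(math.log(p) - math.log(1 - p) for p in probs[:top_n])
        return 1 / (1 + math.exp(-log_odds))


def bayes_rule(prob):
    """Map a probability to a BAYES_xx rule name using the assumed BOUNDS."""
    return RULES[bisect_right(BOUNDS, prob)]


bayes = TinyBayes()
bayes.learn("cheap m3dicine buy now", is_spam=True)
bayes.learn("buy cheap pills now", is_spam=True)
bayes.learn("meeting notes attached now", is_spam=False)
bayes.learn("lunch meeting tomorrow", is_spam=False)

p = bayes.score("buy cheap v1agra now")  # "v1agra" was never seen before
print(bayes_rule(p), POSTER_SCORES[bayes_rule(p)])
```

The point of the example is the one poifgh raised: the obfuscated word itself is unknown and contributes nothing, but the tokens around it have been seen in spam before, so the message still leans toward spam.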
TOP SPAM RULES FIRED

RANK  RULE NAME        %OFMAIL  %OFSPAM  %OFHAM
   1  BAYES_99           57.12    92.66    1.84
   2  HTML_MESSAGE       78.17    79.89   75.51
   3  URIBL_BLACK        43.66    70.76    1.49
   4  RCVD_IN_JMF_BL     36.20    57.45    3.14
   5  SPF_PASS           37.14    50.73   15.99
   6  URIBL_JP_SURBL     28.99    47.56    0.10
   7  URIBL_OB_SURBL     21.01    34.44    0.13
   8  DKIM_SIGNED        31.58    31.10   32.33

TOP HAM RULES FIRED

RANK  RULE NAME        %OFMAIL  %OFSPAM  %OFHAM
   1  AWL                45.92    19.29   87.37
   2  HTML_MESSAGE       78.17    79.89   75.51
   3  BAYES_00           21.30     0.08   54.31
   4  RCVD_IN_JMF_W      16.63     0.78   41.29
   5  DKIM_SIGNED        31.58    31.10   32.33
   6  DKIM_VERIFIED      25.13    23.44   27.77
   7  BAYES_50           11.88     1.94   27.36
   8  SPF_PASS           37.14    50.73   15.99

Now, this is misleading, because it is looking at the spamd log, and when it gets right down to searching, a large number of BAYES_50 messages will end up being classified as spam.

Other surprises are that DKIM is pretty useless (DKIM_SIGNED fires on 31.10% of spam and 32.33% of ham) and SPF_PASS is actually a slight spam indicator (50.73% of spam versus 15.99% of ham).

--
if you ever get that chimp of your back, if you ever find the thing you lack, ah but you know you're only having a laugh. Oh, oh here we go again -- until the end.