
guenther at rudersport
Jul 24, 2013, 3:59 PM
On Wed, 2013-07-24 at 15:15 +0200, Simon Loewenthal wrote:
> I rewrote this (not GTUBE anymore) and had the same bayes score
> http://pastebin.com/ATqch32Y

Simon, it seems you have a mistaken understanding of Bayes and how it
works. Quoting part of the mail body from that paste:

> You should send this from outside of your car.

So you "rewrote" the body, or rather modified it slightly, replacing a
few words. Note that it *did* change the result, now hitting BAYES_20
rather than 00 before.

First things first: Your (original) test message is *based* on GTUBE.
It isn't GTUBE at all, though -- which led to quite some confusion in
this thread. Starting off with GTUBE (the mail), you stripped what
GTUBE actually is: the weird 68 byte string.

Then you took that message, which started out as the test message to
verify mail gets passed through SA, and kept using it as your test
message -- modifying it to match your own rules. Nothing wrong with
that, though I'd suggest removing the claim in the body (descriptive
text, actually meant as instructions) that it constitutes GTUBE.

  Send it from outside of your car.
  Send it from outside of your network.

This summarizes the "rewriting" you just did, in the hope that SA would
magically stop hitting low Bayes rules. You're barking up the wrong
tree -- you should outright ignore the Bayes score when you are
actually testing your own rules. Granted, short-circuiting on BAYES_00
got in the way, resulting in your own rules not being evaluated at all
and thus not matching. [1]

Solution: Disable short-circuiting (a config sketch is appended below
the signature), or prevent your test mail from scoring a really low
Bayes probability. Getting to the latter next.

Bayes, more precisely the Bayesian probability, estimates how likely
the mail is to be spam rather than solicited, wanted, hammy mail -- you
name it. The lower the number in the BAYES_nn rules, the more hammy.
Also, the SA Bayes implementation considers tokens -- words, to keep it
easy. It does not recognize sentences or multi-word tokens.

Example: see that quote above, and how you modified it. Let's assume
that'd be the complete mail text, and all Bayes gets to see. Keeping in
mind the BAYES_nn change from 00 (less than 20% chance of spam) to 20
(between 20% and 40%), we can see the impact of that change. Namely,
the word "network" in a mail indicates a much higher probability of the
mail being ham than the word "car" does. (A toy calculation appended
below illustrates the effect.) You're either not into cars, or just
don't like Lisp...

How does that help prevent BAYES_00 hits? Remove hammy tokens from your
test message. Even better, remove all the text -- that is, the GTUBE
instructions you kept. You don't want them evaluated, your focus is on
the rules you're testing -- so why keep them around?

In closing, there is absolutely no problem with your test mail hitting
BAYES_00, unless these words and tokens are what all your spam looks
like...

[1] Taking a guess, that's probably why you had the impression Bayes
might not have been hit at all before. Which is really unlikely: there
is always a BAYES_nn rule indicating the Bayesian probability, on a
scale ranging from ham to spam.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8?
c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}
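
As for disabling short-circuiting, a minimal sketch -- assuming the
stock Mail::SpamAssassin::Plugin::Shortcircuit is what fires here, and
that a rule along the lines of "shortcircuit BAYES_00 ham" is active in
your config; the file paths below are only the usual defaults and may
differ on your install:

  # local.cf (often /etc/mail/spamassassin/local.cf -- path varies)
  #
  # The plugin itself is usually loaded from one of the *.pre files
  # (v320.pre on stock installs):
  #   loadplugin Mail::SpamAssassin::Plugin::Shortcircuit
  #
  # If a line like "shortcircuit BAYES_00 ham" is in effect, a BAYES_00
  # hit stops rule evaluation early, and your own test rules never run.
  # Turn that off while you are testing:
  shortcircuit BAYES_00 off

Run "spamassassin --lint" afterwards to make sure the config still
parses cleanly.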
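
And to put numbers on the token point: the snippet below is a toy,
Graham-style combiner, not SA's real implementation (SA tokenizes
headers as well, uses chi-squared combining, and its per-token
probabilities come from your trained database). The probabilities for
"car" and "network" are simply made up to mirror the BAYES_20 vs.
BAYES_00 outcome.

  # Toy Graham-style combining of per-token spam probabilities.
  # NOT what SpamAssassin actually does (SA uses chi-squared combining
  # and a far more elaborate tokenizer); the numbers are made up just
  # to show how one hammy token drags the overall probability down.

  def combined_spam_prob(token_probs):
      """Combine per-token P(spam | token) values into one probability."""
      p_spam, p_ham = 1.0, 1.0
      for p in token_probs:
          p_spam *= p
          p_ham *= 1.0 - p
      return p_spam / (p_spam + p_ham)

  # Hypothetical per-token probabilities from a trained corpus:
  with_car     = [0.5, 0.5, 0.35]   # "send", "outside", "car"
  with_network = [0.5, 0.5, 0.15]   # "send", "outside", "network"

  print(combined_spam_prob(with_car))      # ~0.35 -> BAYES_20 territory
  print(combined_spam_prob(with_network))  # ~0.15 -> BAYES_00 territory

The two neutral 0.5 tokens cancel out, so the single "car" vs.
"network" token decides where on the BAYES_nn scale the mail lands --
which is exactly why removing hammy text from the test mail (or
ignoring Bayes entirely while testing) is the easier fix.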