
Mailing List Archive: SpamAssassin: users

ADDRESS_IN_SUBJECT et al

 

 



vectro at vectro

Jul 24, 2013, 5:28 PM

Post #1 of 11
ADDRESS_IN_SUBJECT et al

Hello list,

I notice that the old rule ADDRESS_IN_SUBJECT was dropped starting in
SpamAssassin 3.3 (The change is in bug 5123 and commit 467038). Lately,
however, I've started getting a lot of spam again where the To: address is in
the subject. Perhaps it's time to evaluate restoring this rule?

-Ian


guenther at rudersport

Jul 24, 2013, 6:23 PM

Post #2 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Wed, 2013-07-24 at 20:28 -0400, Ian Turner wrote:
> I notice that the old rule ADDRESS_IN_SUBJECT was dropped starting in
> SpamAssassin 3.3 (The change is in bug 5123 and commit 467038). Lately,
> however, I've started getting a lot of spam again where the To: address is in
> the subject. Perhaps it's time to evaluate restoring this rule?

Well, how do they score usually? It's hardly worth adding a point if
they are rather high scoring anyway.

header LOCALPART_IN_SUBJECT eval:check_for_to_in_subject('user')

And all of them do hit that rule. A super-set of the ADDRESS variant,
using the local part instead of the complete address. Still in stock
rules.


--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


vectro at vectro

Jul 24, 2013, 6:53 PM

Post #3 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thursday, July 25, 2013 03:23:39 AM Karsten Bräckelmann wrote:
> On Wed, 2013-07-24 at 20:28 -0400, Ian Turner wrote:
> > I notice that the old rule ADDRESS_IN_SUBJECT was dropped starting in
> > SpamAssassin 3.3 (The change is in bug 5123 and commit 467038). Lately,
> > however, I've started getting a lot of spam again where the To: address is
> > in the subject. Perhaps it's time to evaluate restoring this rule?
>
> Well, how do they score usually? It's hardly worth adding a point if
> they are rather high scoring anyway.
>
> header LOCALPART_IN_SUBJECT eval:check_for_to_in_subject('user')
>
> And all of them do hit that rule. A super-set of the ADDRESS variant,
> using the local part instead of the complete address. Still in stock
> rules.

They are moderately low-scoring, sadly (I wouldn't have noticed otherwise!),
mainly due to bayes poison. A typical message looks like this:

 0.0 NO_DNS_FOR_FROM      DNS: Envelope sender has no MX or A DNS records
 1.9 DATE_IN_FUTURE_06_12 Date: is 6 to 12 hours after Received: date
-1.9 BAYES_00             BODY: Bayes spam probability is 0 to 1%
                          [score: 0.0000]
 0.5 MISSING_MID          Missing Message-Id: header
 0.8 RDNS_NONE            Delivered to internal network by a host with no rDNS
 0.0 T_DKIM_INVALID       DKIM-Signature header exists but is not valid

Looking at the code for check_for_to_in_subject, it looks like the regular
expression used for LOCALPART_IN_SUBJECT is rather different (much more
specific) than the one used for ADDRESS_IN_SUBJECT. Presumably that's why this
rule doesn't match.

An example subject from this spam (address changed to protect the innocent):
<someone [at] example>_Need Approval for Fast Funds? July 24th 2013_

For "address" mode, the regex is this one: /\b\Q$full_to\E\b/i
But for "user" mode, the regex is this one:
/^(?:
(?:re|fw):\s*(?:\w+\s+)?\Q$to\E$
|(?-i:\Q$to\E)\s*[,:;!?-](?:$|\s)
|\Q$to\E$
|,\s*\Q$to\E[,:;!?-]$
)/ix

Among other restrictions, this regex seems to only match the username at the
beginning or end of the subject.
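One quick way to confirm this is to re-create both checks outside SA. The following Python sketch is a rough, hypothetical translation of the two Perl regexes quoted above (with the archive-munged sample address expanded to an assumed someone@example.com); it is not SA's actual code, and it drops the Perl version's case-sensitive `(?-i:...)` nuance on the second alternative:

```python
import re

to = "someone"                    # local part of the To: address
full_to = "someone@example.com"   # assumed de-munged form of the sample address
subject = "<someone@example.com>_Need Approval for Fast Funds? July 24th 2013_"

# 'address' mode: the complete address, anywhere in the Subject
address_re = re.compile(r"\b" + re.escape(full_to) + r"\b", re.I)

# 'user' mode: the local part, anchored to the start/end as quoted above
q = re.escape(to)
user_re = re.compile(
    r"^(?:"
    r"(?:re|fw):\s*(?:\w+\s+)?" + q + r"$"
    r"|" + q + r"\s*[,:;!?-](?:$|\s)"
    r"|" + q + r"$"
    r"|,\s*" + q + r"[,:;!?-]$"
    r")",
    re.I,
)

print(bool(address_re.search(subject)))  # True: the old ADDRESS-style check hits
print(bool(user_re.search(subject)))     # False: the leading '<' defeats 'user' mode
```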

--Ian


guenther at rudersport

Jul 24, 2013, 8:15 PM

Post #4 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Wed, 2013-07-24 at 21:53 -0400, Ian Turner wrote:
> They are moderately low-scoring, sadly (I wouldn't have noticed otherwise!),
> mainly due to bayes poison. A typical message looks like this:

Do you manually train them as spam?

> -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
> [score: 0.0000]

Ouch. A probability score of < 0.00005 -- which pretty much equals no
token learned as spammy. Seriously? How often do you see "Funds" (mind
the uppercase!) or "funds" in ham? How many of them do have that word in
the Subject (which in addition gets treated specially by SA)?

See where I am heading? Any chance your Bayes DB is completely borked?
sa-learn --dump magic

Might be worth putting a sample or three up a pastebin of your choice,
to see more of the text.

And for further digging, which are the top hammy / spammy tokens? See
M::SA::Conf [1], section Template Tags.


> Looking at the code for check_for_to_in_subject, it looks like the regular
> expression used for LOCALPART_IN_SUBJECT is rather different (much more
> specific) than the one used for ADDRESS_IN_SUBJECT. Presumably that's why this
> rule doesn't match.
>
> An example subject from this spam (address changed to protect the innocent):
> <someone [at] example>_Need Approval for Fast Funds? July 24th 2013_

Do the Subjects strictly follow that pattern? Including the angle
brackets AND the underscore? Dead easy target for a local rule to squat
them.
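For instance, a local rule squatting exactly that shape might look like this (untested sketch for a local .cf file; the rule name and score are arbitrary):

header   LOCAL_TO_IN_ANGLES  Subject =~ /^<[^<>]+\@[^<>]+>_/
describe LOCAL_TO_IN_ANGLES  Subject starts with <address> followed by an underscore
score    LOCAL_TO_IN_ANGLES  2.0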

BTW, don't get me wrong, I am not trying to prevent the old eval() rule
from re-appearing. It's just that such a pattern hasn't been mentioned as
an issue in ages, so my focus is on helping with your issue first.


> For "address" mode, the regex is this one: /\b\Q$full_to\E\b/i
> But for "user" mode, the regex is this one:
> /^(?:
> (?:re|fw):\s*(?:\w+\s+)?\Q$to\E$
> |(?-i:\Q$to\E)\s*[,:;!?-](?:$|\s)
> |\Q$to\E$
> |,\s*\Q$to\E[,:;!?-]$
> )/ix
>
> Among other restrictions, this regex seems to only match the username at the
> beginning or end of the subject.

It does accept quite a bit more, including a leading Re: with an optional,
arbitrary word following. Some restrictions are definitely necessary,
since the "local part" often resembles a user's first name, company
name, generic roles...

It does not match /^<localpart/ with a single opening angle bracket,
though.


[1] http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html

--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


vectro at vectro

Jul 25, 2013, 5:07 AM

Post #5 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thursday, July 25, 2013 05:15:19 AM Karsten Bräckelmann wrote:
> On Wed, 2013-07-24 at 21:53 -0400, Ian Turner wrote:
> > They are moderately low-scoring, sadly (I wouldn't have noticed
> > otherwise!),
> > mainly due to bayes poison. A typical message looks like this:
> Do you manually train them as spam?

Yes.

> > -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
> >
> > [score: 0.0000]
>
> Ouch. A probability score of < 0.00005 -- which pretty much equals no
> token learned as spammy. Seriously? How often do you see "Funds" (mind
> the uppercase!) or "funds" in ham? How many of them do have that word in
> the Subject (which in addition gets treated specially by SA)?

I work in finance. We talk about funds. :-) I have quite a bit of ham with
"Funds" or "funds" in the subject (but zip with the To: address in the
subject).

> See where I am heading? Any chance your Bayes DB is completely borked?
> sa-learn --dump magic

Not sure what to do with this, but here you go:

0.000          0          3          0  non-token data: bayes db version
0.000          0      29074          0  non-token data: nspam
0.000          0      46274          0  non-token data: nham
0.000          0     158157          0  non-token data: ntokens
0.000          0 1369590693          0  non-token data: oldest atime
0.000          0 1374752584          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1374712421          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

> Might be worth putting a sample or three up a pastebin of your choice,
> to see more of the text.

http://pastebin.com/8ATfK7EJ
http://pastebin.com/VMX0rEkn
http://pastebin.com/eQYUf2st

> And for further digging, which are the top hammy / spammy tokens? See
> M::SA::Conf [1], section Template Tags.

They are in the pastes in the X-Spam-JPW-Report: header.

> > Looking at the code for check_for_to_in_subject, it looks like the regular
> > expression used for LOCALPART_IN_SUBJECT is rather different (much more
> > specific) than the one used for ADDRESS_IN_SUBJECT. Presumably that's why
> > this rule doesn't match.
> >
> > An example subject from this spam (address changed to protect the
> > innocent): <someone [at] example>_Need Approval for Fast Funds? July 24th
> > 2013_
> Do the Subjects strictly follow that pattern? Including the angle
> brackets AND the underscore? Dead easy target for a local rule to squat
> them.

They do, and I did. These spams are pretty easy to catch; they also have some
boilerplate at the bottom of each one that is the same every time.

Cheers,

--Ian


rwmaillists at googlemail

Jul 25, 2013, 5:26 AM

Post #6 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thu, 25 Jul 2013 05:15:19 +0200
Karsten Bräckelmann wrote:

> On Wed, 2013-07-24 at 21:53 -0400, Ian Turner wrote:
> > They are moderately low-scoring, sadly (I wouldn't have noticed
> > otherwise!), mainly due to bayes poison. A typical message looks
> > like this:
>
> Do you manually train them as spam?
>
> > -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1%
> > [score: 0.0000]
>
> Ouch. A probability score of < 0.00005 -- which pretty much equals no
> token learned as spammy. Seriously?


It's not as odd as you might think. In my experience spams with a
rounded Bayes score of 0.0000 commonly have five very strong spam
tokens in X-Spam-Tokens. They end up being swamped by a much larger
number of hammy tokens.
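The swamping effect is easy to reproduce with a toy calculation. This sketch uses simple Graham-style naive combining rather than SA's actual chi-squared method, and the counts are made up, but the outcome is the same: a handful of very strong spam tokens is outvoted by a larger crowd of mildly hammy ones.

```python
from math import prod

# Five strong spam tokens, swamped by thirty mildly hammy tokens
spammy = [0.9] * 5
hammy = [0.1] * 30
p = spammy + hammy

# Graham-style combining: P = prod(p) / (prod(p) + prod(1 - p))
spam_product = prod(p)
ham_product = prod(1 - x for x in p)
score = spam_product / (spam_product + ham_product)

print(f"{score:.4f}")  # 0.0000 -- despite the five very spammy tokens
```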


guenther at rudersport

Jul 25, 2013, 2:31 PM

Post #7 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thu, 2013-07-25 at 08:07 -0400, Ian Turner wrote:
> > See where I am heading? Any chance your Bayes DB is completely borked?
> > sa-learn --dump magic
>
> Not sure what to do with this, but here you go:

> 0.000 0 29074 0 non-token data: nspam
> 0.000 0 46274 0 non-token data: nham
> 0.000 0 158157 0 non-token data: ntokens

You have trained more ham than spam. That's not necessarily a problem,
and opinions differ greatly. But it might be an indication that your
Bayes is skewed.

> > And for further digging, which are the top hammy / spammy tokens? See
> > M::SA::Conf [1], section Template Tags.
>
> They are in the pastes in the X-Spam-JPW-Report: header.

That's useful. All three samples are rather similar, including the
tokens. Some headers from the first one:

Return-path: <bounce-248-77802767-someone=example.com [at] clickettle>

Hammy tokens:
0.001-+--HX-Envelope-From:sk:someone.,
0.005-+--HX-Envelope-From:sk:bounce-,
0.006-+--HX-Spam-Relays-External:sk:someone.,
0.006-+--H*RU:sk:someone.,
0.016-1--loans;

Spammy tokens:
0.903-+--Fast,
0.862-1--33179,
0.847-1--Miami,
0.847-1--miami

HAM: The strong hammy tokens are almost exclusively from the headers
(*RU stands for Untrusted Relays). In particular the X-Envelope-From
tokens are suspicious, given the Envelope-From / Return-Path value.

Are you filtering mailing-list traffic through SA? Do you manually train
them as ham, or did that happen by auto-learning?

If it's not mailing-lists but e.g. due to newsletters and stuff, it
might be worth using bayes_ignore_header on the Envelope-From header.
The most hammy tokens effectively amount to "bounce detection" and "own
address in the Return-Path".
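bayes_ignore_header is a stock SpamAssassin configuration option; a minimal snippet for a local .cf, assuming the header name seen in the hammy tokens above, would be:

# keep Bayes from tokenizing the site-specific envelope header
bayes_ignore_header X-Envelope-From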

SPAM: The spammy tokens are highly suspicious, too. As you confirmed,
you are manually training these as spam. And all three samples feature
an address in "Miami, FL 33179" at the bottom.

Yet, the declassification distance for "33179", "Miami" and "miami" (lc
version of the former, generated by SA Bayes) is a mere 1. Which means,
learning the token as the opposite just *once* makes them lose the
current classification.

Which seems rather unlikely, unless you frequently have such addresses
in ham, too. Nah, still unlikely.

Besides, once these are declassified, there would be only a single
spammy token left -- the header above shows there are only 4, and there
simply was no 5th token in that message that was also spammy in the
database.


Do you use site-wide or per-user Bayes? Do you (manually) train by the
same user SA runs as while filtering?

You also might need some serious spam training.


--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


rwmaillists at googlemail

Jul 26, 2013, 4:46 AM

Post #8 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thu, 25 Jul 2013 23:31:57 +0200
Karsten Bräckelmann wrote:


> Spammy tokens:
> 0.903-+--Fast,
> 0.862-1--33179,
> 0.847-1--Miami,
> 0.847-1--miami

> SPAM: The spammy tokens are highly suspicious, too. As you confirmed,
> you are manually training these as spam. And all three samples feature
> an address in "Miami, FL 33179" at the bottom.
>
> Yet, the declassification distance for "33179", "Miami" and
> "miami" (lc version of the former, generated by SA Bayes) is a mere
> 1. Which means, learning the token as the opposite just *once* makes
> them lose the current classification.


The threshold for classification is 0.846. It would be remarkable if
these tokens didn't have a declassification distance of 1.
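A toy calculation shows why. Using the raw Graham-style probability estimate (ignoring SA's Robinson smoothing, so the exact numbers are only illustrative) with the nspam/nham counts from the sa-learn dump earlier in the thread, a token seen in only two spams flips well below the 0.846 threshold after a single ham training:

```python
# Counts from the sa-learn --dump magic output above
nspam, nham = 29074, 46274

def token_prob(spam_hits, ham_hits):
    # Raw Graham estimate: per-class frequencies, normalized by corpus size
    s = spam_hits / nspam
    h = ham_hits / nham
    return s / (s + h)

print(round(token_prob(2, 0), 3))  # 1.0: token only ever seen in spam
print(round(token_prob(2, 1), 3))  # 0.761: one ham training drops it below 0.846
```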


guenther at rudersport

Jul 26, 2013, 3:34 PM

Post #9 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Fri, 2013-07-26 at 12:46 +0100, RW wrote:
> On Thu, 25 Jul 2013 23:31:57 +0200 Karsten Bräckelmann wrote:

> > SPAM: The spammy tokens are highly suspicious, too. As you confirmed,
> > you are manually training these as spam. And all three samples feature
> > an address in "Miami, FL 33179" at the bottom.
> >
> > Yet, the declassification distance for "33179", "Miami" and
> > "miami" (lc version of the former, generated by SA Bayes) is a mere
> > 1. Which means, learning the token as the opposite just *once* makes
> > them lose the current classification.
>
> The threshold for classification is 0.846. It would be remarkable if
> these tokens didn't have a declassification distance of 1.

That's not the point, though. With correct manual training the
declassification distance should be higher. And frankly, there should be
more than 4 spammy tokens in total.

Caveat: This assumes that, to gather these Bayes token headers, SA has
been run as the same user it runs as when processing incoming mail.


--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


cepheid at 3phase

Jul 31, 2013, 2:39 PM

Post #10 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

At 3:23 AM +0200 07/25/2013, Karsten Bräckelmann wrote:
> header LOCALPART_IN_SUBJECT eval:check_for_to_in_subject('user')
>
>And all of them do hit that rule. A super-set of the ADDRESS variant,
>using the local part instead of the complete address. Still in stock
>rules.

Hmmmmm. One of my users has received at least two spams in recent days
with his email address in the Subject line. No LOCALPART or ADDRESS rule
hit on either email. sa-update is running nightly and rules are being
updated... any idea why this check may not have been run, and/or may not
have hit?

I can provide a pastebin if it would be helpful.

(Also, how do I find out what Bayes tokens hit on an email? Is there a
plugin for this? Is there any way to print out the Bayes DB in plain
English, or are the tokens stored only as one-way hashes so there's no
way to recover the original word?)

Thanks.

--- Amir


vectro at vectro

Aug 2, 2013, 7:14 PM

Post #11 of 11
Re: ADDRESS_IN_SUBJECT et al [In reply to]

On Thursday, July 25, 2013 11:31:57 PM Karsten Bräckelmann wrote:
> You have trained more ham than spam. That's not necessarily a problem,
> and opinions differ greatly. But it might be indication your Bayes is
> skewed.

Hmm. I'm not really sure how that can be. Anything detected as spam is
rejected at SMTP time and autolearned. Everything else is autolearned as
ham, although my users are pretty good at picking out the spam and
reporting it as such (we then relearn it as spam).

> Are you filtering mailing-list traffic through SA? Do you manually train
> them as ham, or did that happen by auto-learning?

Yes (not my mailing lists but others'), by auto-learning.

> Do you use site-wide or per-user Bayes? Do you (manually) train by the
> same user SA runs as while filtering?

Site-wide. All training is done as the SA user. The bayes database only
contains one userid.

> You also might need some serious spam training.

What does this mean?

--Ian
