
Mailing List Archive: SpamAssassin: users

New spam rule for specific content

 

 



cepheid at 3phase

Aug 9, 2013, 10:19 AM

Post #1 of 24 (73 views)
New spam rule for specific content

Hi all,

A number of my users have been receiving spam formatted in a
very specific way which seems to very often miss Bayes... I don't
know why, whether it's because of the HTML gibberish flooding Bayes
with useless tokens (to reduce the relative strength of the spammy
tokens), or if it's just that the specific content isn't sufficiently
spammy (or has sufficient ham to balance) to pop.
Either way, this spam appears to be generated from a specific
template, and I've created a rule to hit that template. Within the
last couple of weeks, I've had only true positives and negatives...
no FPs, no FNs.

For your perusal, here is the rule:

# Spammy URI pattern
uri __OUTL_URI /\/outl\b/
uri __OUTI_URI /\/outi\b/
meta OUTL_OUTI_IS_SPAMMY (__OUTL_URI && __OUTI_URI)
describe OUTL_OUTI_IS_SPAMMY /outl + /outi link combo is highly spammy
score OUTL_OUTI_IS_SPAMMY 3

If you don't specifically trust URI rules to not have FPs, I have a
rawbody version of this which works identically... in all cases, both
rules pop together, so I think there's no specific need to use the
rawbody version, but I can provide it if needed.
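
For readers less familiar with SA meta rules, the logic of the rule above can be sketched in Python. This is only an illustration under stated assumptions: the function name and the sample URIs are made up, and real URI extraction is done by SpamAssassin itself, not by this code.

```python
import re

# Python stand-ins for the two sub-rules; SpamAssassin would run these
# against each URI it extracts from the message.
OUTL_URI = re.compile(r"/outl\b")
OUTI_URI = re.compile(r"/outi\b")


def outl_outi_is_spammy(uris):
    """Mimic the meta rule: true only if both sub-rules hit somewhere."""
    has_outl = any(OUTL_URI.search(u) for u in uris)
    has_outi = any(OUTI_URI.search(u) for u in uris)
    return has_outl and has_outi


# Hypothetical URIs, for illustration only.
print(outl_outi_is_spammy(["http://example.biz/outl", "http://example.biz/outi"]))
print(outl_outi_is_spammy(["http://example.com/outline"]))
```

The \b word boundary is what keeps ordinary paths such as /outline or /outing from hitting.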

I recommend this rule be added to the general distribution.

(Like many other users here, I've also increased the Bayes scores for
Bayes99, and created a Bayes999 with even higher scoring... it might
be time to add that to the general distribution, too.)

Hope this helps...

--- Amir


jhardin at impsec

Aug 9, 2013, 10:41 AM

Post #2 of 24 (72 views)
Re: New spam rule for specific content [In reply to]

On Fri, 9 Aug 2013, Amir 'CG' Caspi wrote:

> A number of my users have been receiving spam formatted in a very
> specific way which seems to very often miss Bayes...

Can you provide a spample or two?

> I recommend this rule be added to the general distribution.

They can be added, but unless such spams appear in the masscheck corpora,
the rules won't be scored and distributed.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The first time I saw a bagpipe, I thought the player was torturing
an octopus. I was amazed they could scream so loudly.
-- cat_herder_5263 on Y! SCOX
-----------------------------------------------------------------------
6 days until the 68th anniversary of the end of World War II


rwmaillists at googlemail

Aug 9, 2013, 12:01 PM

Post #3 of 24 (72 views)
Re: New spam rule for specific content [In reply to]

On Fri, 9 Aug 2013 11:19:08 -0600
Amir 'CG' Caspi wrote:

> A number of my users have been receiving spam formatted in a
> very specific way which seems to very often miss Bayes... I don't
> know why, whether it's because of the HTML gibberish flooding Bayes
> with useless tokens (to reduce the relative strength of the spammy
> tokens), or if it's just the specific content isn't sufficiently
> spammy (or has sufficient ham to balance) to pop.

BAYES works on rendered text; it doesn't see the HTML.


> (Like many other users here, I've also increased the Bayes scores for
> Bayes99, and created a Bayes999 with even higher scoring... it might
> be time to add that to the general distribution, too.)

Do you actually get a significant amount of ham between 0.99 and 0.999?
Personally I only get 1 in 1000 above 0.55, and nothing above 0.65.


cepheid at 3phase

Aug 9, 2013, 4:32 PM

Post #4 of 24 (66 views)
Re: New spam rule for specific content [In reply to]

On Fri, August 9, 2013 1:01 pm, RW wrote:
> BAYES works on rendered text; it doesn't see the HTML.

Hmmm. So it doesn't see HTML comments, which are present in the HTML
source even though they are "invisible" when rendered? OK, in that case, I have NO idea
why the spam isn't hitting Bayes, because it looks pretty damn spammy to
me. I wonder if it's the heavy use of images, but I don't know.

> Do you actually get a significant amount of ham between 0.99 and 0.999?
> Personally I only get 1 in 1000 above 0.55, and nothing above 0.65.

Ham, absolutely not. So yes, I suppose I could just treat all Bayes99 as
if it were Bayes999 and score it more highly than I do. Right now I have
Bayes99 at 4, Bayes999 at 4.5. I could eliminate Bayes999 and make
Bayes99 score 4.5... but I do worry a little bit about FPs, even though I
guess I shouldn't, statistically speaking.

On the other hand, one could consider making Bayes999 a poison pill.
Generally spam will only rank there if you've learned something nearly
identical to it. At that point, perhaps it might be worth just scoring it
with 5 or higher (assuming your threshold is 5, as mine is).

--- Amir


cepheid at 3phase

Aug 9, 2013, 11:26 PM

Post #5 of 24 (65 views)
Re: New spam rule for specific content [In reply to]

At 10:41 AM -0700 08/09/2013, John Hardin wrote:
>Can you provide a spample or two?

Sure.
http://pastebin.com/VfSCB7fw
http://pastebin.com/VCtvzjzV

Note the "outl" and "outi" links near the very bottom. The actual
domains used in these URIs vary... they used to be .pw, but recently
most have been .biz (though I've also seen some .mobi and I think
some .tv and even some .us).

Note that both of these hit BAYES_50... and that's pretty common for
these spams. For whatever reason, they seem to
only hit BAYES_50 and very rarely get higher scores (occasionally
they will get lower scores, too). Perhaps it's because most of the
spam is actually in the embedded image, rather than in rendered
text...

These are also great examples of the "HTML comment gibberish" that
pervades all of these spams. If you have time, it would be great if
you could adapt your STYLE_GIBBERISH rules to catch HTML comment
gibberish. (Presumably, you'd want to make sure the gibberish is
sufficiently long, too.)

>They can be added but unless such spams appear in the masscheck
>corpora the rules won't be scored and distributed.

No idea if they're in the masscheck corpora... but I and my users
have been getting them for months. I imagine they're relatively
widespread...

Thanks.

--- Amir


cepheid at 3phase

Aug 10, 2013, 12:41 PM

Post #6 of 24 (62 views)
Re: New spam rule for specific content [In reply to]

At 10:41 AM -0700 08/09/2013, John Hardin wrote:
>Can you provide a spample or two?

Looks like a similar spam method has come out in recent weeks (since
Jul 30, it seems) that uses slightly different footers... example is
here:

http://pastebin.com/QCmSPzwG

Although running SA on this spam _NOW_ yields a high score beyond the
spam threshold, this is almost entirely because additional network
tests are now hitting (extra RBLs + Razor). This was not the case
when the spam was first processed... looks like I was one of the
earlier recipients.

For this type, looks like a good match would be on the combo of
"/land/" + "/unsub/" + "/report/" ... I have modified my rule from
yesterday as follows:

# Spammy URI patterns
uri __OUTL_URI /\/outl\b/
uri __OUTI_URI /\/outi\b/
uri __LAND_URI /\/land\//
uri __UNSUB_URI /\/unsub\//
uri __REPORT_URI /\/report\//
meta SPAMMY_URI_PATTERNS ((__OUTL_URI && __OUTI_URI) || (__LAND_URI && __UNSUB_URI && __REPORT_URI))
describe SPAMMY_URI_PATTERNS link combos match highly spammy template
score SPAMMY_URI_PATTERNS 3

This modification hits both types of templates. I will very likely
be adding further "spammy patterns" to this rule over time. I'll
keep the list posted if I find some other good ones.
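
Since more templates are likely to be added over time, the "any template whose patterns all hit" structure of the meta rule above can be sketched in Python as data plus one small function. The TEMPLATES list and the function name are illustrative assumptions, not part of SA, and the example URIs are made up.

```python
import re

# Each inner list is one spam template: all of its patterns must hit
# (on any URI in the message) for that template to count.
TEMPLATES = [
    [r"/outl\b", r"/outi\b"],
    [r"/land/", r"/unsub/", r"/report/"],
]


def spammy_uri_patterns(uris):
    """True if at least one template has every one of its patterns present."""
    def hits(pattern):
        return any(re.search(pattern, u) for u in uris)

    return any(all(hits(p) for p in group) for group in TEMPLATES)


# Hypothetical URIs, for illustration only.
sample = [
    "http://a.example/land/1",
    "http://a.example/unsub/1",
    "http://a.example/report/1",
]
print(spammy_uri_patterns(sample))
```

Adding a new template then means appending one more list, which mirrors adding one more parenthesised group to the meta expression.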


It looks like both this and the previous type of spam are bypassing
Bayes by embedding images and using no rendered text. Well, not NO
text, but very little, mostly a "successful delivery" message and the
unsub/report links. That is, Bayes sees absolutely no "spammy" text,
just the image which it cannot decode as spammy.

Are there any rules which can hit on "only embedded images with very
little text" ?? Not entirely sure how to capture this since it's
difficult to determine what is "not much" text and there is certainly
the potential for FPs that way (for example, anyone in the design
field sending images to clients without much text, etc.)...

But, these types of spams are bypassing SA consistently, to the tune
of tens per day per user. I would really love a way to stop them
besides hardcoding a rule based on their link syntax, which can be
easily changed during the next iteration of their spam template.

(The HTML comment gibberish rule would be a big step here, since
that's one of the few things that would distinguish this from ham...
unlikely that a real person would embed tens of KB of comment
gibberish.)

Thanks.

--- Amir


jhardin at impsec

Aug 10, 2013, 2:17 PM

Post #7 of 24 (62 views)
Re: New spam rule for specific content [In reply to]

On Sat, 10 Aug 2013, Amir 'CG' Caspi wrote:

> It looks like both this and the previous type of spam are bypassing Bayes by
> embedding images and using no rendered text. Well, not NO text, but very
> little, mostly a "successful delivery" message and the unsub/report links.
> That is, Bayes sees absolutely no "spammy" text, just the image which it
> cannot decode as spammy.

Perhaps it's time to bring FuzzyOCR up-to-date...?


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The social contract exists so that everyone doesn't have to squat
in the dust holding a spear to protect his woman and his meat all
day every day. It does not exist so that the government can take
your spear, your meat, and your woman because it knows better what
to do with them. -- Dagny @ Ace of Spades
-----------------------------------------------------------------------
5 days until the 68th anniversary of the end of World War II


cepheid at 3phase

Aug 10, 2013, 5:02 PM

Post #8 of 24 (57 views)
Re: New spam rule for specific content [In reply to]

At 2:17 PM -0700 08/10/2013, John Hardin wrote:
>
>Perhaps it's time to bring FuzzyOCR up-to-date...?

Is this something I need to manually update or something that needs
updating in the SA distribution?

Thanks.

--- Amir


jhardin at impsec

Aug 10, 2013, 8:23 PM

Post #9 of 24 (54 views)
Re: New spam rule for specific content [In reply to]

On Sat, 10 Aug 2013, Amir 'CG' Caspi wrote:

> At 2:17 PM -0700 08/10/2013, John Hardin wrote:
>
>> Perhaps it's time to bring FuzzyOCR up-to-date...?
>
> Is this something I need to manually update or something that needs updating
> in the SA distribution?

FuzzyOCR was a SA plugin a few years back. It would pass images through
OCR and, IIRC, pull words out of them into the generated body that SA
scans.

Spammers moved away from putting their spams into images, so it fell out
of use, and I don't think it works with the current release of SA. Also,
passing all attached images through OCR is a fairly heavy-weight process.

Now spammers seem to be moving back towards image spams, at least to a
degree.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...in the 2nd amendment the right to arms clause means you have
the right to choose how many arms you want, and the militia clause
means that Congress can punish you if the answer is "none."
-- David Hardy, 2nd Amendment scholar
-----------------------------------------------------------------------
5 days until the 68th anniversary of the end of World War II


cepheid at 3phase

Aug 11, 2013, 1:22 AM

Post #10 of 24 (54 views)
Re: New spam rule for specific content [In reply to]

At 1:41 PM -0600 08/10/2013, Amir 'CG' Caspi wrote:
>(The HTML comment gibberish rule would be a big step here, since
>that's one of the few things that would distinguish this from ham...
>unlikely that a real person would embed tens of KB of comment
>gibberish.)

OK, I'm trying to test an HTML comment gibberish rule and having some
problems. I'm using the following test spam, the same I showed
before:
http://pastebin.com/VCtvzjzV

I'm testing the following rule:

# HTML comment gibberish
rawbody HTML_COMMENT_GIBBERISH /<!--\s*(?:[\w'"?.:;-]+\s+){100,}\s*-->/im
tflags HTML_COMMENT_GIBBERISH multiple
describe HTML_COMMENT_GIBBERISH lots of spammy text in HTML comment
score HTML_COMMENT_GIBBERISH 0.001

Now, when I run this test spam through SA, I do get a hit, but only a
single hit... the rule is popping for the final HTML comment (the one
beginning with "Simpsons"). However, there are two other HTML
comments in this email, prior to the one that hit... for some reason,
they are not hitting, even though I've set tflags=multiple. (I was
considering having a meta rule that scored extra for multiple
comments.)

My regex is valid and appropriate for those comments... I tested it
at regexpal.com, which shows that all three comments match just fine
(all three get highlighted).

So... why is SA hitting only on the final comment, and ignoring the
first two? (I tried using a meta rule that popped if this hit more
than once, and that meta rule did not pop. SA debug output shows
only this one comment hitting, not the other two.) If my regex is
fine and I've got tflags=multiple, what's preventing the other
comments from hitting?
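
The "regex matches all three comments in another engine" observation can be reproduced outside SA with a small Python check. This is a sketch under assumptions: the comments are synthetic filler built by a made-up helper, and the whole body is searched as one string, which may not be how SA presents rawbody text to a rule. It therefore only confirms that the regex itself is sound; it does not explain the SA behaviour.

```python
import re

# The rule's regex transcribed to Python (/im becomes re.I | re.M).
HTML_COMMENT_GIBBERISH = re.compile(
    r"""<!--\s*(?:[\w'"?.:;-]+\s+){100,}\s*-->""", re.I | re.M
)


def make_comment(n_words):
    """Build a synthetic multi-line HTML comment with n_words filler words."""
    words = ["word%d" % i for i in range(n_words)]
    lines = [" ".join(words[i:i + 10]) for i in range(0, n_words, 10)]
    return "<!-- " + "\n".join(lines) + " -->"


def count_gibberish_comments(raw_html):
    """Count non-overlapping matches over the body taken as a single string."""
    return len(HTML_COMMENT_GIBBERISH.findall(raw_html))


body = (
    "<html><body>\n" + make_comment(120)
    + "\n<p>hi</p>\n" + make_comment(150)
    + "\n<img src='x.png'>\n" + make_comment(101)
    + "\n</body></html>"
)
print(count_gibberish_comments(body))
```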

Thanks.

--- Amir


me at junc

Aug 11, 2013, 8:10 AM

Post #11 of 24 (53 views)
Re: New spam rule for specific content [In reply to]

Amir 'CG' Caspi skrev den 2013-08-11 10:22:

> http://pastebin.com/VCtvzjzV


Content analysis details: (10.9 points, 5.0 required)

 pts rule name                 description
---- ------------------------- --------------------------------------------------
-0.0 RCVD_IN_MSPIKE_H3         RBL: Good reputation (+3)
                               [5.39.218.213 listed in wl.mailspike.net]
 0.1 RELAY_NL                  Relayed through NL
 0.5 MSG_ID_INSTAFILE_BIZ      spamming instafile.biz in message id
 0.5 STARS_ON_FORTY_FIVE       URI: contains 5 chars url at end
 0.1 STARS_ON_FORTY_FOOR       URI: contains 4 chars url at end
 0.1 HTML_ERROR_TAGS_X_HTML    RAW: error x-html not found on w3.org
 2.4 RAZOR2_CF_RANGE_E8_51_100 Razor2 gives engine 8 confidence level
                               above 50%
                               [cf: 100]
 0.4 RAZOR2_CF_RANGE_51_100    Razor2 gives confidence level above 50%
                               [cf: 100]
 1.7 RAZOR2_CHECK              Listed in Razor2 (http://razor.sf.net/)
-0.0 RCVD_IN_MSPIKE_WL         Mailspike good senders
 1.8 LONGWORDS                 Long string of long words
 2.0 MIME_NO_TEXT              No (properly identified) text body parts
 1.3 SAGREY                    Adds score to spam from first-time senders


I created MSG_ID_INSTAFILE_BIZ and HTML_ERROR_TAGS_X_HTML, but even
without these rules it's spam.


cepheid at 3phase

Aug 11, 2013, 11:08 AM

Post #12 of 24 (50 views)
Re: New spam rule for specific content [In reply to]

On Aug 11, 2013, at 9:10 AM, Benny Pedersen <me [at] junc> wrote:

> I created MSG_ID_INSTAFILE_BIZ and HTML_ERROR_TAGS_X_HTML, but even without these rules it's spam.

It is NOW, it was not when it was originally processed, as you can see from the SA headers included in the pastebin. If you read the messages I sent earlier, the network tests did not all hit because the spam was too young (had not yet been reported to all the services). LONGWORDS also did not hit for some reason, see the second email I sent regarding this (the test seems to not work properly on MIME content). Without these, and because this is an image-based spam that evades Bayes, the message did not pass the spam threshold originally, even though it does now.

My question is not whether this is spam. My question is why the new HTML_COMMENT_GIBBERISH rule only got one hit on the third comment when it should have hit all three comments...

Thanks.

--- Amir


cepheid at 3phase

Aug 11, 2013, 5:54 PM

Post #13 of 24 (34 views)
Re: New spam rule for specific content [In reply to]

At 2:22 AM -0600 08/11/2013, Amir 'CG' Caspi wrote:
>My regex is valid and appropriate for those comments... I tested it
>at regexpal.com, which shows that all three comments match just fine
>(all three get highlighted).
>
>So... why is SA hitting only on the final comment, and ignoring the first two?

Further confusion. Received another of these types of spam today:

http://pastebin.com/YywcFkui

My new HTML_COMMENT_GIBBERISH rule didn't hit on this one at all.
Running the email through regexpal.com shows that the regex _DOES_
hit the comment. Why is this failing in SA even though it works in
other environments? Is there something that Perl doesn't like about
my regex syntax but that works fine in JavaScript?

Whatever is causing this to fail is probably the same thing causing
only the single (versus triple) hit on the previous example.

Your help in debugging would be greatly appreciated...

Thanks!

--- Amir


mysqlstudent at gmail

Aug 11, 2013, 6:31 PM

Post #14 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

Hi,

> Further confusion. Received another of these types of spam today:
>
> http://pastebin.com/YywcFkui
>
> My new HTML_COMMENT_GIBBERISH rule didn't hit on this one at all. Running

Can you post this rule again so we can investigate?

How do you find the SPAMMY_URI_PATTERNS rule is performing? It seems
very prone to FPs.

Why is there no BAYES score?

Are you using sqlgrey? If not, it's incredible and you should try it.

Regards,
Alex


cepheid at 3phase

Aug 11, 2013, 6:44 PM

Post #15 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

At 9:31 PM -0400 08/11/2013, Alex wrote:
>Can you post this rule again so we can investigate?

# HTML comment gibberish
# Looks for sequence of 100 or more "words" (alphanum + punct
# separated by whitespace) within HTML comment
rawbody HTML_COMMENT_GIBBERISH /<!--\s*(?:[\w'"?!.:;-]+\s+){100,}\s*-->/im
describe HTML_COMMENT_GIBBERISH lots of spammy text in HTML comment
score HTML_COMMENT_GIBBERISH 0.001

regexpal says my rule matches the comment. SA doesn't agree.

>How do you find the SPAMMY_URI_PATTERNS rule is performing? It seems
>very prone to FPs.

It's performing quite well for me... I haven't seen any FPs on it.
The patterns are based on specific spam templates... one looks for
/outl and /outi URIs, the other is /land/ + /unsub/ + /report/ ...
these URIs have to occur in combination. You are correct that it has
the potential for FPs but I haven't seen any so far.

>Why is there no BAYES score?

I ran this test through the root account which does not have a Bayes
DB, so there's no Bayes score. There was a Bayes score on the
original email, which was Bayes50 just like every other one of these
types of spams (no real text, just a spammy image which SA isn't
decoding).

>Are you using sqlgrey? If not, it's incredible and you should try it.

I have not implemented any sort of greylisting yet. I can't use
sqlgrey because I don't use postfix... my server runs sendmail. I'm
sure there are some good sendmail-compatible greylisters but I
haven't tried them yet... I'm a bit worried about legitimate email
getting bounced. I'm sure I'll get to it in due course, though...

Thanks.

--- Amir


jhardin at impsec

Aug 11, 2013, 6:56 PM

Post #16 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

On Sun, 11 Aug 2013, Amir 'CG' Caspi wrote:

> At 2:22 AM -0600 08/11/2013, Amir 'CG' Caspi wrote:
>> My regex is valid and appropriate for those comments... I tested it at
>> regexpal.com, which shows that all three comments match just fine (all
>> three get highlighted).
>>
>> So... why is SA hitting only on the final comment, and ignoring the first
>> two?
>
> Further confusion. Received another of these types of spam today:
>
> http://pastebin.com/YywcFkui
>
> My new HTML_COMMENT_GIBBERISH rule didn't hit on this one at all.

Thanks for the samples, and apologies for the tardy reply.

A COMMENT_GIBBERISH rule has been in my sandbox for a while now, but it is
not performing well in masscheck.

I broadened it a bit per your samples and it hits all of them now. We'll
see if this change improves the masscheck performance. I'm also going to
make FP-avoidance changes that should also help.

> Running the email through regexpal.com shows that the regex _DOES_ hit
> the comment. Why is this failing in SA even though it works in other
> environments? Is there something that Perl doesn't like about my regex
> syntax but that works fine in JavaScript?

I haven't tested your rule yet, but I have a comment: you are trying a bit
too hard. Don't worry about matching all the way to the end of the
comment. You don't care about gibberish past the first 100 "words". Just
make sure that the rule does not match the --> comment-end token, and stop
at 100 matched words. Past that it doesn't matter.
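
The "stop at 100 words" advice above can be illustrated with a Python sketch. This is not the sandbox rule mentioned in this post; it is Amir's posted regex with the trailing part dropped and the count fixed at 100, and the sample comments are synthetic.

```python
import re

# Exactly 100 "words", with no attempt to reach the closing "-->".
# ">" is not in the character class and is not whitespace, so no
# word-plus-whitespace group can carry the match past a "-->".
FIRST_100_WORDS = re.compile(r"""<!--\s*(?:[\w'"?!.:;-]+\s+){100}""")


def has_long_comment(raw_html):
    """True if some HTML comment starts with at least 100 'words'."""
    return FIRST_100_WORDS.search(raw_html) is not None


long_c = "<!-- " + " ".join("w%d" % i for i in range(150)) + " -->"
short_c = "<!-- " + " ".join("w%d" % i for i in range(50)) + " -->"

m = FIRST_100_WORDS.search(long_c)
print(m.end(), len(long_c))       # the match stops well before the comment ends
print(has_long_comment(short_c))  # a 50-word comment does not qualify
```

The remaining 50 words of the long comment are never examined, which is the saving being described.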

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The fetters imposed on liberty at home have ever been forged out
of the weapons provided for defense against real, pretended, or
imaginary dangers from abroad. -- James Madison, 1799
-----------------------------------------------------------------------
4 days until the 68th anniversary of the end of World War II


cepheid at 3phase

Aug 11, 2013, 7:04 PM

Post #17 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

At 6:56 PM -0700 08/11/2013, John Hardin wrote:
>I'm also going to make FP-avoidance changes that should also help.

Care to share? =)

>Just make sure that the rule does not match the --> comment-end token

I tried doing that and it caused SA to hang... couldn't figure out
why the regex wasn't working, but for whatever reason, it wasn't. I
figured it was easier to just match the entire comment.
Is there any particular reason to NOT match the entire
comment? That is, does it save resources (CPU, RAM, etc.) to match
only partial content?

Note that you do want to allow HTML tags within the comment... my
rule doesn't actually allow that, but I've seen spams with HTML tags
(mostly <p> and <div>) in the comments... we don't want to exclude
those.

Care to post your updated rule?

Either way, I would still love to know why my rule isn't hitting on this...

Thanks.

--- Amir


jhardin at impsec

Aug 11, 2013, 7:07 PM

Post #18 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

On Sun, 11 Aug 2013, Amir 'CG' Caspi wrote:

> At 9:31 PM -0400 08/11/2013, Alex wrote:
>> Are you using sqlgrey? If not, it's incredible and you should try it.
>
> I have not implemented any sort of greylisting yet. I can't use sqlgrey
> because I don't use postfix... my server runs sendmail. I'm sure there are
> some good sendmail-compatible greylisters but I haven't tried them yet...

milter-greylist is what I use, it seems to do the job, and it does reduce
the spam volume.

> I'm a bit worried about legitimate email getting bounced.

The only problem would be with a sending MTA that either is badly
misconfigured or cannot properly deal with a tempfail result and either
bounces the message as undeliverable or (worse) quietly drops it.

Sadly there are some major players with this problem who are apparently
uninterested in fixing their systems. I suggest you do a bit of research
on whitelists for greylisting before implementation. You would also
probably want to whitelist known regular correspondents.

There's also the need to set your users' expectations. They should be
trained that email is *not*, and is not intended to be, instantaneous.
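
For anyone unfamiliar with why greylisting depends on the sender handling a tempfail correctly, here is a minimal Python sketch of the general technique. It is not how milter-greylist is implemented; the class name, the 300-second delay, the whitelist-by-sender shortcut and the sample addresses are all assumptions made for illustration.

```python
class Greylist:
    """Toy greylist keyed on the (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay   # seconds a new triplet must wait
        self.first_seen = {}         # triplet -> time of first attempt
        self.whitelist = set()       # senders that skip greylisting

    def check(self, client_ip, sender, recipient, now):
        if sender in self.whitelist:
            return "accept"
        triplet = (client_ip, sender, recipient)
        # Remember the first attempt; later attempts reuse that timestamp.
        first = self.first_seen.setdefault(triplet, now)
        if now - first >= self.min_delay:
            return "accept"
        # A well-behaved MTA queues the message and retries later.
        return "tempfail"


g = Greylist(min_delay=300)
print(g.check("192.0.2.1", "a@example.org", "b@example.net", now=0))
print(g.check("192.0.2.1", "a@example.org", "b@example.net", now=400))
```

A sender that treats the first "tempfail" as a permanent failure never makes the second attempt, which is the failure mode described above and the reason for whitelisting known correspondents.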


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The fetters imposed on liberty at home have ever been forged out
of the weapons provided for defense against real, pretended, or
imaginary dangers from abroad. -- James Madison, 1799
-----------------------------------------------------------------------
4 days until the 68th anniversary of the end of World War II


jhardin at impsec

Aug 11, 2013, 7:20 PM

Post #19 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

On Sun, 11 Aug 2013, Amir 'CG' Caspi wrote:

> At 6:56 PM -0700 08/11/2013, John Hardin wrote:
>> I'm also going to make FP-avoidance changes that should also help.
>
> Care to share? =)

Everything is publicly visible in my sandbox:
http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/jhardin/

The results for the rule set are here:
http://ruleqa.spamassassin.org/detail?rule=%2FCOMMENT_GIBBERISH&srcpath=jhardin

>> Just make sure that the rule does not match the --> comment-end token
>
> I tried doing that and it caused SA to hang... couldn't figure out why the
> regex wasn't working, but for whatever reason, it wasn't.

The unbounded matches you're using probably caused the RE engine to get
stuck backing off and retrying. REs are by default "greedy", they try to
match as much as possible.

In general it is a *VERY BAD* idea to use "*" or "+" in SA REs; they are
only really safe in rules that process data that is already limited in
size, like uri rules or header rules that look at a specific header. Make
it a habit to use bounded matches, {0,n} rather than "*" and {1,n} rather
than "+". The upper bound of {n} will limit how much the engine will back
off and retry.
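
As an illustration of the bounded-match habit described above, here is a Python sketch of a comment regex in which every "*" and "+" has been replaced by {0,n} or {1,n}. The specific bounds (20, 30, 10) are arbitrary values picked for the example, not tuned numbers, and this is not the rule in the sandbox.

```python
import re

# Every quantifier has an explicit upper bound:
#   \s{0,20}  instead of \s*
#   {1,30}    instead of + on the word characters
#   \s{1,10}  instead of \s+
BOUNDED = re.compile(r"""<!--\s{0,20}(?:[\w'"?!.:;-]{1,30}\s{1,10}){100}""")


def bounded_hit(raw_html):
    """True if a comment opens with 100 bounded-length 'words'."""
    return BOUNDED.search(raw_html) is not None


normal = "<!-- " + " ".join("w%d" % i for i in range(120)) + " -->"
long_words = "<!-- " + " ".join("x" * 40 for _ in range(120)) + " -->"

print(bounded_hit(normal))      # ordinary short words are matched
print(bounded_hit(long_words))  # 40-char "words" exceed the {1,30} bound
```

The trade-off is visible in the second case: anything longer than the chosen bound no longer matches, so the bounds need to be generous enough for the spam being targeted.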

Our rules are similar, take a look at what I have in the sandbox.

> I figured it was easier to just match the entire comment.
> Is there any particular reason to NOT match the entire comment? That
> is, does it save resources (CPU, RAM, etc.) to match only partial content?

It does. The less text you match beyond what you need to, the less
processing is performed. Nothing is done with the matched text, so the
extra work done matching all the way to the end of the comment is wasted.

> Note that you do want to allow HTML tags within the comment... my rule
> doesn't actually allow that, but I've seen spams with HTML tags (mostly <p>
> and <div>) in the comments... we don't want to exclude those.

Yuck. Can you pastebin spamples, if you still have them?


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Efficiency can magnify good, but it magnifies evil just as well.
So, we should not be surprised to find that modern electronic
communication magnifies stupidity as *efficiently* as it magnifies
intelligence. -- Robert A. Matern
-----------------------------------------------------------------------
4 days until the 68th anniversary of the end of World War II


cepheid at 3phase

Aug 11, 2013, 7:46 PM

Post #20 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

At 7:20 PM -0700 08/11/2013, John Hardin wrote:
>The unbounded matches you're using probably caused the RE engine to
>get stuck backing off and retrying.

That's what I figured. That's why I changed things to the current
version, which is "bounded" by the end-tag of the comment. My
current version doesn't take long to run.


>Yuck. Can you pastebin spamples, if you still have them?

Here's one that comes to mind:

http://pastebin.com/zVEH2h02

I have a couple of others but they look like they're from the same
template, so I don't think it's useful to post.

--- Amir


jhardin at impsec

Aug 11, 2013, 8:23 PM

Post #21 of 24 (32 views)
Re: New spam rule for specific content [In reply to]

On Sun, 11 Aug 2013, Amir 'CG' Caspi wrote:

> At 7:20 PM -0700 08/11/2013, John Hardin wrote:
>> Yuck. Can you pastebin spamples, if you still have them?
>
> Here's one that comes to mind:
>
> http://pastebin.com/zVEH2h02

That's going to be problematic as the comment isn't gibberish, it's a
bunch of properly-formed sentences.

However, I may be taking too conservative a stance here. It's possible
that, while HTML comments can appear in ham, *long* HTML comments won't,
and the fact that we're looking for long blocks of comment text is enough
safety.

I'll play around with that sample and see what happens.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
[People] are socialists because they are blinded by envy and
ignorance. -- economist Ludwig von Mises (1881-1973)
-----------------------------------------------------------------------
4 days until the 68th anniversary of the end of World War II


cepheid at 3phase

Aug 11, 2013, 9:26 PM

Post #22 of 24 (31 views)
Re: New spam rule for specific content [In reply to]

At 8:23 PM -0700 08/11/2013, John Hardin wrote:
>However, I may be taking too conservative a stance here. It's
>possible that, while HTML comments can appear in ham, *long* HTML
>comments won't, and the fact that we're looking for long blocks of
>comment text is enough safety.

That's my feeling. You'll notice my rule is dumb: it's simply
looking for a bunch of stuff in a comment. My main feeling is that
if anyone is sending HTML email with LOTS of stuff commented out,
that email is almost certainly spam. Ham HTML email would probably
be done with more care.

Yes, there's the chance for FPs, if some company decides to send a
legitimate (ham, opt-in, etc.) HTML email from a badly-written
template where the designer was a lazy bum and left giant
commented-out sections... but would you really want such an email
anyway? ;-)

Thanks.

--- Amir


kdeugau at vianet

Aug 12, 2013, 7:37 AM

Post #23 of 24 (17 views)
Re: New spam rule for specific content [In reply to]

Amir 'CG' Caspi wrote:
> My main feeling is that if anyone is
> sending HTML email with LOTS of stuff commented out, that email is
> almost certainly spam. Ham HTML email would probably be done with more
> care.

*snigger* Take a look at the raw source from a message sent with
Outlook (especially one with "stationery") and say that again...

I've had to heavily alter or outright discard a number of otherwise
useful rules along the lines discussed in this thread due to Outlook FPs.

-kgd


jhardin at impsec

Aug 12, 2013, 7:48 AM

Post #24 of 24 (16 views)
Re: New spam rule for specific content [In reply to]

On Mon, 12 Aug 2013, Kris Deugau wrote:

> Amir 'CG' Caspi wrote:
>> My main feeling is that if anyone is
>> sending HTML email with LOTS of stuff commented out, that email is
>> almost certainly spam. Ham HTML email would probably be done with more
>> care.
>
> *snigger* Take a look at the raw source from a message sent with
> Outlook (especially one with "stationery") and say that again...
>
> I've had to heavily alter or outright discard a number of otherwise
> useful rules along the lines discussed in this thread due to Outlook FPs.

This was my worry, too.

In a word: "Microsoft"

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Liberals love sex ed because it teaches kids to be safe around their
sex organs. Conservatives love gun education because it teaches kids
to be safe around guns. However, both believe that the other's
education goals lead to dangers too terrible to contemplate.
-----------------------------------------------------------------------
3 days until the 68th anniversary of the end of World War II
