
Mailing List Archive: SpamAssassin: users

Back on DNSBL overlap

 

 



antispam at khopis

Nov 16, 2009, 4:26 PM

Post #1 of 3 (802 views)
Back on DNSBL overlap

Warren reported:
>> SPAM%     HAM%     RANK  RULE
>> 12.8342%  0.0021%  0.94  RCVD_IN_PSBL *
>> 12.3053%  0.0026%  0.94  RCVD_IN_XBL
>> 31.2499%  0.0827%  0.87  RCVD_IN_ANBREP_BL *2
>> 80.2578%  0.1485%  0.86  RCVD_IN_PBL
>> 27.1836%  0.1985%  0.79  RCVD_IN_SORBS_DUL
>> 19.8213%  0.1785%  0.79  RCVD_IN_SEMBLACK *
>> 90.9360%  0.3854%  0.77  RCVD_IN_BRBL_LASTEXT
>> 13.0564%  0.4838%  0.67  RCVD_IN_HOSTKARMA_BL *

Justin requested:
> any chance you could post the S/O ratios? RANK is a bit "unportable",
> as it depends on other rules in the ruleset at the time the
> measurement takes place.

I agree with Warren that S/O isn't as useful here. Even SPAM% is
just a minimum threshold. For this specific problem, HAM% is the
most useful of the standard stats that MassCheck produces.

That said, I think that a measure of independence from the other
RCVD_IN_* rules is an even better metric. How many hits are unique
(among DNSBLs) to each DNSBL?
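A minimal sketch of how that "unique among DNSBLs" count could be tabulated, in Python. The input format (one set of rule names per message) and the sample data are assumptions made for illustration; this is not the MassCheck log format.

```python
from collections import Counter

def unique_hit_counts(messages, dnsbl_rules):
    """Count, per DNSBL rule, the messages where it is the only DNSBL that hit."""
    dnsbl_rules = set(dnsbl_rules)
    counts = Counter({rule: 0 for rule in dnsbl_rules})
    for hits in messages:
        dnsbl_hits = set(hits) & dnsbl_rules  # ignore non-DNSBL rules
        if len(dnsbl_hits) == 1:
            counts[next(iter(dnsbl_hits))] += 1
    return dict(counts)

# Hypothetical corpus: each entry is the set of rules one message hit.
sample = [
    {"RCVD_IN_PBL", "RCVD_IN_BRBL_LASTEXT"},
    {"RCVD_IN_BRBL_LASTEXT"},
    {"RCVD_IN_XBL", "HTML_MESSAGE"},
    {"RCVD_IN_PBL", "RCVD_IN_XBL", "RCVD_IN_BRBL_LASTEXT"},
    {"HTML_MESSAGE"},
]
rules = ["RCVD_IN_PBL", "RCVD_IN_XBL", "RCVD_IN_BRBL_LASTEXT"]
unique = unique_hit_counts(sample, rules)
```

Dividing each count by the corpus size would give a "unique SPAM%" (or "unique HAM%") per list.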

I do this kind of thing manually on occasion since the data is usually
at the bottom of the detail page for each rule, e.g. the last network
test (20091114-r836144-n, http://tinyurl.com/yfef2ef ) reveals:


86% of BRBL_LASTEXT appears in PBL
97% of PBL appears in BRBL_LASTEXT
The non-overlapping 3% of PBL means SPAM% = 2.4077

29% of BRBL_LASTEXT appears in SORBS_DUL
97% of SORBS_DUL appears in BRBL_LASTEXT
The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155

24% of BRBL_LASTEXT appears in SORBS_WEB
94% of SORBS_WEB appears in BRBL_LASTEXT
The non-overlapping 6% of SORBS_WEB means SPAM% = 1.4041

33% of BRBL_LASTEXT appears in ANBREP_BL
97% of ANBREP_BL appears in BRBL_LASTEXT
The non-overlapping 3% of ANBREP_BL means SPAM% = 0.9075

21% of BRBL_LASTEXT appears in SEMBLACK
96% of SEMBLACK appears in BRBL_LASTEXT
The non-overlapping 4% of SEMBLACK means SPAM% = 0.7129

20% of BRBL_LASTEXT appears in ANBREP_L3
98% of ANBREP_L3 appears in BRBL_LASTEXT
The non-overlapping 2% of ANBREP_L3 means SPAM% = 0.3725

(Fetched from other pages)
18% of BRBL_LASTEXT appears in BL_SPAMCOP_NET
95% of BL_SPAMCOP_NET appears in BRBL_LASTEXT
The non-overlapping 5% of BL_SPAMCOP_NET means SPAM% = 0.8711

33% of PBL appears in SORBS_DUL
97% of SORBS_DUL appears in PBL
The non-overlapping 3% of SORBS_DUL means SPAM% = 0.8155
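The residual figures above are each list's SPAM% multiplied by the fraction that does not overlap. A small Python check, using the SPAM% values from Warren's table; only PBL and SORBS_DUL are reproduced here, since the other residuals rely on SPAM% values from the network-test pages rather than that table. The function name is made up for illustration.

```python
def residual_spam_pct(spam_pct, overlap):
    """SPAM% left after discarding the fraction that another list also hits."""
    return spam_pct * (1.0 - overlap)

# SPAM% from Warren's table; 97% overlap with BRBL_LASTEXT from the network test.
pbl_residual = residual_spam_pct(80.2578, 0.97)
sorbs_dul_residual = residual_spam_pct(27.1836, 0.97)

print(f"PBL outside BRBL_LASTEXT:       {pbl_residual:.4f}")        # 2.4077
print(f"SORBS_DUL outside BRBL_LASTEXT: {sorbs_dul_residual:.4f}")  # 0.8155
```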

I can't go a step further and figure out how much of SORBS_DUL is
hitting independently of /both/ BRBL and PBL, other than that it's
between 0.0245 and 0.8155. Removing the 27% overlap with SEMBLACK
reduces that to between 0.0179 and 0.8155, ...
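The bounds quoted for SORBS_DUL can be reproduced as follows. The lower figures match what you get by multiplying the non-overlapping fractions together, i.e. treating the overlaps as statistically independent; that reading is an inference from the numbers, not something stated in the test output.

```python
SORBS_DUL_SPAM_PCT = 27.1836  # from Warren's table

outside_brbl = 1.0 - 0.97      # SORBS_DUL hits not in BRBL_LASTEXT
outside_pbl = 1.0 - 0.97       # SORBS_DUL hits not in PBL
outside_semblack = 1.0 - 0.27  # SORBS_DUL hits not in SEMBLACK

# Upper bound: the hits missed by BRBL and by PBL are the very same hits.
upper = SORBS_DUL_SPAM_PCT * outside_brbl

# Lower figures: assume the misses are independent and multiply them.
lower_brbl_pbl = SORBS_DUL_SPAM_PCT * outside_brbl * outside_pbl
lower_with_semblack = lower_brbl_pbl * outside_semblack

print(f"{lower_brbl_pbl:.4f} .. {upper:.4f}")       # 0.0245 .. 0.8155
print(f"{lower_with_semblack:.4f} .. {upper:.4f}")  # 0.0179 .. 0.8155
```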

This is a two-way street; rather than using BRBL + PBL + SEMBLACK to
reduce SORBS_DUL, we can reduce BRBL: Removing just PBL from
BRBL_LASTEXT's impressive 91% match of the spam corpus reduces that to
12.7310. Removing the rest of the aforementioned DNSBLs could reduce
that down to as low as 2.3853.
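Both BRBL figures can be reproduced by chaining the overlap percentages listed earlier. As a sketch: the product assumes each list's overlap with BRBL_LASTEXT is independent of the others, which is why it is only a floor ("as low as"), not a measurement.

```python
import math

BRBL_SPAM_PCT = 90.9360  # from Warren's table

# Fraction of BRBL_LASTEXT hits that also appear in each list (network test).
overlaps = {
    "PBL": 0.86,
    "SORBS_DUL": 0.29,
    "SORBS_WEB": 0.24,
    "ANBREP_BL": 0.33,
    "SEMBLACK": 0.21,
    "ANBREP_L3": 0.20,
    "BL_SPAMCOP_NET": 0.18,
}

after_pbl = BRBL_SPAM_PCT * (1.0 - overlaps["PBL"])
after_all = BRBL_SPAM_PCT * math.prod(1.0 - f for f in overlaps.values())

print(f"after removing PBL: {after_pbl:.4f}")  # 12.7310
print(f"after removing all: {after_all:.4f}")  # 2.3853
```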


So this brings me right back to my older point: DNSBLs catch the same
culprits even given completely independent spamtrap configurations.
While I've called this "incestuous" in the past, that term leads one
to the wrong conclusion. They're not syndicating each other (and if
they are, either they need to be called out on it or they need to both
have permission and also advertise that fact), they just attract the
same spammers because they use the same methods.

My hypothesis, which I've anecdotally proven on my own deployment, is
that the flaws are repeated as well. Spammers that trigger spamtraps
on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
servers that also deal with legitimate traffic. This means that
thanks to these similar indexing techniques, DNSBL overlap from
spammers' abuse of a non-spam-exclusive server can single-handedly
mark a ham as spam.

My "solution" is to counter-intuitively *remove* points from messages
that hit too many DNSBLs. They still net quite a positive score, but
that score is effectively capped at something not quite high enough to
kill a ham with DNSBLs alone.
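The effect of such an adjustor, sketched in Python. The scores and the cap value are hypothetical, and this models only the arithmetic; it is not the actual rule set described above.

```python
def capped_dnsbl_score(hit_scores, cap):
    """Return (capped total, negative adjustment needed to reach it)."""
    total = sum(hit_scores)
    capped = min(total, cap)
    return capped, capped - total

# Hypothetical: three DNSBL hits whose scores would together exceed the cap.
capped, adjustment = capped_dnsbl_score([1.6, 2.0, 1.5], cap=4.0)

# A single hit is left untouched.
single, no_adjustment = capped_dnsbl_score([1.6], cap=4.0)
```

With a cap below the spam threshold, DNSBL hits alone can never mark a message as spam, yet each additional rule of another kind still counts in full.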

A more elegant version of this, which Karsten and I theorize might
even happen automatically (as scored by the GA) if I were to check my
adjustor into SVN, would be to reduce most of the points on the DNSBLs
and add them back with a meta rule containing a union of the DNSBL
rules (without a "multiple" tflag).
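A rough model of that variant in Python: each DNSBL keeps only a small score of its own, and a shared bonus is added once if any of them hit. All rule scores and the bonus are hypothetical values chosen for illustration.

```python
def score_with_union_meta(hits, reduced_scores, union_bonus):
    """Sum reduced per-list scores, plus one shared bonus if any list hit."""
    dnsbl_hits = [rule for rule in hits if rule in reduced_scores]
    total = sum(reduced_scores[rule] for rule in dnsbl_hits)
    if dnsbl_hits:
        total += union_bonus  # fires once, no matter how many lists hit
    return total

# Hypothetical reduced scores and shared bonus.
reduced = {"RCVD_IN_PBL": 0.3, "RCVD_IN_BRBL_LASTEXT": 0.3, "RCVD_IN_XBL": 0.3}
bonus = 1.5

one_hit = score_with_union_meta({"RCVD_IN_PBL"}, reduced, bonus)
three_hits = score_with_union_meta(set(reduced), reduced, bonus)
no_hits = score_with_union_meta({"HTML_MESSAGE"}, reduced, bonus)
```

The first hit carries most of the weight and each further list adds only a little, so heavy overlap cannot pile up points the way independent per-list scores do.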


wtogami at redhat

Nov 16, 2009, 7:22 PM

Post #2 of 3 (775 views)
Re: Back on DNSBL overlap [In reply to]

On 11/16/2009 07:26 PM, Adam Katz wrote:
>
> My hypothesis, which I've anecdotally proven on my own deployment, is
> that the flaws are repeated as well. Spammers that trigger spamtraps
> on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
> servers that also deal with legitimate traffic. This means that
> thanks to these similar indexing techniques, DNSBL overlap from
> spammers' abuse of a non-spam-exclusive server can single-handedly
> mark a ham as spam.
>
> My "solution" is to counter-intuitively *remove* points from messages
> that hit too many DNSBLs. They still net quite a positive score, but
> that score is effectively capped at something not quite high enough to
> kill a ham with DNSBLs alone.
>
> A more elegant version of this, which Karsten and I theorize might
> even happen automatically (as scored by the GA) if I were to check my
> adjustor into SVN, would be to reduce most of the points on the DNSBLs
> and add them back with a meta rule containing a union of the DNSBL
> rules (without a "multiple" tflag).

I think there is a lot of merit to this approach, and it might even be a
great idea. But I spoke with a machine learning expert and heard some
interesting things on this topic.

We held a small workshop yesterday in which she explained Logistic
Regression and how it might be applied to automated rescoring of
spamassassin's rules. The most intriguing aspect of her explanation was
the suggestion of using a logarithmic function in weight scoring. I
asked specifically about this issue of overlap (like BRBL_LASTEXT with
every other list) and she suggested this particular method of rescoring
wouldn't have an issue with overlap.
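To illustrate why a jointly fitted model need not double-count overlapping rules, here is a toy logistic regression in plain Python. It is only a sketch under stated assumptions: the corpus counts are invented, the overlap is total (two lists hitting exactly the same messages), and the gradient-descent fitter is hand-rolled rather than the existing library mentioned in this post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(groups, n_features, lr=0.5, steps=20000):
    """Batch gradient descent; groups are (features, label, count) tuples."""
    weights = [0.0] * n_features
    bias = 0.0
    total = sum(count for _, _, count in groups)
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label, count in groups:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = (sigmoid(z) - label) * count
            grad_b += error
            for i, x in enumerate(features):
                grad_w[i] += error * x
        bias -= lr * grad_b / total
        weights = [w - lr * g / total for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical corpus: the list hits 90 spam / 10 ham and misses 20 spam / 80 ham.
one_list = [((1,), 1, 90), ((1,), 0, 10), ((0,), 1, 20), ((0,), 0, 80)]
# Same corpus, plus a second list that hits exactly the same messages.
two_lists = [((x[0], x[0]), label, count) for x, label, count in one_list]

w_single, _ = fit_logistic(one_list, 1)
w_double, _ = fit_logistic(two_lists, 2)
```

In this toy case the two duplicated rules end up sharing the weight that the single rule had on its own, so a message hitting both scores the same as before. Real DNSBL overlap is partial, so the split would be less tidy.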

I believe you mentioned logarithmic scoring in an earlier discussion?

It appears that we have a few very smart people interested in
implementing an alternative rescorer using Logistic Regression. We plan
on using an existing library for the bulk of this implementation.

I think we should proceed with our current generated scores for 3.3.0.
After that we can compare the effectiveness of different approaches
including your proposal.

Specifically on the issue of overlapping DNSBLs, there might be a few
possibilities:

* Overlapping DNSBLs really are no problem with any method of scoring.
* Overlapping DNSBLs are only a slight problem with any method of
scoring, but if a host is blacklisted by more than one major DNSBL
they have serious issues they need to fix and we shouldn't try to
work around that for their benefit.
* Overlapping DNSBLs are a real problem, but logarithmic scoring avoids
it as an issue.

rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0

This apparently was set manually. It appears that BRBL did not yet
exist as a rule when spamassassin-3.2.x was scored. Meanwhile our new
GA scores resulted in:

score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2

This is relatively modest. Combined with just one other DNSBL, it
will not clearly push a message above 5 points. I might suggest manually
adjusting down BRBL or PBL so it requires one additional tiny score to
push it over the edge. I'm personally comfortable enough to outright
reject mail from a Spamhaus-listed host. Given this bias, it is
sufficiently cautious in my book to treat PBL + BRBL alone as insufficient.
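For reading the four-value score line quoted above: SpamAssassin score lines list one value per score set, in the order (no Bayes, no network), (no Bayes, network), (Bayes, no network), (Bayes, network). A small Python sketch that parses the line and shows how far the network-only score is from the 5-point threshold mentioned here; the parser and its names are illustrative, not part of SpamAssassin.

```python
SCORE_SETS = (
    "no Bayes, no network",
    "no Bayes, network",
    "Bayes, no network",
    "Bayes, network",
)

def parse_score_line(line):
    """Parse 'score RULE v' or 'score RULE v0 v1 v2 v3', ignoring '#' comments."""
    fields = line.split("#", 1)[0].split()
    if len(fields) not in (3, 6) or fields[0] != "score":
        raise ValueError(f"not a score line: {line!r}")
    values = [float(v) for v in fields[2:]]
    if len(values) == 1:
        values = values * 4  # a single score applies to every score set
    return fields[1], dict(zip(SCORE_SETS, values))

rule, scores = parse_score_line(
    "score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2"
)

# Points other rules must still contribute to reach a 5-point threshold.
headroom = 5.0 - scores["no Bayes, network"]
```

The zeros in the first and third positions simply reflect that a DNSBL lookup is a network test and is not run when network tests are disabled.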

Warren Togami
wtogami [at] redhat


jm at jmason

Nov 18, 2009, 12:42 PM

Post #3 of 3 (755 views)
Re: Back on DNSBL overlap [In reply to]

can't type much as i've broken my elbow (oh noes!) -- but we talked in
the past about using an LR engine for rescoring. not sure if that got
anywhere though.

btw be aware also that there was a perceptron rescorer, but it
produced more fragile scores than the ga; see 3.2.0 rescoring ticket
for history

--j

On Tue, Nov 17, 2009 at 03:22, Warren Togami <wtogami [at] redhat> wrote:
> On 11/16/2009 07:26 PM, Adam Katz wrote:
>>
>> My hypothesis, which I've anecdotally proven on my own deployment, is
>> that the flaws are repeated as well.  Spammers that trigger spamtraps
>> on multiple DNSBLs (and URIBLs) may be sending from (or linking to)
>> servers that also deal with legitimate traffic.  This means that
>> thanks to these similar indexing techniques, DNSBL overlap from
>> spammers' abuse of a non-spam-exclusive server can single-handedly
>> mark a ham as spam.
>>
>> My "solution" is to counter-intuitively *remove* points from messages
>> that hit too many DNSBLs.  They still net quite a positive score, but
>> that score is effectively capped at something not quite high enough to
>> kill a ham with DNSBLs alone.
>>
>> A more elegant version of this, which Karsten and I theorize might
>> even happen automatically (as scored by the GA) if I were to check my
>> adjustor into SVN, would be to reduce most of the points on the DNSBLs
>> and add them back with a meta rule containing a union of the DNSBL
>> rules (without a "multiple" tflag).
>
> I think there is a lot of merit to this approach, and it might even be a
> great idea.  But I spoke with a machine learning expert and heard some
> interesting things on this topic.
>
> We held a small workshop yesterday in which she explained Logistic
> Regression and how it might be applied to automated rescoring of
> spamassassin's rules.  The most intriguing aspect of her explanation was the
> suggestion of using a logarithmic function in weight scoring.  I asked
> specifically about this issue of overlap (like BRBL_LASTEXT with every other
> list) and she suggested this particular method of rescoring wouldn't have an
> issue with overlap.
>
> I believe you mentioned logarithmic scoring in an earlier discussion?
>
> It appears that we have a few very smart people interested in implementing
> an alternative rescorer using Logistic Regression.  We plan on using an
> existing library for the bulk of this implementation.
>
> I think we should proceed with our current generated scores for 3.3.0. After
> that we can compare the effectiveness of different approaches including your
> proposal.
>
> Specifically on the issue of overlapping DNSBLs, there might be a few
> possibilities:
>
> * Overlapping DNSBLs really are no problem with any method of scoring.
> * Overlapping DNSBLs are only a slight problem with any method of scoring,
> but if a host is blacklisted by more than one major DNSBL they have
> serious issues they need to fix and we shouldn't try to work around that
> for their benefit.
> * Overlapping DNSBLs are a real problem, but logarithmic scoring avoids it
> as an issue.
>
> rulesrc/sandbox/jm/20_bug_5984.cf:# score RCVD_IN_BRBL_LASTEXT 2.0
>
> This apparently was set manually.  It appears that spamassassin-3.2.x was
> not scored when BRBL existed as a rule.  Meanwhile our new GA scores
> resulted in:
>
> score RCVD_IN_BRBL_LASTEXT 0 1.644 0 1.449 # n=0 n=2
>
> This is relatively modest.  This combined with one other DNSBL alone will
> not push it clearly above 5 points.  I might suggest manually adjusting down
> BRBL or PBL so it requires one additional tiny score to push it over the
> edge.  I'm personally comfortable enough to outright reject mail from a
> Spamhaus listed host.  Given this bias, it is sufficiently cautious in my
> book to accept PBL + BRBL as insufficient.
>
> Warren Togami
> wtogami [at] redhat
>
>



--
--j.
