
Mailing List Archive: SpamAssassin: users

iXhash with minimum size

 

 



antispam at khopis

Sep 26, 2009, 8:11 AM

Post #1 of 6 (458 views)
iXhash with minimum size

Karsten Bräckelmann wrote:
> > This is a plain RE rule I once wrote, to limit some rule to really short
> > messages only.
> >
> > rawbody __KB_RAWBODY_200 /^.{0,200}$/s

Warren Togami mused:
> I suspect meta limiting Adam's IXHASH rules with a minimum size subrule
> would eliminate many of the IXHASH false positives. I was using his
> IXHASH plugin for a while, but stopped because I noticed too many FP's
> on short e-mails. I wonder if his IXHASH plugin is suitable to put into
> the sandbox for actual statistical testing.

Quick note - iXhash isn't mine. The project is the brainchild of Dirk
Bonengel, http://dbonengel.users.sourceforge.net/#, who was inspired by
NiX Spam (by Bert Ungerer). The credits at http://ixhash.sf.net/ don't
actually mention Dirk (Dirk -- take credit!).

I merely wrote that meta rule to link the three of them together rather
than the more common approach of assigning points to each of them.
Combining that with Karsten's rawbody check (though I'm not sure what char
length threshold would be a good one), we'd get (please unwrap meta line):

meta IXHASH_CHECK __KB_RAWBODY_200 && (GENERIC_IXHASH ||
NIXSPAM_IXHASH || CTYME_IXHASH || HOSTEUROPE_IXHASH)
describe IXHASH_CHECK BODY: MD5 checksum matches known spam
score IXHASH_CHECK 0 2 0 2

--
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-Spam


jhardin at impsec

Sep 26, 2009, 9:27 AM

Post #2 of 6 (443 views)
Re: iXhash with minimum size [In reply to]

On Sat, 26 Sep 2009, Adam Katz wrote:

> Warren Togami mused:
>> I noticed too many FP's on short e-mails.
>
> Combining that with Karsten's rawbody check (though I'm not sure what char
> length threshold would be a good one), we'd get (please unwrap meta line):
>
> meta IXHASH_CHECK __KB_RAWBODY_200 && (GENERIC_IXHASH ||
> NIXSPAM_IXHASH || CTYME_IXHASH || HOSTEUROPE_IXHASH)

Shouldn't that be !__KB_RAWBODY_200 ?
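Unwrapped, and with the negation suggested here, the rules would look roughly like the fragment below. All rule names come from the thread. The 200-character threshold is only the example value from Karsten's sub-rule (Adam notes he is unsure what a good threshold would be), and the sketch assumes the rawbody sub-rule behaves as Karsten describes, i.e. that it fires only on really short messages.

```
# Sub-rule quoted earlier in the thread: intended to hit only very short bodies
rawbody  __KB_RAWBODY_200  /^.{0,200}$/s

# Negated, so the iXhash hits only count when the message is NOT that short
meta     IXHASH_CHECK  !__KB_RAWBODY_200 && (GENERIC_IXHASH || NIXSPAM_IXHASH || CTYME_IXHASH || HOSTEUROPE_IXHASH)
describe IXHASH_CHECK  BODY: MD5 checksum matches known spam
score    IXHASH_CHECK  0 2 0 2
```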

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin [at] impsec FALaholic #11174 pgpk -a jhardin [at] impsec
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
North Korea: the only country in the world where people would risk
execution to flee to communist China. -- Ride Fast
-----------------------------------------------------------------------
Approximately 8887200 firearms legally purchased in the U.S. this year


hege at hege

Sep 26, 2009, 10:08 AM

Post #3 of 6 (442 views)
Re: iXhash with minimum size [In reply to]

On Sat, Sep 26, 2009 at 11:11:05AM -0400, Adam Katz wrote:
> Karsten Bräckelmann wrote:
> > > This is a plain RE rule I once wrote, to limit some rule to really short
> > > messages only.
> > >
> > > rawbody __KB_RAWBODY_200 /^.{0,200}$/s
>
> Warren Togami mused:
> > I suspect meta limiting Adam's IXHASH rules with a minimum size subrule
> > would eliminate many of the IXHASH false positives. I was using his
> > IXHASH plugin for a while, but stopped because I noticed too many FP's
> > on short e-mails. I wonder if his IXHASH plugin is suitable to put into
> > the sandbox for actual statistical testing.
>
> Quick note - iXhash isn't mine. The project is the brainchild of Dirk
> Bonengel, http://dbonengel.users.sourceforge.net/#, who was inspired by
> NiX Spam (by Bert Ungerer). The credits at http://ixhash.sf.net/ don't
> actually mention Dirk (Dirk -- take credit!).

FYI..

Current iXhash has many bugs, which I noticed when I worked on my own
version with SA native DNS lookups.

One of the bigger problems of iXhash is probably of historical nature. There
is no decoding of messages (base64 etc).

Looking at method #1, which is supposed to apply on messages with 20
spaces and 2 newlines:

if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){

Since the two conditions are OR'd instead of AND'd (&&), it is enough for a
mail to contain just two newlines. Short base64 messages in particular are
hashed from little more than a few newlines and equal signs, so completely
different contents end up with the same hash.
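A minimal sketch of the effect described above, using the two conditions exactly as quoted (only the /g flag is dropped, since it just controls where a later match would resume). The sample body is a made-up two-line base64 string, not taken from a real mail.

```perl
use strict;
use warnings;

# Hypothetical sample: a short base64-encoded body, two lines, no spaces.
my $body = "SGVsbG8gd29ybGQ=\nSGVsbG8=\n";

# The two conditions from the quoted line, evaluated separately.
my $many_spaces  = ($body =~ /(?>\s.+?){20}/) ? 1 : 0;
my $two_newlines = ($body =~ /\n.*\n/)        ? 1 : 0;

# Combined the way the quoted code does it (||) and the way it was meant (&&).
my $with_or  = ($many_spaces || $two_newlines) ? 1 : 0;
my $with_and = ($many_spaces && $two_newlines) ? 1 : 0;

print "many_spaces=$many_spaces two_newlines=$two_newlines\n";
print "combined with ||: $with_or, combined with &&: $with_and\n";
```

For this sample the OR'd form is true and the AND'd form is false, so only the OR'd version would go on to hash the message with method #1.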

After I fixed this, hash #1 was rarely generated at all. The
/(?>\s.+?){20}/g clause seems to match only when there are 20 whitespaces
on the same line, which rarely happens. Changing it to /(?:\s.+?){20}/s
worked, but some foreign mails made the RE hang for tens of seconds, so I
rewrote it in a completely different way.
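The difference between the two patterns can be reproduced with a small test. Inside the atomic group (?>...), the lazy .+? is locked to a single character, so the pattern needs 20 back-to-back "whitespace plus one character" pairs. The non-atomic form lets .+? grow to the next whitespace. The counting check at the end is only an assumed linear-time alternative for illustration. It is not the code in iXhash2.pm, and it counts whitespace followed by a non-whitespace character, which is slightly stricter than the original pattern.

```perl
use strict;
use warnings;

# Hypothetical sample: 25 ordinary words separated by 24 single spaces.
my $body = join(' ', ('word') x 25);

# Atomic group: each repetition is fixed at one whitespace character plus
# exactly one following character, so ordinary prose does not match.
my $atomic = ($body =~ /(?>\s.+?){20}/) ? 1 : 0;

# Non-atomic group: the lazy .+? may be extended up to the next whitespace.
my $plain = ($body =~ /(?:\s.+?){20}/s) ? 1 : 0;

# Assumed alternative (not the iXhash2 code): count, in a single pass,
# whitespace characters that are followed by a non-whitespace character.
my $count   = () = $body =~ /\s\S/g;
my $counted = ($count >= 20) ? 1 : 0;

print "atomic=$atomic plain=$plain count=$count counted=$counted\n";
```

The counting form never backtracks across the body, so it avoids the kind of hang described above, at the cost of a slightly different definition of "20 spaces".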

If someone wants to have a look, here is my unofficial version. All the FPs
I got are practically gone.

http://sa.hege.li/iXhash2.pm
http://sa.hege.li/iXhash2.cf

I've let Dirk know about the bugs; we'll see what the future brings. Maybe a
real iXhash2 that actually does decoding etc. I'm sure there could be many
more enhancements, so I think this is a good time for many eyes to take a
serious look at the REs and methods! These bugs went unnoticed for quite a
long time..


per at computer

Sep 28, 2009, 4:20 AM

Post #4 of 6 (433 views)
Re: iXhash with minimum size [In reply to]

Henrik K wrote:

> Current iXhash has many bugs, which I noticed when I worked on my own
> version with SA native DNS lookups.
>
> One of the bigger problems of iXhash is probably of historical nature.
> There is no decoding of messages (base64 etc).
>
> Looking at method #1, which is supposed to apply on messages with 20
> spaces and 2 newlines:
>
> if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
>
> Since it's buggily OR'd instead of &&

I think you've got some old code there. My ixhash plugin has this line
instead:

if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {

> When I fixed this, for some reason hash #1 was rarely generated on a
> mail. It seems the /(?>\s.+?){20}/g clause seemed to match only when
> there are 20 whitespaces on the same line, which rarely happens.

On my test-system since 2009-09-28 00:00 I have

hash#1 - 1529 generated.
hash#2 - 7740 generated
hash#3 - 820 generated


/Per Jessen, Zürich


hege at hege

Sep 28, 2009, 5:44 AM

Post #5 of 6 (434 views)
Re: iXhash with minimum size [In reply to]

On Mon, Sep 28, 2009 at 12:21:29PM +0200, Per Jessen wrote:
> Henrik K wrote:
>
> > Current iXhash has many bugs, which I noticed when I worked on my own
> > version with SA native DNS lookups.
> >
> > One of the bigger problems of iXhash is probably of historical nature.
> > There is no decoding of messages (base64 etc).
> >
> > Looking at method #1, which is supposed to apply on messages with 20
> > spaces and 2 newlines:
> >
> > if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
> >
> > Since it's buggily OR'd instead of &&
>
> I think you've got some old code there. My ixhash plugin has this line
> instead:
>
> if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {

I only know of http://ixhash.sf.net/ which results in iXhash-1.5.5.zip.

So it seems you have an "unofficial" version, which really offers very
little improvement. Not only are those regexes extremely slow (I just
benchmarked), but the condition also happily hashes all base64 messages with
20 lines, which probably generates some nice FPs as well.
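The base64 point can be checked with a short sketch. It exercises only the first clause of the quoted condition, /([\s\t].+?){20}/, against a made-up body of 25 base64-looking lines that contains no spaces at all. The second clause is left out here on purpose. Because \s also matches a newline, the line breaks alone are enough to satisfy the "20 whitespaces" test.

```perl
use strict;
use warnings;

# Hypothetical sample: 25 lines of base64-looking text, no space characters.
my $body = join('', map { "QUJDREVGR0hJSktMTU5PUA==\n" } 1 .. 25);

# Number of literal space characters in the body.
my $spaces = () = $body =~ / /g;

# First clause of the quoted condition: 20 repetitions of a whitespace
# character followed by at least one more character.
my $first_clause = ($body =~ /([\s\t].+?){20}/) ? 1 : 0;

print "spaces=$spaces first_clause=$first_clause\n";
```

The clause matches even though the body has zero spaces, because each newline followed by the next line counts as one repetition.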

Cheers,
Henrik


per at computer

Sep 28, 2009, 6:20 AM

Post #6 of 6 (434 views)
Re: iXhash with minimum size [In reply to]

Henrik K wrote:

> On Mon, Sep 28, 2009 at 12:21:29PM +0200, Per Jessen wrote:
>> Henrik K wrote:
>>
>> > Current iXhash has many bugs, which I noticed when I worked on my
>> > own version with SA native DNS lookups.
>> >
>> > One of the bigger problems of iXhash is probably of historical
>> > nature. There is no decoding of messages (base64 etc).
>> >
>> > Looking at method #1, which is supposed to apply on messages with
>> > 20 spaces and 2 newlines:
>> >
>> > if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
>> >
>> > Since it's buggily OR'd instead of &&
>>
>> I think you've got some old code there. My ixhash plugin has this
>> line instead:
>>
>> if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {
>
> I only know of http://ixhash.sf.net/ which results in
> iXhash-1.5.5.zip.

I guess I have an older version with the correct '&&'. Interesting.


/Per Jessen, Zürich
