Mailing List Archive: exim: users

MIME parts and sa-learn


exim at sipwise

Jun 26, 2009, 6:14 AM

Post #1 of 3
MIME parts and sa-learn

Hej,

I am thinking of a clever way to integrate spamassassin's sa-learn
(bayesian classifier training program) into exim's ACLs. The intended
approach is to send the message to be learned as a "message/rfc822"
attachment (so the original headers are preserved) to a dedicated
address (e.g. sa-learn-spam[at]domain) on the server.

Therefore, the first thing I looked at was the MIME ACL
(acl_smtp_mime), but it doesn't seem to be of much use besides
filtering with regular expressions. If the "malware" condition were
allowed there, I could pass the attachment via a "cmdline" scanner to
sa-learn, but according to the docs this isn't possible.

It is of course possible to pass the whole message to a script in the
data ACL or in a transport and demime everything in the script, but I
don't really like that much.
Another approach would be to demime into unique files in the mime ACL
and read these files in a scanner/delivery script, but that's even
worse, IMO.
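
As a rough sketch of the script-based variant described above (the whole message is handed to a script, which does the MIME unpacking itself), something like the following Python could work. It assumes that sa-learn is on the PATH and accepts a message on stdin, and that the MTA pipes the mail to the script with "spam" or "ham" as its only argument. The function names and the restriction to top-level attachments are illustrative choices, not anything exim or spamassassin prescribes.

```python
#!/usr/bin/env python3
"""Sketch: feed message/rfc822 attachments of a mail on stdin to sa-learn."""
import email
import subprocess
import sys


def extract_attached_messages(raw):
    """Return every top-level message/rfc822 attachment in `raw` as bytes."""
    outer = email.message_from_bytes(raw)
    if not outer.is_multipart():
        return []
    attached = []
    for part in outer.get_payload():
        if part.get_content_type() != "message/rfc822":
            continue
        # The parser stores the embedded message as a one-element payload list.
        inner = part.get_payload()
        if isinstance(inner, list) and inner:
            attached.append(inner[0].as_bytes())
    return attached


def learn(raw, kind):
    """Pipe each attached message to `sa-learn --spam` or `sa-learn --ham`."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    messages = extract_attached_messages(raw)
    for message in messages:
        # Assumption: sa-learn is on PATH and reads the message from stdin.
        subprocess.run(["sa-learn", "--" + kind], input=message, check=True)
    return len(messages)


# Assumed invocation: the MTA pipes the mail to "<script> spam" or "<script> ham".
if __name__ == "__main__" and len(sys.argv) == 2:
    learn(sys.stdin.buffer.read(), sys.argv[1])
```

One way to hook this up would be a router for the sa-learn-spam and sa-learn-ham addresses that hands the message to a pipe transport running the script. Restricting who may send to those addresses is left out here and would be needed in practice.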

I'm sure people are using spamassassin a lot out there, so can anyone
here show a smarter way of integrating spam/ham learning? (Without using
spamc or something else on the user side.)

TIA & br,
daniel

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/


wbh at conducive

Jun 26, 2009, 8:36 AM

Post #2 of 3
Re: MIME parts and sa-learn

Daniel Tiefnig wrote:
> Hej,
>
> I am thinking of a clever way to integrate spamassassin's sa-learn
> (bayesian classifier training program) into exim's ACLs. The intended
> approach is to pass the message which should be trained as a
> "message/rfc822" attachment (so original headers are preserved) to a
> specific address (e.g. sa-learn-spam[at]domain) at the server.
>
> Therefore, the first thing I was looking at was the smtp_mime ACL, but
> it doesn't seem to be of much use besides filtering for regular
> expressions. If the "malware" condition would be allowed, I could pass
> the attachment via a "cmdline" scanner to sa-learn, but according to the
> docs this isn't possible.
>
> It is of course possible to pass the whole message to a script in the
> data ACL or in a transport and demime everything in the script, but I
> don't really like that much.
> Another approach would be to demime into unique files in the mime ACL
> and read these files in a scanner/delivery script, but that's even
> worse, IMO.
>
> I'm sure people are using spamassassin a lot out there, so can anyone
> here show a smarter way of integrating spam/ham learning? (Without using
> spamc or sth. else from the user side.)
>
> TIA & br,
> daniel
>

CAVEAT: Take this as a 'contrarian' observation w/r auto-learning and local
server spam/ham classifying in general.

- IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user server has
two drawbacks:

-- It uses a great deal of machine resources compared to a multitude of simpler
and more repeatable/predictable means of filtering.

-- It can be confused by per-user differences, not only as to what one user
considers spam and another does not (quasi-legit adverts, supermarket, bookstore,
airline and travel 'bargains' etc), but also as to the very nature of the traffic
different users expect (active in social networks, retired vs active business
contacts, family & friends vs professionals, et al).

So Spam-Bayes and friends can easily get it 'wrong' if applied system-wide, yet
may need even greater resources if they are to be applied per-recipient - not
easily done in the requisite DATA phase anyway - at least not as to rejection vs
mere demerit scoring.
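
To make the per-user point concrete, here is a toy calculation. It is a deliberately simplified single-token Bayes estimate with equal class priors, not SpamAssassin's actual algorithm, and all counts are invented for illustration: user A treats 'bargain' mailings as spam, user B wants them, and the pooled database mixes both.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """P(spam | token) by Bayes' rule, assuming equal priors for spam and ham."""
    p_token_given_spam = spam_hits / n_spam
    p_token_given_ham = ham_hits / n_ham
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


# Invented training counts for the token "bargain" (100 spam, 100 ham per user).
user_a = spam_probability(spam_hits=40, ham_hits=1, n_spam=100, n_ham=100)
user_b = spam_probability(spam_hits=2, ham_hits=30, n_spam=100, n_ham=100)

# System-wide database: both users' training mail pooled together.
pooled = spam_probability(spam_hits=42, ham_hits=31, n_spam=200, n_ham=200)

print(f"user A: {user_a:.2f}  user B: {user_b:.2f}  pooled: {pooled:.2f}")
```

With these made-up numbers the token is a strong spam sign for A (about 0.98), a strong ham sign for B (about 0.06), and close to uninformative in the pooled database (about 0.58).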

Conversely, Bayesian filtering seems to be at its best when applied in the
end-user's MUA, where it is always 'per-recipient' specific, AND has at
least 'momentary' access to a generally greater chunk of processing power than a
server might be able to spare at busy times.

Next is the general 'need' to reinvent the classification anyway. It might have
a better payoff to utilize SA for all EXCEPT Bayesian / 'learning'/ AWL, and
add, for example, DSPAM, wherein a broader global dataset of spam vs ham
'fingerprints' can be applied with less total effort than developing your own on
the fly.

Either way, our experience has been that there is more than enough information
available to identify the unwanted so as to not need either SA's Spam-Bayes or
DSPAM.

Messages that evade interception by simpler means are few enough to not justify
the extra complexity - and maintenance - otherwise required, even when plentiful
machine resources are on-tap.

Note the relatively modest scores SA assigns by default even when SA-Spam-Bayes
is used. Not really in the front lines of defense - though one can, of course,
make it such.

Finally, to the extent that all other filters are working well, AND rejecting
in-session, not just scoring and passing on, there can be a scarcity of spam on
which to train Bayesian filtering. Carrying such traffic 'deeper' into DATA
phase, so that Bayes can 'sniff' it to broaden its dataset, also adds workload
when it could have been rejected earlier.

After extensive tests, including saving folders full of known-spam for training,
we've given it up as too marginal to be useful (ditto greylisting), and have
now had Spam-Bayes switched OFF for many years.

As said, a 'contrarian' viewpoint, so YMMV.

Bill



exim at sipwise

Jun 28, 2009, 3:59 PM

Post #3 of 3
Re: MIME parts and sa-learn

W B Hacker wrote:
> - IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user
> server has two drawbacks:
>
> -- It uses a great deal of machine resources compared to a multitude
> of simpler and more repeatable/predictable means of filtering.

Which there are ...

> -- it can be confused by per-user differences, not only as to what
> one user considers spam and another does not

I didn't write that I'd like to share bayes data between users ...
Although I will. ;-)
But my user base is very small (i.e. only a few employees) and
close-knit. I wouldn't do that on a large-scale server with lots of
different users. (Like the ISP systems I used to administer a lot, but
no longer do.) And finally, I'm still experimenting a bit, and will
disable bayes classifying if it doesn't work out well.

> So Spam-Bayes and friends can easily get it 'wrong' if applied
> system-wide, yet may need even greater resources if they are to be
> applied per-recipient - not easily done in the requisite DATA phase
> anyway - at least not as to rejection vs mere demerit scoring.

Well, this is a problem of content based filtering in general.

> Conversely, Bayesian filtering seems to be at its best when applied
> in the end-user's MUA, where there it is always 'per-recipient'
> specific, AND has at least 'momentary' access to a generally greater
> chunk of processing power than a server might be able to spare at
> busy times.

Sure, but client-side filtering has other drawbacks, e.g. when using
several different clients like laptops, workstations, mobile phones ...
Generally, it defeats the idea of IMAP-based mail access.

> Next is the general 'need' to reinvent the classification anyway. It
> might have a better payoff to utilize SA for all EXCEPT Bayesian /
> 'learning'/ AWL, and add, for example, DSPAM, wherein a broader
> global dataset of spam vs ham 'fingerprints' can be applied with less
> total effort than developing your own on the fly.

Thanks for the hint. I tried dspam a few years ago and liked it very
much, but for the moment I don't want to maintain another spam filter
besides spamassassin. At some point I might decide to replace
spamassassin with dspam altogether, but I won't run two different
scanners that produce contradicting results.

> As said, a 'contrarian' viewpoint, so YMMV.

Thank you, your thoughts are much appreciated.

br,
daniel
