
wbh at conducive
Jun 26, 2009, 8:36 AM
Post #2 of 3
Daniel Tiefnig wrote:
> Hej,
>
> I am thinking of a clever way to integrate spamassassin's sa-learn
> (bayesian classifier training program) into exim's ACLs. The intended
> approach is to pass the message which should be trained as a
> "message/rfc822" attachment (so original headers are preserved) to a
> specific address (e.g. sa-learn-spam[at]domain) at the server.
>
> Therefore, the first thing I was looking at was the smtp_mime ACL, but
> it doesn't seem to be of much use besides filtering for regular
> expressions. If the "malware" condition would be allowed, I could pass
> the attachment via a "cmdline" scanner to sa-learn, but according to the
> docs this isn't possible.
>
> It is of course possible to pass the whole message to a script in the
> data ACL or in a transport and demime everything in the script, but I
> don't really like that much.
> Another approach would be to demime into unique files in the mime ACL
> and read these files in a scanner/delivery script, but that's even
> worse, IMO.
>
> I'm sure people are using spamassassin a lot out there, so can anyone
> here show a smarter way of integration spam/ham learning? (Without using
> spamc or sth. else from the user side.)
>
> TIA & br,
> daniel
>

CAVEAT: Take this as a 'contrarian' observation w/r/t auto-learning and
local server spam/ham classifying in general.

- IMNSHO, trying to 'learn' spam/ham discrimination on a mixed-user
server has two drawbacks:

-- It uses a great deal of machine resources compared to a multitude of
simpler and more repeatable/predictable means of filtering.

-- It can be confused by per-user differences, not only as to what one
user considers spam and another does not (quasi-legit adverts,
supermarket, bookstore, airline and travel 'bargains' etc), but also as
to the very nature of the traffic different users expect (active in
social networks, retired vs active business contacts, family & friends
vs professionals, et al).

So Spam-Bayes and friends can easily get it 'wrong' if applied
system-wide, yet may need even greater resources if they are to be
applied per-recipient - not easily done in the requisite DATA phase
anyway - at least not as to rejection vs mere demerit scoring.

Conversely, Bayesian filtering seems to be at its best when applied in
the end-user's MUA, where it is always 'per-recipient' specific, AND has
at least 'momentary' access to a generally greater chunk of processing
power than a server might be able to spare at busy times.

Next is the general 'need' to reinvent the classification anyway.

It might have a better payoff to utilize SA for all EXCEPT Bayesian /
'learning' / AWL, and add, for example, DSPAM, wherein a broader global
dataset of spam vs ham 'fingerprints' can be applied with less total
effort than developing your own on the fly.

Either way, our experience has been that there is more than enough
information available to identify the unwanted so as to not need either
SA's Spam-Bayes or DSPAM. Messages that evade interception by simpler
means are few enough to not justify the extra complexity - and
maintenance - otherwise required, even when plentiful machine resources
are on-tap.

Note the relatively modest scores SA assigns by default even when
SA-Spam-Bayes is used. Not really in the front lines of defense - though
one can, of course, make it such.

Finally, to the extent that all other filters are working well, AND
rejecting in-session, not just scoring and passing on, there can be a
scarcity of spam on which to train Bayesian filtering. Carrying such
traffic 'deeper' into DATA phase, so that Bayes can 'sniff' it to broaden
its dataset, also adds workload when it could have been rejected earlier.

After extensive tests, including saving folders full of known-spam for
training, we've given it up as too marginal to be useful (ditto
greylisting), and have now had Spam-Bayes switched OFF for many years.

As said, a 'contrarian' viewpoint, so YMMV.
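
For anyone who wants to try the quoted "pass the whole message to a
script and demime everything in the script" route anyway, here is a
minimal Python sketch of the script side only. It assumes the script is
handed the complete training mail as bytes (for example on stdin from a
pipe-style delivery), and that sa-learn is on the PATH and reads the
message from stdin when given --spam or --ham. The function names
extract_attached and learn are made up for illustration; nothing here is
taken from the exim or SpamAssassin distributions.

```python
import email
import subprocess


def extract_attached(raw):
    """Return the bytes of each top-level message/rfc822 attachment in raw."""
    def walk(part):
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the attached message; do not descend any further.
            yield part.get_payload(0).as_bytes()
        elif part.is_multipart():
            for sub in part.get_payload():
                yield from walk(sub)

    return list(walk(email.message_from_bytes(raw)))


def learn(raw, kind, run=subprocess.run):
    """Feed every attached message to sa-learn as 'spam' or 'ham'.

    `run` is injectable so the logic can be exercised without sa-learn.
    Returns the number of messages handed over.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    count = 0
    for inner in extract_attached(raw):
        # Assumption: sa-learn reads one message from stdin here.
        run(["sa-learn", "--" + kind], input=inner, check=True)
        count += 1
    return count
```

A wrapper would read sys.stdin.buffer and call learn(data, "spam") or
learn(data, "ham") depending on which training address received the
mail. Who is allowed to submit training mail is a separate question that
this sketch does not address.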

Bill

--
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/