
antispam at khopis
Apr 19, 2010, 8:41 PM
Post #6 of 7
On 04/17/2010 03:30 PM, Alex wrote:
> Some time ago you posted that you were investigating the stats and
> effectiveness of a few rules in your masschecks sandbox, and thought
> I would see if you had made any progress, and found anything
> helpful?

Yeah, analysis (and writing it up) is time-consuming and I was putting
it off. Here it is.

> On Mon, Nov 23, 2009 at 8:34 PM, Adam Katz <antispam [at] khopis> wrote:
>> Unless there are objections, I'm going to add two tests to my sandbox:
>>
>> RCVD_IN_NIX_SPAM, a new (to us) DNSBL populated by the same source as
>> the original [N]iXhash zone, with results on intra2net that look quite
>> promising: 72.98:0.12 spam:ham (PSBL has 48.69:0.36),
>> http://www.intra2net.com/ [...]

DateRev   SPAM%    HAM%    S/O    RANK  NAME
20091219  6.0855   0.0158  0.997  0.91  T_RCVD_IN_NIX_SPAM
20091226  6.6822   0.0171  0.997  0.91  T_RCVD_IN_NIX_SPAM
20100116  8.8194   0.0079  0.999  0.93  T_RCVD_IN_NIX_SPAM
20100123  9.6367   0.0060  0.999  0.94  T_RCVD_IN_NIX_SPAM

Here are all the results ruleqa was willing to yield. I've removed the
cases where there weren't about a million spams as the data for most
rules is non-representative. After January, ruleqa stopped evaluating
the rule (and RCVD_IN_SPAMCOP) altogether, so I'm not confident in the
results as they never leveled out.

Based on that performance, NiX performs quite well, but not at a level
to justify including in SA proper as it just creates too much DNS
traffic.

Jari Fredricksson's recent Top "Ten Rules" post to the list has
RCVD_IN_NIX_SPAM ranked 11th (he posted 20 rules, "Ten" was in the
thread name) with 72.29% spam versus 16% ham at 0.998 S/O (total
ham+spam corpus = 20293). Jari is in NE Europe, like this DNSBL's
spamtrap fodder. My company gets over 17.6% spam on Nix as well.

>> RCVD_IN_SPAMCOP, a fix-up of SpamCop to limit it to the last
>> external relay (just like every other DNSBL used by SpamAssassin).

This again only found four useful trials.
The results show that SpamCop is indeed a well-maintained DNSBL with a
very low FP rate, but it doesn't have the sheer volume of the others.

DateRev   SPAM%    HAM%    S/O    RANK  NAME
20091219  11.9204  0.0390  0.997  0.89  T_RCVD_IN_SPAMCOP
20091226  10.4777  0.0367  0.997  0.88  T_RCVD_IN_SPAMCOP
20100116  12.2375  0.0953  0.992  0.81  T_RCVD_IN_SPAMCOP
20100123  13.7493  0.0324  0.998  0.90  T_RCVD_IN_SPAMCOP

Compared to the full parsing of headers:

DateRev   SPAM%    HAM%    S/O    RANK  NAME
20091219  57.4236  1.8637  0.969  0.62  RCVD_IN_BL_SPAMCOP_NET
20091226  57.1671  1.7706  0.970  0.62  RCVD_IN_BL_SPAMCOP_NET
20100116  58.6552  1.7156  0.972  0.62  RCVD_IN_BL_SPAMCOP_NET
20100123  59.0184  1.6012  0.974  0.62  RCVD_IN_BL_SPAMCOP_NET

... it would be a shame to strike spamcop, but it doesn't really seem
like much of a player (because it doesn't use spamtraps). In fact, its
lack of spamtraps suggests keeping it because it's capable of listing
spammers that successfully avoid spamtraps. Maybe I'll open a bug to
use the lastexternal version instead of the current one.

>> While digging around there, I noticed that SpamCop and ham rule
>> RCVD_IN_BSP_TRUSTED are the only rules to use check_rbl_txt(),
>> which affords it a nicer explanation of what triggered the spam.
>> For a fully apples-to-apples comparison, my fix-up reverts back to
>> plain-old check_rbl() ... which unfortunately means a second DNS
>> lookup (since we're looking for an A record rather than a TXT
>> record).
>>
>> Both will be marked "nopublish" until we have stats to motivate
>> us.
>>
>> check_rbl_txt() gives quite informative data, and it's supported
>> by every DNSBL I've tried (all below). RCVD_IN_NIX_SPAM supports
>> it (though my test will avoid it until we can determine there isn't
>> a bug in lookups here), as do BRBL and others. Assuming a lack of
>> bugs or efficiency, we should probably use it for any index that
>> doesn't contain multiple indices (like zen).

I have no news on this front.
That was meant more as a question to the other developers. I suppose
the TXT data is more verbose and therefore eats more bandwidth, which
is why SA doesn't use it?
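
For readers unfamiliar with how the A-versus-TXT distinction plays out on the wire, here is a minimal Python sketch of how a DNSBL query name is usually formed: the IPv4 octets are reversed and prepended to the list's zone. The helper name `dnsbl_query_name` is hypothetical, and the zone `bl.spamcop.net` is an assumption inferred from the rule name RCVD_IN_BL_SPAMCOP_NET. The sketch only builds the name and does no network lookup.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Return the DNS name a client would query to check `ip` against `zone`.

    Hypothetical helper for illustration; not part of SpamAssassin.
    """
    # Validate the input; raises ValueError for anything that is not IPv4.
    addr = ipaddress.IPv4Address(ip)
    # DNSBLs index addresses with the octets reversed, as in-addr.arpa does.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone.rstrip('.')}"


if __name__ == "__main__":
    # 127.0.0.2 is the conventional "always listed" DNSBL test address.
    # The zone below is assumed from the rule name, not taken from the post.
    print(dnsbl_query_name("127.0.0.2", "bl.spamcop.net"))
```

The same name is queried in both cases. An A-record query for it answers only whether (and under which return code) the address is listed, while a TXT-record query for it returns the human-readable explanation, which is why falling back from a TXT-based check to an A-based one costs a separate lookup.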
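
As a sanity check on the tables above, the S/O column appears to be the spam share of a rule's hits, computed from the two percentage columns as SPAM% / (SPAM% + HAM%). That formula is an assumption on my part rather than something stated in the post, but it reproduces the reported values to three decimals. A small Python sketch, with the function name `s_o` being hypothetical:

```python
def s_o(spam_pct: float, ham_pct: float) -> float:
    """Spam share of hits, assuming S/O = SPAM% / (SPAM% + HAM%)."""
    return spam_pct / (spam_pct + ham_pct)


# (DateRev, rule name, SPAM%, HAM%, S/O as reported in the post)
rows = [
    ("20091219", "T_RCVD_IN_NIX_SPAM", 6.0855, 0.0158, 0.997),
    ("20100116", "T_RCVD_IN_NIX_SPAM", 8.8194, 0.0079, 0.999),
    ("20100116", "T_RCVD_IN_SPAMCOP", 12.2375, 0.0953, 0.992),
    ("20091219", "RCVD_IN_BL_SPAMCOP_NET", 57.4236, 1.8637, 0.969),
]

if __name__ == "__main__":
    for date_rev, name, spam_pct, ham_pct, reported in rows:
        computed = s_o(spam_pct, ham_pct)
        print(f"{date_rev} {name:<24} computed={computed:.3f} reported={reported:.3f}")
```

Seen this way, the lastexternal SpamCop variant trades a large drop in spam coverage (roughly 12% versus 57%) for a much cleaner S/O, which matches the low-FP, low-volume reading given in the post.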