
Mailing List Archive: SpamAssassin: devel

[Bug 6155] generate new scores for 3.3.0 release

 

 



bugzilla-daemon at bugzilla

Oct 22, 2009, 1:47 PM

[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #123 from Adam Katz <antispam [at] khopis> 2009-10-22 13:47:40 UTC ---
(In reply to comment #122)
sorry, that should be:

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' < rules.txt); do grep "^[^#]* $rule " /tmp/50_scores_newest.cf || echo "score $rule UNKNOWN"; done

With each of those two stanzas living on just one line.

Obviously, ignore the genuine ham rules.
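
For readers without elinks at hand, the lookup step of the second stanza can be sketched in Python. This is only an illustration: the name `lookup_scores` and the sample data are made up here, and the matching simply mirrors the `grep "^[^#]* $rule "` pattern above.

```python
import re

def lookup_scores(rules, scores_text):
    """For each rule name, return the matching uncommented lines
    from a scores file, or a 'score RULE UNKNOWN' placeholder."""
    lines = scores_text.splitlines()
    out = []
    for rule in rules:
        # Same idea as grep "^[^#]* $rule ": the rule name, surrounded by
        # spaces, with no '#' anywhere before it on the line.
        pat = re.compile(r"^[^#]* " + re.escape(rule) + r" ")
        hits = [ln for ln in lines if pat.search(ln)]
        out.extend(hits if hits else [f"score {rule} UNKNOWN"])
    return out

# Hypothetical sample standing in for /tmp/50_scores_newest.cf
sample = """\
score HTML_MESSAGE 0.001
# score OLD_RULE 1.0
score RDNS_NONE 0 1.1 0 0.7
"""
for line in lookup_scores(["RDNS_NONE", "OLD_RULE", "NOT_THERE"], sample):
    print(line)
```

Commented-out score lines are skipped, just as with the grep pattern, so a rule that only appears in a comment is reported as UNKNOWN.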

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


bugzilla-daemon at bugzilla

Oct 26, 2009, 7:49 AM


Mark Martinec <Mark.Martinec [at] ijs> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #4542 is|0                           |1
           obsolete|                            |
Attachment #4553 is|0                           |1
           obsolete|                            |

--- Comment #124 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 07:49:13 UTC ---
Created an attachment (id=4558)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558)
resulting 50_scores.cf from garescorer runs - V3

Attached is the latest 50_scores.cf file, obtained in a couple of iterations
during the last few days. It takes into account the updated results files
from the rsync submit area, in particular the updated net-wt* (Comment 99,
102, 103), and net-hege* files. The binnocenti* are still excluded.
The rest of the corpora tweaks/decimation are as per my previous run, Comment 96.

The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
otherwise the _MED stands out above the _HI due to its significantly higher
hit rate.

The KB_RATWARE_OUTLOOK_08, KB_RATWARE_OUTLOOK_12, KB_RATWARE_OUTLOOK_16
and KB_RATWARE_BOUNDARY were now zeroed-out according to Comment 115.

I tried leaving RDNS_NONE and RDNS_DYNAMIC floating (Comment 116, 120, 122),
and it seems to me the obtained score is perfectly sensible and useful,
and still not so high as to punish incompetent admins too hard:
score RDNS_NONE 0 1.1 0 0.7
score RDNS_DYNAMIC 0 0.5 0 0.5
so I'm leaving these floating.

According to Comment 122 I zeroed out (actually, 0.001'd out) the
HTML_MESSAGE, MIME_QP_LONG_LINE, FREEMAIL_FROM, TVD_SPACE_RATIO,
and MSGID_MULTIPLE_AT.

Some further tweaks: I reduced the BAYES scores somewhat (e.g. from 4.5
to 3.5 for BAYES_99 scoreset3) and tamed down BAYES_50, which was
standing out from the crowd.

For DCC_* rules I used the already described approach: obtain DCC_CHECK score
from a GA run with all DCC_REPUT_* zeroed-out, then fix the obtained DCC_CHECK,
and let DCC_REPUT_* float for the final run.

The NML_ADSP_CUSTOM_MED was obtained from a GA run, but the others (_LOW, _HIGH)
were set manually (currently no hits). The DKIM_ADSP_ALL, DKIM_ADSP_DISCARD,
and DKIM_ADSP_NXDOMAIN are based on GA runs, but hand-tweaked somewhat due
to inconsistencies between corpora.

A word about JM_SOUGHT_FRAUD_{1,2,3}: these three rules come out from
a GA run with scores between 2 and 3, but are somewhat inconsistent
between runs and corpora. As requested by Comment 38 their scores
were fixed at zero for the final run, but I'd set these manually
to 2.2 each for the published 50_scores.cf.

After preparing my manual fixes from a couple of trial runs, I made a
final run for each scoreset with these fixed scores, so as to allow other
scores to adjust themselves to the new constraints.

So here are the manual fixes:

score SPF_PASS -0.001
score SPF_HELO_PASS -0.001

score BAYES_00 0 0 -1.2 -1.9
score BAYES_05 0 0 -0.2 -0.5
score BAYES_20 0 0 -0.001 -0.001
score BAYES_40 0 0 -0.001 -0.001
score BAYES_50 0 0 2.0 0.8
score BAYES_60 0 0 2.5 1.5
score BAYES_80 0 0 2.7 2.0
score BAYES_95 0 0 3.2 3.0
score BAYES_99 0 0 3.8 3.5

score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8

score HTML_MESSAGE 0.001
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001
score NO_RECEIVED -0.001
score NO_HEADERS_MESSAGE 0.001

score DKIM_ADSP_ALL 0 1.1 0 0.8
score DKIM_ADSP_DISCARD 0 1.8 0 1.8
score DKIM_ADSP_NXDOMAIN 0 0.8 0 0.9
score NML_ADSP_CUSTOM_LOW 0 0.7 0 0.7
score NML_ADSP_CUSTOM_MED 0 1.2 0 0.9
score NML_ADSP_CUSTOM_HIGH 0 2.6 0 2.5

score JM_SOUGHT_FRAUD_1 0
score JM_SOUGHT_FRAUD_2 0
score JM_SOUGHT_FRAUD_3 0

score MIME_QP_LONG_LINE 0.001
score FREEMAIL_FROM 0.001
score TVD_SPACE_RATIO 0.001
score MSGID_MULTIPLE_AT 0.001
score EXTRA_MPART_TYPE 1.0
score RDNS_NONE 0 1.1 0 0.7
score RDNS_DYNAMIC 0 0.5 0 0.5

score KB_RATWARE_OUTLOOK_08 0
score KB_RATWARE_OUTLOOK_12 0
score KB_RATWARE_OUTLOOK_16 0
score KB_RATWARE_BOUNDARY 0
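
A note on the notation for readers: a `score` line carries either one value, which applies to all four score sets, or four values in the order set 0 (no net, no bayes), set 1 (net, no bayes), set 2 (no net, bayes), set 3 (net, bayes). A minimal Python sketch of that expansion follows; `parse_score_line` is a name invented for this illustration and handles only the one-value and four-value forms that appear in this thread.

```python
def parse_score_line(line):
    """Return (rule_name, [set0, set1, set2, set3]) for a 'score' line."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "score":
        raise ValueError(f"not a score line: {line!r}")
    name, values = fields[1], [float(v) for v in fields[2:]]
    if len(values) == 1:
        values = values * 4          # one value applies to every score set
    elif len(values) != 4:
        raise ValueError(f"expected 1 or 4 values, got {len(values)}")
    return name, values

name, sets = parse_score_line("score BAYES_99 0 0 3.8 3.5")
print(name, "score set 3 (net, bayes):", sets[3])
```

Read this way, the zeros in the first two columns of the BAYES_* lines simply mean the rule contributes nothing in the score sets that run without Bayes.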



bugzilla-daemon at bugzilla

Oct 26, 2009, 8:01 AM


--- Comment #125 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 08:00:59 UTC ---
$ head test scores

=================================
score set 3 (net, bayes) - gen-set3-20-5.0-12200-ga

test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21172 99.93%
# Correctly spam: 43597 98.78%
# False positives: 14 0.07%
# False negatives: 537 1.22%
# TCR(l=50): 35.678254 SpamRecall: 98.783% SpamPrec: 99.968%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168143 32.193% (99.979% of non-spam corpus)
# Correctly spam: 349734 66.961% (98.763% of spam corpus)
# False positives: 36 0.007% (0.021% of nonspam, 8360 weighted)
# False negatives: 4382 0.839% (1.237% of spam, 14401 weighted)
# Average score for spam: 21.1 nonspam: -2.2
# Average for false-pos: 5.5 false-neg: 3.3
# TOTAL: 522295 100.00%

=================================
score set 2 (no net, bayes) - gen-set2-10-5.0-12200-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21148 99.82%
# Correctly spam: 41172 93.29%
# False positives: 38 0.18%
# False negatives: 2962 6.71%
# TCR(l=50): 9.077334 SpamRecall: 93.289% SpamPrec: 99.908%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167953 32.157% (99.866% of non-spam corpus)
# Correctly spam: 329931 63.169% (93.170% of spam corpus)
# False positives: 226 0.043% (0.134% of nonspam, 26882 weighted)
# False negatives: 24185 4.631% (6.830% of spam, 89229 weighted)
# Average score for spam: 10.8 nonspam: -0.7
# Average for false-pos: 5.6 false-neg: 3.7
# TOTAL: 522295 100.00%

=================================
score set 1 (net, no bayes) - gen-set1-10-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21155 99.85%
# Correctly spam: 43153 97.78%
# False positives: 31 0.15%
# False negatives: 981 2.22%
# TCR(l=50): 17.437377 SpamRecall: 97.777% SpamPrec: 99.928%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168012 32.168% (99.901% of non-spam corpus)
# Correctly spam: 346216 66.287% (97.769% of spam corpus)
# False positives: 167 0.032% (0.099% of nonspam, 20194 weighted)
# False negatives: 7900 1.513% (2.231% of spam, 23052 weighted)
# Average score for spam: 19.8 nonspam: -0.5
# Average for false-pos: 5.7 false-neg: 2.9
# TOTAL: 522295 100.00%

=================================
score set 0 (no net, no bayes) - gen-set0-5-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 20919 98.74%
# Correctly spam: 34081 77.22%
# False positives: 267 1.26%
# False negatives: 10053 22.78%
# TCR(l=50): 1.885827 SpamRecall: 77.222% SpamPrec: 99.223%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166261 31.833% (98.860% of non-spam corpus)
# Correctly spam: 271409 51.965% (76.644% of spam corpus)
# False positives: 1918 0.367% (1.140% of nonspam, 126535 weighted)
# False negatives: 82707 15.835% (23.356% of spam, 235514 weighted)
# Average score for spam: 10.4 nonspam: 0.6
# Average for false-pos: 6.3 false-neg: 2.8
# TOTAL: 522295 100.00%

=================================




In summary:
set 3
# False positives: 36 (0.021% of nonspam)
# False negatives: 4382 (1.237% of spam)

set 2
# False positives: 226 (0.134% of nonspam)
# False negatives: 24185 (6.830% of spam)

set 1
# False positives: 167 (0.099% of nonspam)
# False negatives: 7900 (2.231% of spam)

set 0
# False positives: 1918 (1.140% of nonspam)
# False negatives: 82707 (23.356% of spam)
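
The TCR, SpamRecall and SpamPrec figures in the test summaries can be reproduced from the raw counts. The sketch below assumes the usual total-cost-ratio definition, TCR = nspam / (lambda * FP + FN) with lambda = 50, which agrees with the set 3 and set 2 test numbers quoted above; the function name is made up for this note.

```python
def summary_stats(true_pos, false_pos, false_neg, lam=50):
    """Return (TCR, spam recall %, spam precision %) from the counts.

    true_pos  = correctly classified spam
    false_pos = ham classified as spam
    false_neg = spam classified as ham
    """
    nspam = true_pos + false_neg
    tcr = nspam / (lam * false_pos + false_neg)
    recall = 100.0 * true_pos / nspam
    precision = 100.0 * true_pos / (true_pos + false_pos)
    return tcr, recall, precision

# Score set 3 test figures from the summary above.
tcr, recall, prec = summary_stats(43597, 14, 537)
print(f"TCR(l=50): {tcr:.6f} SpamRecall: {recall:.3f}% SpamPrec: {prec:.3f}%")
```

Because each false positive is weighted 50 times as heavily as a false negative, the jump from 14 to 38 false positives between set 3 and set 2 costs far more TCR than the raw counts suggest.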



bugzilla-daemon at bugzilla

Oct 26, 2009, 8:08 AM


--- Comment #126 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 08:08:22 UTC ---
Created an attachment (id=4559)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4559)
freqs.full of corpora used for score set 3 and 2 runs



bugzilla-daemon at bugzilla

Oct 26, 2009, 8:09 AM


--- Comment #127 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 08:09:26 UTC ---
Created an attachment (id=4560)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4560)
ranges.data on corpora used for score set 3 and 2 runs



bugzilla-daemon at bugzilla

Oct 26, 2009, 9:57 AM


--- Comment #128 from Karsten Bräckelmann <guenther [at] rudersport> 2009-10-26 09:57:28 UTC ---
(In reply to comment #124)
> Created an attachment (id=4558)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558) [details]
> resulting 50_scores.cf from garescorer runs - V3

Now I am getting really nervous. :-/ From the scores:

score KB_DATE_CONTAINS_TAB 3.799 3.799 3.315 2.871
score KB_FAKED_THE_BAT 1.447 2.273 2.452 3.799

The bad thing about this is that onet.pl / onet.eu (a Polish free-mailer
AFAIK) actually munges the header, and injects the tab into the Date header on
their outgoing SMTP servers. Apparently, they do that harm to all outgoing
mail, not limited to their web-mailer.

It is a very, very stupid thing for them to do, to munge MUA generated headers
like that, but still they appear to do it. :( That means their customers will
really be punished, and using them *and* The Bat! is a killer.

FWIW, I once wrote these to counter a flood of low-scorers -- but the above
scores are scaring me. This is quite bad.



bugzilla-daemon at bugzilla

Oct 26, 2009, 10:37 AM


--- Comment #129 from Matthias Leisi <matthias [at] leisi> 2009-10-26 10:36:56 UTC ---
(In reply to comment #124)

> The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> otherwise the _MED stands out above the _HI due to its significantly higher
> hit rate.
> [..]
>
> score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8

Is there a particular reason why these are so much different from those in
https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8



bugzilla-daemon at bugzilla

Oct 26, 2009, 11:03 AM


--- Comment #130 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 11:03:28 UTC ---
> > The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> > otherwise the _MED stands out above the _HI due to its significantly higher
> > hit rate.
> > score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> > score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> > score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
>
> Is there a particular reason why these are so much different from those in
> https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:
>
> | score RCVD_IN_DNSWL_LOW 0 -1 0 -1
> | score RCVD_IN_DNSWL_MED 0 -4 0 -4
> | score RCVD_IN_DNSWL_HI 0 -8 0 -8

The -1/-4/-8 were manually provided (don't know the background on this
decision).

The RCVD_IN_DNSWL_MED in my GA results was obtained automatically, and the
other two were manually adjusted to make some sense compared to _MED.
Btw, the GA results on scoreset 3 from one of my previous runs were:
RCVD_IN_DNSWL_LOW -2.761
RCVD_IN_DNSWL_MED -0.999
RCVD_IN_DNSWL_HI -0.966



bugzilla-daemon at bugzilla

Oct 26, 2009, 11:36 AM


--- Comment #131 from Matthias Leisi <matthias [at] leisi> 2009-10-26 11:36:22 UTC ---
(In reply to comment #130)

> The -1/-4/-8 were manually provided (don't know the background on this
> decision).

Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc)
have the same scores as in the previous 50_scores.cf.

I was wondering why the dnswl.org rules have specifically lower scores than in
previous versions - and extremely low scores. This is worrying me, as it would
indicate we have a quality issue in the dnswl.org data.



bugzilla-daemon at bugzilla

Oct 26, 2009, 12:26 PM


--- Comment #132 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-26 12:26:49 UTC ---
> Other whitelisting rules (HABEAS_*, RCVD_IN_IADB_*, RCVD_IN_BSP_TRUSTED etc)
> have the same scores as in the previous 50_scores.cf.

They do not have the same scores; it seems to me they are mostly
much lower. Please ignore the comments in 50_scores_newest3.cf,
just take into account uncommented 'score' lines:

score HABEAS_ACCREDITED_COI 0
score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
score RCVD_IN_IADB_DOPTIN 0
score RCVD_IN_IADB_DOPTIN_GT50 0
score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
score RCVD_IN_IADB_EDDB 0
score RCVD_IN_IADB_EPIA 0
score RCVD_IN_IADB_GOODMAIL 0
score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
score RCVD_IN_IADB_LOOSE 0
score RCVD_IN_IADB_MI_CPEAR 0
score RCVD_IN_IADB_MI_CPR_30 0
score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
score RCVD_IN_IADB_ML_DOPTIN 0
score RCVD_IN_IADB_NOCONTROL 0
score RCVD_IN_IADB_OOO 0
score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
score RCVD_IN_IADB_OPTIN_LT50 0
score RCVD_IN_IADB_OPTOUTONLY 0
score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
score RCVD_IN_IADB_UNVERIFIED_1 0
score RCVD_IN_IADB_UNVERIFIED_2 0
score RCVD_IN_IADB_UT_CPEAR 0
score RCVD_IN_IADB_UT_CPR_30 0
score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956

score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8


> I was wondering why the dnswl.org rules have specifically lower scores than in
> previous versions - and extremely low scores. This is worrying me, as it would
> indicate we have a quality issue in the dnswl.org data.

These all have pretty low rank:

$ grep RCVD_IN_DNSWL_ freqs.full
OVERALL    SPAM%     HAM%    S/O   RANK  SCORE  NAME
  0.184   0.0005   0.5708  0.001   0.76  -1.80  RCVD_IN_DNSWL_HI
  7.410   0.1094  22.7527  0.005   0.67  -1.20  RCVD_IN_DNSWL_MED
  2.551   0.1810   7.5322  0.023   0.59  -1.10  RCVD_IN_DNSWL_LOW

The _HI gets a low automatic score probably because it hits very little mail,
so it likely needs manual tweaking. The _MED seems to hit too many spam
messages in the submitted logs for rescoring runs, or perhaps it has a high
overlap with other similar rules.
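
For reference, the S/O column can be recomputed from the SPAM% and HAM% columns as spam% / (spam% + ham%); this reading reproduces the three rows above to the three decimals shown. The helper name below is an invention for this note.

```python
def s_over_o(spam_pct, ham_pct):
    """S/O: share of a rule's hits (by corpus percentage) that are spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0

rows = [
    ("RCVD_IN_DNSWL_HI", 0.0005, 0.5708),
    ("RCVD_IN_DNSWL_MED", 0.1094, 22.7527),
    ("RCVD_IN_DNSWL_LOW", 0.1810, 7.5322),
]
for name, spam, ham in rows:
    print(f"{s_over_o(spam, ham):.3f} {name}")
```

An S/O near 0 means nearly all hits are ham, so by this measure all three rules separate ham from spam well; the low GA scores therefore stem from hit rate and overlap rather than from a poor S/O.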

It is quite possible that some of these hits are still false positives,
despite several iterations of cleaning:

for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
wc -l; done | sort -k2nr

spam-bayes-net-bb-jhardin.log 3
spam-bayes-net-bb-kmcgrail.log 2
spam-bayes-net-bb-guenther_fraud.log 1
spam-bayes-net-hege.log 1

same on _MED:

spam-bayes-net-bluestreak.log 381
spam-bayes-net-hege.log 79
spam-bayes-net-bb-jhardin.log 23
spam-bayes-net-wt-en1.log 15
spam-bayes-net-bb-kmcgrail.log 14
spam-bayes-net-jm-decimated.log 11
spam-bayes-net-ahenry.log 9
spam-bayes-net-dos-decimated.log 6
spam-bayes-net-bb-zmi.log 3
spam-bayes-net-mmartinec.log 3
spam-bayes-net-wt-en4.log 2



bugzilla-daemon at bugzilla

Oct 26, 2009, 1:52 PM


--- Comment #133 from Justin Mason <jm [at] jmason> 2009-10-26 13:51:54 UTC ---
strange, some of the more trustworthy BLs are very low scoring.

RCVD_IN_XBL: 0.404 and 0.722

these have been effectively zeroed, although are supposed to be immutable:
RCVD_IN_SSC_TRUSTED_COI is 0 (with a 0.012 S/O, low hit rate though)
HABEAS_ACCREDITED_COI is 0 (ditto)
RCVD_IN_BSP_TRUSTED is -0.001 (although with a 0.002 S/O)

the HASHCASH rules likewise aren't supposed to be mutable.

it looks like there might be a bit of a problem there -- definitely some rules
that are in immutable sections, like the above, have been allowed to be mutable
in ranges.data....



bugzilla-daemon at bugzilla

Oct 26, 2009, 2:31 PM


--- Comment #134 from John Hardin <jhardin [at] impsec> 2009-10-26 14:31:20 UTC ---
(In reply to comment #132)

> $ grep RCVD_IN_DNSWL_ freqs.full
> OVERALL SPAM% HAM% S/O RANK SCORE NAME
> 0.184 0.0005 0.5708 0.001 0.76 -1.80 RCVD_IN_DNSWL_HI
> 7.410 0.1094 22.7527 0.005 0.67 -1.20 RCVD_IN_DNSWL_MED
> 2.551 0.1810 7.5322 0.023 0.59 -1.10 RCVD_IN_DNSWL_LOW
>
> It is quite possible that some of these hits are still false positives,
> despite several iterations of cleaning:
>
> for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
> wc -l; done | sort -k2nr
>
> spam-bayes-net-bb-jhardin.log 3
>
> same on _MED:
>
> spam-bayes-net-bb-jhardin.log 23

All but one of those are obvious spams, and I've removed the one questionable
one from my corpora.

Some of the spam in my corpora is from third parties. I do check it for correct
classification before uploading, but I was wondering: how does masscheck
determine the correct lastexternal for corpora containing messages from
multiple different networks? Or does it assume all of the messages in a given
contributor's corpora have the same network boundary? If the latter, I need to
remove those third-party messages from my spam corpora...

Might lastexternal confusion in the masschecks be contributing in some way to
the odd RCVD_IN_* score generation?



bugzilla-daemon at bugzilla

Oct 26, 2009, 4:28 PM


--- Comment #135 from Adam Katz <antispam [at] khopis> 2009-10-26 16:27:56 UTC ---
Created an attachment (id=4561)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4561)
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't
feel like learning LWP and then parsing HTML). I've attached the checker,
which can be run with custom parameters for a different ruleset, ham threshold,
or minimum difference for ham:spam ratio. Here's the current output, listing
all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham
corpus than of the spam corpus.

H^2/S       HAM%    SPAM%  Score in attachment 4558  Rule
331.9     0.3319   0.0010  0                         OBSCURED_EMAIL
117.4     4.8566   0.2009  -0.001                    SPF_HELO_PASS
88.52     5.5735   0.3509  -0.001                    SPF_PASS
85.61     0.2226   0.0026  0.000 2.099 0.001 1.212   MISSING_MIME_HB_SEP
76.18     0.7085   0.0093  0.001 0.001 0.699 0.699   TVD_RCVD_SPACE_BRACKET
66.19     0.2780   0.0042  1.145 1.542 1.912 2.400   FUZZY_CPILL
49.98     1.0676   0.0228  0.001                     MSGID_MULTIPLE_AT
31.82     0.1496   0.0047  1.494 1.699 1.591 1.516   X_IP
21.86     0.1465   0.0067  0                         SUBJECT_FUZZY_TION
20.40    15.6218  11.9604  0.001                     FREEMAIL_FROM
20.00*   40.9055  83.6301  0.001                     HTML_MESSAGE
17.10     0.1710        0  1.222 0.001 0.082 0.476   MIME_BOUND_DIGITS_15
12.95     0.0609   0.0047  0                         HTML_IFRAME_SRC
12.52     0.0714   0.0057  0                         FORGED_IMS_TAGS
11.56     0.0659   0.0057  0.001 0.001 0.605 0.378   HTML_NONELEMENT_30_40
10.83     0.1127   0.0104  0.033 0.001 0.365 0.413   WEIRD_PORT
10.18     0.3494   0.0343  2.205 0.174 1.299 1.806   FRT_SOMA2
9.721     0.8934   0.0919  1.499 0.419 0.904 0.798   MIME_BASE64_BLANKS
8.996     0.2474   0.0275  0.987 0.750 0.943 1.318   CTYPE_001C_B
8.918     0.1525   0.0171  0.001 2.499 0.268 0.516   DRUGS_MUSCLE
8.373     0.0829   0.0099  0.003 0.978 0.100 1.515   TVD_FW_GRAPHIC_NAME_LONG
8.016     0.1956   0.0244  0.001 0.020 0.001 1.799   MIME_BASE64_TEXT
6.850     0.0685        0  0                         HTML_NONELEMENT_40_50
5.404     0.5356   0.0991  0 1.200 0 2.514           SPF_HELO_FAIL
4.237     0.1585   0.0374  2.199 2.199 1.246 2.090   WEIRD_QUOTING
4.159     3.8908   3.6392  0.001                     MIME_QP_LONG_LINE
3.483     0.8570   0.2460  1.799 0.572 1.182 1.138   HTML_IMAGE_RATIO_06
3.219     1.2399   0.4775  1.0                       EXTRA_MPART_TYPE
2.913*   12.1047  50.2891  0 1.1 0 0.7               RDNS_NONE
2.839     0.1164   0.0410  0.001 2.185 1.936 0.476   FRT_SOMA
2.751     0.1172   0.0426  0.1                       ANY_BOUNCE_MESSAGE
2.417     0.6787   0.2808  0.539 0.001 0.332 0.488   MIME_HTML_MOSTLY
2.370     0.1010   0.0426  0.1                       BOUNCE_MESSAGE
2.078     0.5534   0.2663  1.899 0.496 0.950 0.445   HTML_IMAGE_RATIO_08
1.899     1.2077   0.7677  0.001                     TVD_SPACE_RATIO
1.726     0.3227   0.1869  0.023 0.887 0.000 0.417   UPPERCASE_50_75
1.517     0.9658   0.6364  2.801 2.080 1.780 3.387   DATE_IN_PAST_96_XX
1.269     0.4224   0.3327  0.000 0.001 0.264 0.001   HTML_FONT_SIZE_LARGE
1.151     0.5492   0.4770  2.260 0.742 1.199 0.640   MPART_ALT_DIFF
0.913*    1.8488   3.7425  1.154 1.677 1.198 1.453   SUBJ_ALL_CAPS
0.703*    1.3317   2.5216  0.001                     UNPARSEABLE_RELAY
0.278*    3.7480  50.4848  2.199 0.955 1.215 0.549   MIME_HTML_ONLY
0.121*    1.2540  12.9472  0 1.322 0 1.237           RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched >1% of the ham corpus but
matched a larger percent of the spam corpus while everything else matched a
larger percent of the ham corpus than the spam corpus.)

Mark's fixes solved the immediate issues raised earlier, so I decided to order
this by the ratio of percentage of ham corpus hit to percentage of spam corpus
hit, but that under-emphasized the ham hits, so I then multiplied that by the
ham percentage again (unless the percent was under 1). It's easy enough to
browse for non-zero ham% hits.

Any rule with a ratio over 1.000 is a problem when scored positively unless it
is exempted for applying to popular spam patterns that the corpus is known to
lack. For completeness, this list includes all tests that hit at least 1% of
the ham corpus (thus the presence of HTML_MESSAGE, RDNS_NONE, and the four
tests with ratios under 1.0).
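
The ordering metric described above can be written down compactly. The sketch below is a reading of the prose and the table rather than the attached script itself: the names `h2_over_s` and `selected`, and the 0.01 floor used when SPAM% is zero, are assumptions (the floor is inferred from rows such as MIME_BOUND_DIGITS_15, where a SPAM% of 0 still yields a finite value).

```python
def h2_over_s(ham_pct, spam_pct, spam_floor=0.01):
    """Ham:spam ratio, multiplied by ham% again when ham% is at least 1."""
    spam = spam_pct if spam_pct > 0 else spam_floor  # assumed floor for zero spam hits
    ratio = ham_pct / spam
    return ratio * ham_pct if ham_pct >= 1 else ratio

def selected(ham_pct, spam_pct, ham_threshold=1.0, min_diff=0.05):
    """Listed if the rule hits 1+% of ham, or 0.05% more of ham than of spam."""
    return ham_pct >= ham_threshold or ham_pct - spam_pct >= min_diff

for name, ham, spam in [("SPF_HELO_PASS", 4.8566, 0.2009),
                        ("OBSCURED_EMAIL", 0.3319, 0.0010),
                        ("RDNS_NONE", 12.1047, 50.2891)]:
    star = "*" if spam > ham else ""   # asterisk: hits more spam than ham
    print(f"{h2_over_s(ham, spam):.4g}{star} {name}")
```

The printed values may differ from the table in the last digit, since the table appears to truncate rather than round.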



bugzilla-daemon at bugzilla

Oct 27, 2009, 7:09 AM


--- Comment #136 from Justin Mason <jm [at] jmason> 2009-10-27 07:09:36 UTC ---
(In reply to comment #133)
> it looks like there might be a bit of a problem there -- definitely some rules
> that are in immutable sections, like the above, have been allowed to be mutable
> in ranges.data....

just wondering, Mark, did you do this deliberately? or is it just a bug in the
tool that it's ignoring the non-mutable flag for those rules for some reason?



bugzilla-daemon at bugzilla

Oct 27, 2009, 2:18 PM


--- Comment #137 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-27 14:18:14 UTC ---
> > it looks like there might be a bit of a problem there -- definitely some
> > rules that are in immutable sections, like the above, have been allowed
> > to be mutable in ranges.data....
>
> just wondering, Mark, did you do this deliberately? or is it just a bug
> in the tool that it's ignoring the non-mutable flag for those rules for
> some reason?

Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
section 4.2: 'comment out all "score" lines except for rules that you think
the scores are accurate like carefully-vetted net rules, or 0.001 informational
rules' which made perfect sense to me, so I did it for 50_scores.cf, except
for a couple of rather obvious rules like _WHITELIST and similar, and the ones
clearly indicated as 'indicators' only in the surrounding comments, or set to
0.001. Later I nailed a couple more. I followed a principle: when in doubt,
leave it floating, it can be fixed later if necessary. It gives some insight
into what GA 'thinks' about certain rules.

I think at least for some rules GA makes perfect sense, like RDNS_NONE
and RDNS_DYNAMIC. For some of them the GA result is close to the manually
assigned score, or may indicate a need for reconsidering the assigned score.
But I agree that more may need re-fixing.



bugzilla-daemon at bugzilla

Oct 27, 2009, 2:29 PM


--- Comment #138 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-27 14:29:03 UTC ---
(In reply to comment #134)
> Some of the spam in my corpora is from third parties. I do check it for correct
> classification before uploading, but I was wondering: how does masscheck
> determine the correct lastexternal for corpora containing messages from
> multiple different networks? Or does it assume all of the messages in a given
> contributor's corpora have the same network boundary? If the latter, I need to
> remove those third-party messages from my spam corpora...
>
> Might lastexternal confusion in the masschecks be contributing in some way to
> the odd RCVD_IN_* score generation?

I believe masscheck leaves internal/external/msa_networks at their
defaults, unless one cares to configure them correctly for one's corpus. And
I believe that it is more likely than not that some corpora were scanned
with unsuitable network settings. I know that configuring them for my
mass-check runs gave me a headache (but I did it right in the end).
Which is why I posted the following note on the ML at that time:


From: Mark Martinec <Mark.Martinec+sa [at] ijs>
To: dev [at] spamassassin
Subject: Re: SpamAssassin 3.3.0 mass-checks now starting
Date: Fri, 4 Sep 2009 21:46:59 +0200

Docs don't say where one is supposed to put a local.cf with
options which are ignored in masses/spamassassin/user_prefs
(like Bayes SQL options, DCC, Pyzor timeouts etc).

I tried to place local.cf into masses/spamassassin/, with
horrible results (some directives in local.cf were rejected as
invalid, as apparently plugins have not yet been loaded
at the time of parsing this file, but only later).

I finally placed it into ../rules/ as mylocal.cf, which
works as expected, but I wonder if this is the proper
solution. Should be documented I guess...



bugzilla-daemon at bugzilla

Oct 27, 2009, 3:00 PM


--- Comment #139 from Justin Mason <jm [at] jmason> 2009-10-27 15:00:50 UTC ---
(In reply to comment #137)
> Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
> section 4.2: 'comment out all "score" lines except for rules that you think
> the scores are accurate like carefully-vetted net rules, or 0.001 informational
> rules' which made perfect sense to me, so I did it for 50_scores.cf, except
> for a couple of rather obvious rules like _WHITELIST and similar, and the ones
> clearly indicated as 'indicators' only in the surrounding comments, or set to
> 0.001. Later I nailed a couple more. I followed a principle: when in doubt,
> leave it floating, it can be fixed later if necessary. It gives some insight
> into what GA 'thinks' about certain rules.

That's true. It's good to hear it's not a bug in the masses scripts, anyway ;)

> I think at least for some rules GA makes perfect sense, like RDNS_NONE
> and RDNS_DYNAMIC.

Yes, I agree, it's actually done a (surprisingly) good job with those.

> For some of them the GA result is close to the manually
> assigned score, or may indicate a need for reconsidering the assigned score.
> But I agree that more may need re-fixing.

cool.

In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
down', I feel, as users tend to 'compensate' or correct their scores more
frequently than other rules -- in my opinion. Also, if those are given low
scores by the GA, their operators tend to be annoyed, and it's not good to
annoy people who we're relying on ;)

It also reflects that those rules are slightly different, and hopefully
more reliable, than a typical body rule for example -- there's no way to
indicate this to the GA yet, so locking the rules is as good as we can do.



bugzilla-daemon at bugzilla

Oct 27, 2009, 3:04 PM

Post #118 of 165 (2244 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #140 from Justin Mason <jm [at] jmason> 2009-10-27 15:04:51 UTC ---
(In reply to comment #138)
> I believe the mass-checks leave internal/external/msa_networks at their
> defaults, unless one cares to configure them correctly for his corpus. And
> I believe that it is more likely than not that some corpora were scanned
> with unsuitable network settings. I know that configuring them for my
> mass-check runs gave me a headache (but I did it right in the end).

What should be happening, though, is that we're just underestimating the number
of -lastexternal rule hits -- the S/O should still be correct, but the overall
number of hits will be lower. Hopefully that will still provide a useful
estimate of accuracy.


> Docs don't say where one is supposed to put a local.cf with
> options which are ignored in masses/spamassassin/user_prefs
> (like Bayes SQL options, DCC, Pyzor timeouts etc).
>
> I tried to place local.cf into masses/spamassassin/, with
> horrible results (some directives in local.cf were reported as
> invalid, as apparently plugins have not yet been loaded
> at the time of parsing this file, but only later).
>
> I finally placed it into ../rules/ as mylocal.cf, which
> works as expected, but I wonder if this is the proper
> solution. Should be documented I guess...

yuck. bug 6227.



bugzilla-daemon at bugzilla

Oct 28, 2009, 9:02 AM

Post #119 of 165 (2200 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #141 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-28 09:02:40 UTC ---
>> But I agree that more may need re-fixing.
>
> cool.
> In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
> down', I feel, as users tend to 'compensate' or correct their scores more
> frequently than other rules -- in my opinion. Also, if those are given low
> scores by the GA, their operators tend to be annoyed, and it's not good to
> annoy people who we're relying on ;)
>
> It also reflects that those rules are slightly different, and hopefully
> more reliable, than a typical body rule for example -- there's no way to
> indicate this to the GA yet, so locking the rules is as good as we can do.

| It is quite possible that some of these hits are still false positives,
| despite several iterations of cleaning

I wonder how much the low score for some ham rules is affected by false
positives present in the spam* corpora. Here are some statistics for
the more prominent ham rules (i.e. the ones with negative scores).

For each rule the table shows the number of hits of this rule in each
corpus - both as a percentage of all entries in a file, and as absolute
counts. The entries standing out from the crowd that may need re-checking
are labeled with *** :

score ALL_TRUSTED -1.000
0.046 % 1/2194 spam-bayes-net-bb-kmcgrail
0.017 % 4/23761 spam-bayes-net-mmartinec
0.014 % 5/36941 spam-bayes-net-hege
0.001 % 1/81265 spam-bayes-net-bluestreak
0.000 % 1/931863 spam-bayes-net-dos

score BAYES_00 0 0 -1.2 -1.9
5.652 % 104/1840 spam-bayes-net-bb-jhardin ***
1.805 % 429/23761 spam-bayes-net-mmartinec
1.606 % 33/2055 spam-bayes-net-ahenry
0.439 % 357/81265 spam-bayes-net-bluestreak
0.374 % 138/36941 spam-bayes-net-hege
0.030 % 445/1489699 spam-bayes-net-jm
0.017 % 156/931863 spam-bayes-net-dos

score DCC_REPUT_00_12 0 -0.8 0 -0.4
0.164 % 39/23761 spam-bayes-net-mmartinec

score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
5.382 % 76/1412 spam-bayes-net-bb-guenther_fraud ***
0.272 % 5/1840 spam-bayes-net-bb-jhardin
0.091 % 2/2194 spam-bayes-net-bb-kmcgrail
0.059 % 14/23761 spam-bayes-net-mmartinec
0.049 % 18/36941 spam-bayes-net-hege
0.037 % 558/1489699 spam-bayes-net-jm
0.030 % 2/6728 spam-bayes-net-wt-en1
0.018 % 15/81265 spam-bayes-net-bluestreak
0.000 % 1/931863 spam-bayes-net-dos

score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
0.163 % 3/1840 spam-bayes-net-bb-jhardin ***
0.091 % 2/2194 spam-bayes-net-bb-kmcgrail
0.071 % 1/1412 spam-bayes-net-bb-guenther_fraud
0.003 % 1/36941 spam-bayes-net-hege
0.000 % 1/1489699 spam-bayes-net-jm

score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
1.250 % 23/1840 spam-bayes-net-bb-jhardin ***
(1.108 % 7/632 spam-bayes-net-binnocenti.OFF)
0.638 % 14/2194 spam-bayes-net-bb-kmcgrail
0.469 % 381/81265 spam-bayes-net-bluestreak
0.438 % 9/2055 spam-bayes-net-ahenry
0.223 % 15/6728 spam-bayes-net-wt-en1
0.214 % 79/36941 spam-bayes-net-hege
0.046 % 682/1489699 spam-bayes-net-jm
0.042 % 3/7185 spam-bayes-net-bb-zmi
0.013 % 3/23761 spam-bayes-net-mmartinec
0.010 % 2/19160 spam-bayes-net-wt-en4
0.003 % 29/931863 spam-bayes-net-dos

score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
16.153 % 240627/1489699 spam-bayes-net-jm ***
(9.810 % 62/632 spam-bayes-net-binnocenti.OFF)
1.739 % 32/1840 spam-bayes-net-bb-jhardin
1.600 % 591/36941 spam-bayes-net-hege
1.159 % 78/6728 spam-bayes-net-wt-en1
1.133 % 16/1412 spam-bayes-net-bb-guenther_fraud
0.925 % 19/2055 spam-bayes-net-ahenry
0.365 % 8/2194 spam-bayes-net-bb-kmcgrail
0.107 % 87/81265 spam-bayes-net-bluestreak
0.097 % 7/7185 spam-bayes-net-bb-zmi
0.022 % 201/931863 spam-bayes-net-dos
0.021 % 5/23761 spam-bayes-net-mmartinec
0.016 % 3/19160 spam-bayes-net-wt-en4

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
5.312 % 75/1412 spam-bayes-net-bb-guenther_fraud ***
0.030 % 2/6728 spam-bayes-net-wt-en1
0.029 % 7/23761 spam-bayes-net-mmartinec
0.029 % 435/1489699 spam-bayes-net-jm
0.015 % 12/81265 spam-bayes-net-bluestreak
0.003 % 1/36941 spam-bayes-net-hege
0.001 % 11/931863 spam-bayes-net-dos

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
0.059 % 4/6728 spam-bayes-net-wt-en1
0.054 % 1/1840 spam-bayes-net-bb-jhardin
0.033 % 27/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec
0.001 % 21/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
0.342 % 23/6728 spam-bayes-net-wt-en1 ***
0.054 % 1/1840 spam-bayes-net-bb-jhardin
0.049 % 1/2055 spam-bayes-net-ahenry
0.033 % 27/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec
0.002 % 26/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
0.342 % 23/6728 spam-bayes-net-wt-en1 ***
0.049 % 1/2055 spam-bayes-net-ahenry
0.000 % 4/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN_GT50 0 -0.219 0 -1.041
0.054 % 1/1840 spam-bayes-net-bb-jhardin

score RCVD_IN_IADB_DOPTIN 0
0.000 % 7/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_DOPTIN_LT50 0 -0.001 0 -0.001
0.026 % 21/81265 spam-bayes-net-bluestreak ***
0.001 % 15/1489699 spam-bayes-net-jm.log

score RCVD_IN_IADB_DOPTIN_GT50 0
0.007 % 6/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec

score RCVD_IN_IADB_ML_DOPTIN 0
0.000 % 2/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_UT_CPR_MAT 0 -0.001 0 -0.052
0.026 % 21/81265 spam-bayes-net-bluestreak ***
0.001 % 15/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_MI_CPR_MAT 0 -0.079 0 -0.001
0.026 % 21/81265 spam-bayes-net-bluestreak ***
0.001 % 15/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_LISTED 0 -1.144 0 -0.001
0.342 % 23/6728 spam-bayes-net-wt-en1 ***
0.054 % 1/1840 spam-bayes-net-bb-jhardin
0.049 % 1/2055 spam-bayes-net-ahenry
0.033 % 27/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec
0.002 % 26/1489699 spam-bayes-net-jm
0.000 % 1/931863 spam-bayes-net-dos

score RCVD_IN_IADB_SENDERID 0 -0.001 0 -0.001
0.208 % 14/6728 spam-bayes-net-wt-en1 ***
0.049 % 1/2055 spam-bayes-net-ahenry
0.033 % 27/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec
0.000 % 4/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_SPF 0 -0.006 0 -0.042
0.342 % 23/6728 spam-bayes-net-wt-en1 ***
0.054 % 1/1840 spam-bayes-net-bb-jhardin
0.049 % 1/2055 spam-bayes-net-ahenry
0.033 % 27/81265 spam-bayes-net-bluestreak
0.004 % 1/23761 spam-bayes-net-mmartinec
0.002 % 26/1489699 spam-bayes-net-jm

score RCVD_IN_IADB_VOUCHED 0 -1.718 0 -0.956
0
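
The percentages in the table above are simply hits divided by corpus size. As a minimal sketch (the function name and the dictionary layout are assumptions for illustration; the three sample rows are the BAYES_00 counts listed above), they can be reproduced like this:

```python
def hit_rate(hits: int, total: int) -> float:
    """Percentage of entries in one corpus file that hit the rule."""
    if total <= 0:
        raise ValueError("corpus size must be positive")
    return 100.0 * hits / total


# (hits, total) per corpus for BAYES_00, copied from the table above.
bayes_00 = {
    "spam-bayes-net-bb-jhardin": (104, 1840),
    "spam-bayes-net-mmartinec": (429, 23761),
    "spam-bayes-net-dos": (156, 931863),
}

# Sort by hit rate, highest first, and print in the table's layout.
ranked = sorted(bayes_00.items(), key=lambda kv: hit_rate(*kv[1]), reverse=True)
for corpus, (hits, total) in ranked:
    print(f"{hit_rate(hits, total):.3f} % {hits}/{total} {corpus}")
```

Deciding which rows deserve a *** marker was done by eye in the original table; the sketch does not attempt to automate that.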



bugzilla-daemon at bugzilla

Oct 28, 2009, 10:23 AM

Post #120 of 165 (2207 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #142 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-28 10:23:19 UTC ---
Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
false positives are due to freelotto.com mail. I wonder whether such
samples are rightfully in the spam* corpora - I'd say yes, but,
as they say, spam is about consent, not content, and people receiving
mail from freelotto.com most likely did register once, not realizing
what they are dealing with. So there was consent, at least initially.
It is also about fraud and advertising, so, should one leave such
mail samples in the spam corpus or not?



bugzilla-daemon at bugzilla

Oct 28, 2009, 10:41 AM

Post #121 of 165 (2216 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #143 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-28 10:41:31 UTC ---
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail.

Same applies to RCVD_IN_BSP_TRUSTED spam hits.



bugzilla-daemon at bugzilla

Oct 29, 2009, 6:33 PM

Post #122 of 165 (2167 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #144 from Warren Togami <wtogami [at] redhat> 2009-10-29 18:33:38 UTC ---
What is the next step in order to move forward?



bugzilla-daemon at bugzilla

Nov 4, 2009, 3:52 PM

Post #123 of 165 (1820 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz <antispam [at] khopis> changed:

What |Removed |Added
----------------------------------------------------------------------------
Attachment #4561|0 |1
is obsolete| |

--- Comment #145 from Adam Katz <antispam [at] khopis> 2009-11-04 15:52:15 UTC ---
Created an attachment (id=4564)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4564)
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat). It also
supports specifying the DateRev for the specific masscheck run. Since today's
run was sparse, here are yesterday's results.

$ ./sa33badrules.pl 20091103-r832343-n
S/O   RANK     HAM%    SPAM%  Score in attachment 4558  Rule
.008  .12    1.2401   0.0105  0.001                     MSGID_MULTIPLE_AT
.011  .22    0.3066   0.0035  0                         OBSCURED_EMAIL
.012  .25    0.2058   0.0025  0.000 2.099 0.001 1.212   MISSING_MIME_HB_SEP
.014  .17    0.5822   0.0080  0.001 0.001 0.699 0.699   TVD_RCVD_SPACE_BRACKET
.028  .20    0.4339   0.0125  unknown                   TVD_FUZZY_SECTOR
.042  .28    0.1732   0.0075  0                         SUBJECT_FUZZY_TION
.048  .77    4.4862   0.2279  -0.001                    SPF_HELO_PASS
.052  .29    0.1476   0.0080  1.494 1.699 1.591 1.516   X_IP
.055  .22    0.3914   0.0226  2.205 0.174 1.299 1.806   FRT_SOMA2
.062  .74    5.1484   0.3424  -0.001                    SPF_PASS
.077  .25    0.2643   0.0221  0.987 0.750 0.943 1.318   CTYPE_001C_B
.079  .36    0.0640   0.0055  0.001 0.001 0.605 0.378   HTML_NONELEMENT_30_40
.080  .28    0.1742   0.0151  0.001 2.499 0.268 0.516   DRUGS_MUSCLE
.084  .36    0.0660   0.0060  0                         FORGED_IMS_TAGS
.090  .32    0.1114   0.0110  0.033 0.001 0.365 0.413   WEIRD_PORT
.092  .21    0.8712   0.0878  1.499 0.419 0.904 0.798   MIME_BASE64_BLANKS
.102  .37    0.0577   0.0065  0                         HTML_IFRAME_SRC
.123  .34    0.0821   0.0115  0.003 0.978 0.100 1.515   TVD_FW_GRAPHIC_NAME_LONG
.128  .37    0.0614   0.0090  0                         RCVD_BAD_ID
.130  .29    0.1851   0.0276  0.001 0.020 0.001 1.799   MIME_BASE64_TEXT
.178  .28    0.4948   0.1069  0 1.200 0 2.514           SPF_HELO_FAIL
.202  .32    0.1590   0.0402  0.1                       ANY_BOUNCE_MESSAGE
.205  .35    0.0817   0.0211  2.199 1.622 2.199 1.086   LONGWORDS
.213  .34    0.1186   0.0321  0                         BLANK_LINES_80_90
.216  .32    0.1474   0.0407  2.199 2.199 1.246 2.090   WEIRD_QUOTING
.218  .32    0.1445   0.0402  0.1                       BOUNCE_MESSAGE
.223  .30    0.7605   0.2179  1.799 0.572 1.182 1.138   HTML_IMAGE_RATIO_06
.241  .34    1.3973   0.4438  1.0                       EXTRA_MPART_TYPE
.254  .34    0.1222   0.0417  0.001 2.185 1.936 0.476   FRT_SOMA
.283  .33    0.6883   0.2711  0.539 0.001 0.332 0.488   MIME_HTML_MOSTLY
.299  .36    0.0908   0.0387  0.799 0.001 0.711 0.026   TVD_FW_GRAPHIC_NAME_MID
.303  .34    0.4938   0.2143  1.899 0.496 0.950 0.445   HTML_IMAGE_RATIO_08
.367  .40    1.2775   0.7409  0.001                     TVD_SPACE_RATIO
.379  .37    0.3182   0.1943  0.023 0.887 0.000 0.417   UPPERCASE_50_75
.434  .39    0.3261   0.2505  3.099 1.823 1.802 1.998   BAD_ENC_HEADER
.436  .46   15.3798  11.8920  0.001                     FREEMAIL_FROM
.454  .41    0.5503   0.4573  2.260 0.742 1.199 0.640   MPART_ALT_DIFF
.516  .47    3.6581   3.9024  0.001                     MIME_QP_LONG_LINE
.655  .51    1.9537   3.7036  1.154 1.677 1.198 1.453   SUBJ_ALL_CAPS
.665  .49   42.2269  83.7383  0.001                     HTML_MESSAGE
.692  .52    1.1850   2.6580  0.001                     UNPARSEABLE_RELAY
.922  .58    1.1584  13.7423  0 1.322 0 1.237           RCVD_IN_BL_SPAMCOP_NET
.935  .57    3.5421  50.6034  2.199 0.955 1.215 0.549   MIME_HTML_ONLY
.970  .52    1.5729  51.1430  0 1.1 0 0.7               RDNS_NONE

Note, I hacked RDNS_NONE so that it removes the Enron hits.

"Problem" rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and
BAD_ENC_HEADER (scored 3.099?!).
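
For readers unfamiliar with the S/O column: the values in the table above are consistent with S/O = SPAM% / (SPAM% + HAM%), so a value near 0 means the rule hits mostly ham and a value near 1 means it hits mostly spam. The following Python sketch reproduces three rows under that assumption; it is not the code of sa33badrules.pl, and the function name is made up for this example.

```python
def s_over_o(ham_pct: float, spam_pct: float) -> float:
    """Fraction of a rule's (percentage-weighted) hits that fall on spam."""
    total = ham_pct + spam_pct
    if total == 0:
        raise ValueError("rule has no hits in either corpus")
    return spam_pct / total


# (rule, HAM%, SPAM%) copied from the table above.
rows = [
    ("MSGID_MULTIPLE_AT", 1.2401, 0.0105),
    ("SPF_PASS", 5.1484, 0.3424),
    ("RDNS_NONE", 1.5729, 51.1430),
]

for rule, ham, spam in rows:
    print(f"{s_over_o(ham, spam):.3f} {rule}")
```

A positively scored rule with an S/O well below 0.5, such as X_IP or FRT_SOMA2 above, is the kind of entry the checker is meant to surface.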

Food for thought: while it's good to create workarounds for the problematic
outcomes from the genetic algorithm, I think that these should be examples with
which to troubleshoot the algorithm itself. While this might just be an early
sign of over-fitting (which is largely fine as long as we comb through the
results with scripts like this), it might also be indicative of a problem in
the system's prioritization.



bugzilla-daemon at bugzilla

Nov 6, 2009, 12:33 PM

Post #124 of 165 (1756 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec <Mark.Martinec [at] ijs> changed:

What |Removed |Added
----------------------------------------------------------------------------
Attachment #4558|0 |1
is obsolete| |

--- Comment #146 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-06 12:33:41 UTC ---
Created an attachment (id=4565)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4565)
resulting 50_scores.cf from garescorer runs - V5

A new run; this time I left the URIBL whitelists and similar rules fixed
(at their relatively high manual scores), as they are in the current 50_scores.cf.



bugzilla-daemon at bugzilla

Nov 6, 2009, 12:38 PM

Post #125 of 165 (1765 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #147 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-06 12:38:36 UTC ---
Corresponding GA summaries ($ head test scores):

gen-set3-20-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21171 99.93%
# Correctly spam: 43624 98.84%
# False positives: 15 0.07%
# False negatives: 510 1.16%
# TCR(l=50): 35.026984 SpamRecall: 98.844% SpamPrec: 99.966%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144 32.193% (99.979% of non-spam corpus)
# Correctly spam: 349846 66.982% (98.794% of spam corpus)
# False positives: 35 0.007% (0.021% of nonspam, 8289 weighted)
# False negatives: 4270 0.818% (1.206% of spam, 13858 weighted)
# Average score for spam: 21.3 nonspam: -3.2
# Average for false-pos: 5.6 false-neg: 3.2
# TOTAL: 522295 100.00%


gen-set2-10-5.0-6500-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21149 99.83%
# Correctly spam: 41755 94.61%
# False positives: 37 0.17%
# False negatives: 2379 5.39%
# TCR(l=50): 10.436037 SpamRecall: 94.610% SpamPrec: 99.911%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167927 32.152% (99.850% of non-spam corpus)
# Correctly spam: 335063 64.152% (94.620% of spam corpus)
# False positives: 252 0.048% (0.150% of nonspam, 29229 weighted)
# False negatives: 19053 3.648% (5.380% of spam, 68835 weighted)
# Average score for spam: 11.1 nonspam: -1.0
# Average for false-pos: 5.5 false-neg: 3.6
# TOTAL: 522295 100.00%


gen-set1-10-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21151 99.83%
# Correctly spam: 43145 97.76%
# False positives: 35 0.17%
# False negatives: 989 2.24%
# TCR(l=50): 16.113180 SpamRecall: 97.759% SpamPrec: 99.919%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168009 32.167% (99.899% of non-spam corpus)
# Correctly spam: 346230 66.290% (97.773% of spam corpus)
# False positives: 170 0.033% (0.101% of nonspam, 20632 weighted)
# False negatives: 7886 1.510% (2.227% of spam, 22952 weighted)
# Average score for spam: 20.1 nonspam: -1.5
# Average for false-pos: 5.8 false-neg: 2.9
# TOTAL: 522295 100.00%


gen-set0-5-5.0-14000-ga-best
==> test <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 20925 98.77%
# Correctly spam: 36049 81.68%
# False positives: 261 1.23%
# False negatives: 8085 18.32%
# TCR(l=50): 2.088195 SpamRecall: 81.681% SpamPrec: 99.281%
==> scores <==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166235 31.828% (98.844% of non-spam corpus)
# Correctly spam: 288300 55.199% (81.414% of spam corpus)
# False positives: 1944 0.372% (1.156% of nonspam, 128482 weighted)
# False negatives: 65816 12.601% (18.586% of spam, 202271 weighted)
# Average score for spam: 10.5 nonspam: 0.6
# Average for false-pos: 6.3 false-neg: 3.1
# TOTAL: 522295 100.00%
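
The SpamRecall, SpamPrec and TCR figures in the "test" blocks above can be recomputed from the four counts. The sketch below assumes the usual total cost ratio definition, TCR = N_spam / (lambda * FP + FN) with lambda = 50; this matches the printed values, but the function and variable names are illustrative and not taken from the masses scripts.

```python
def ga_summary(tp: int, fp: int, fn: int, lam: float = 50.0):
    """Return (recall, precision, tcr) for spam classification counts.

    tp: correctly spam, fp: false positives (ham flagged as spam),
    fn: false negatives (spam missed), lam: cost weight of one false positive.
    """
    n_spam = tp + fn
    recall = tp / n_spam
    precision = tp / (tp + fp)
    cost = lam * fp + fn
    tcr = float("inf") if cost == 0 else n_spam / cost
    return recall, precision, tcr


# Counts from the "test" block of gen-set3-20-5.0-14000-ga-best above.
recall, precision, tcr = ga_summary(tp=43624, fp=15, fn=510)
print(f"TCR(l=50): {tcr:.6f} SpamRecall: {recall * 100:.3f}% "
      f"SpamPrec: {precision * 100:.3f}%")
```

Because each false positive is weighted 50 times as heavily as a false negative, the difference between 15 and 37 false positives accounts for a large share of the TCR gap between set 3 and set 2.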

