
Mailing List Archive: SpamAssassin: devel

[Bug 6155] generate new scores for 3.3.0 release

 

 



bugzilla-daemon at bugzilla

Nov 6, 2009, 4:31 PM

Post #126 of 165 (2053 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #148 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-06 16:31:11 UTC ---
Created an attachment (id=4566)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4566)
GA cost vs. iterations

Here is a somewhat interesting diagram, showing how the 'cost' as optimized by
GA is minimized through iterations. Data comes from the nohup.out log file,
where each GA iteration looks like:

Pop size, replacement: 50 33

Adapt (t, fneg, fneg_add, fpos, fpos_add): 1250 4776 0 0 0
Adapt (over, cross, repeat): 1 1 4131
Performance: 0.672 iterations/s, iteration no. 10900

# SUMMARY for threshold 5.0:
# Correctly non-spam: 168144 32.193% (99.979% of non-spam corpus)
# Correctly spam: 349845 66.982% (98.794% of spam corpus)
# False positives: 35 0.007% (0.021% of nonspam, 8290 weighted)
# False negatives: 4271 0.818% (1.206% of spam, 13863 weighted)
# Average score for spam: 21.1 nonspam: -3.2
# Average for false-pos: 5.6 false-neg: 3.2
# TOTAL: 522295 100.00%

From the above, the extracted data for this iteration is:
- iteration count: 10900
- FP weighted: 8290
- FN weighted: 13863
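
The extraction described above can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the chart: the function name `extract_costs` and the regular expressions are assumptions, written against the log excerpt quoted in this comment.

```python
import re

# Patterns are assumptions based on the nohup.out excerpt shown above.
ITER_RE = re.compile(r"iteration no\.\s+(\d+)")
FP_RE = re.compile(r"^# False positives:.*?(\d+) weighted\)")
FN_RE = re.compile(r"^# False negatives:.*?(\d+) weighted\)")


def extract_costs(lines):
    """Yield (iteration, fp_weighted, fn_weighted) per GA summary block."""
    iteration = fp = None
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            # A new iteration starts; forget any partial data.
            iteration, fp = int(m.group(1)), None
            continue
        m = FP_RE.match(line)
        if m:
            fp = int(m.group(1))
            continue
        m = FN_RE.match(line)
        if m and iteration is not None and fp is not None:
            yield iteration, fp, int(m.group(1))
            iteration = fp = None


SAMPLE = """\
Performance: 0.672 iterations/s, iteration no. 10900

# SUMMARY for threshold 5.0:
# False positives: 35 0.007% (0.021% of nonspam, 8290 weighted)
# False negatives: 4271 0.818% (1.206% of spam, 13863 weighted)
"""

points = list(extract_costs(SAMPLE.splitlines()))
```

Feeding the whole log through `extract_costs` would give one point per iteration for each of the two plotted lines.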

So the chart plots FP weighted and FN weighted cost against iteration count.
Each of the four colours corresponds to one set (set3: net+bayes,
set2: nonet+bayes, set1: net+nobayes, set0: nonet+nobayes).
The thicker line of each pair is a FP line, the thinner is a FN line.

The purpose of the chart is to determine whether the chosen max iterations
limit is sensible: further iterations should still bring some benefit
without running into overfitting or wasting too much time.

One safety valve against overfitting is to check whether the 10% test
sample produces results similar to those of the learning set (90%).
The other test I made is to repeat the runs with a limit of about
5000 iterations (instead of 14000) and compare the results - which
are indeed similar.
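
A rough sketch of that 90%/10% check, in Python. All names here (`split_corpus`, `looks_overfit`) and the `tolerance` factor are hypothetical; the thread does not say how "similar" was judged, so the threshold is purely an assumption.

```python
import random


def split_corpus(messages, test_fraction=0.1, seed=0):
    """Shuffle and split into (learning set, test set)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def looks_overfit(train_rates, test_rates, tolerance=2.0):
    """True if any test error rate exceeds tolerance x the training rate.

    Each argument is a (fp_rate, fn_rate) pair; tolerance is an
    arbitrary illustrative threshold, not a value from this bug.
    """
    return any(test > tolerance * train
               for train, test in zip(train_rates, test_rates))


train, test = split_corpus(range(1000))
```

If the scores generalize, the FP and FN rates measured on `test` should stay close to those measured on `train`.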

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


bugzilla-daemon at bugzilla

Nov 6, 2009, 4:34 PM

Post #127 of 165 (2056 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #149 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-06 16:34:46 UTC ---
Created an attachment (id=4567)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4567)
Scaled diagram of the previous one, only sets 3 and 1 shown

Here is the same diagram as above, but scaled so as not to be compressed
by the poor results of set 0. Also, only two score sets are shown: 1 and 3,
i.e. both sets with network tests, without and with bayes.



bugzilla-daemon at bugzilla

Nov 7, 2009, 1:33 PM

Post #128 of 165 (2041 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #150 from Justin Mason <jm [at] jmason> 2009-11-07 13:33:19 UTC ---
(In reply to comment #146)
> Created an attachment (id=4565)
> --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4565)
> resulting 50_scores.cf from garescorer runs - V5
>
> A new run, this time I left the URIBL whitelists and similar fixed
> (at their relatively high manual scores) as they were in current 50_scores.cf

After a little examination, they look good to me! +1 to check in.

RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour
overlapping between XBL and PBL, though.

RCVD_IN_SBL is _very_ low in set 3 too, bizarre!

otherwise I can't see any issues....



btw if you feel like cranking up the max gens, go for it. fwiw,
spamassassin2.zones has a very powerful CPU -- if it's taking too long on your
own machine, try scping stuff up and running it there.



bugzilla-daemon at bugzilla

Nov 7, 2009, 3:46 PM

Post #129 of 165 (2042 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #151 from Warren Togami <wtogami [at] redhat> 2009-11-07 15:46:54 UTC ---
Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
number of major ISPs. Since it did, RCVD_IN_PSBL has been almost completely
devoid of FPs in our weekly masschecks for 5 weeks straight. I am confident
that PSBL is safer than measured during the rescore masscheck.

http://ruleqa.spamassassin.org/20090829-r809102-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090926-r819101-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091003-r821273-n/RCVD_IN_PSBL/detail
(below this point FP rate dropped to nearly zero)
http://ruleqa.spamassassin.org/20091010-r823821-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091017-r826198-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091024-r829323-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091031-r831520-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091107-r833654-n/RCVD_IN_PSBL/detail
You can plainly see steady and sustained improvement in FP safety in these past
weeks.

RCVD_IN_PSBL in the rescore masscheck was without lastexternal. Clearly with
the added limitation of lastexternal it is safer than measured.



bugzilla-daemon at bugzilla

Nov 8, 2009, 4:36 PM

Post #130 of 165 (2023 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #152 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-08 16:36:24 UTC ---
> > A new run, this time I left the URIBL whitelists and similar fixed
> > (at their relatively high manual scores) as they were in current
> > 50_scores.cf

Or to say it better: unlike my previous runs where I commented out most
scores in the existing 50_scores.cf (thus making them mutable, regardless
of a <gen:mutable> markup) except for a couple of exceptions, this time
I did not comment-out scores, and let <gen:mutable> markup do its job.
So this is now closer to how the GA was intended to be run.

> After a little examination, they look good to me! +1 to check in.

Thanks. I'm sure we can still do some manual tweaks and improvements,
but perhaps we can indeed freeze the rest to automatically assigned scores
in this run.

> btw if you feel like cranking up the max gens, go for it. fwiw,
> spamassassin2.zones has a very powerful CPU -- if it's taking too long
> on your own machine, try scping stuff up and running it there.

My office workstation is quite beefy too, and I hope we won't need to do
many further runs, so for now I'd just stick to what I'm familiar with.
Btw, my set3 run at 14000 iterations takes 5 hours, similar for set1, the
other two are much faster (less than 30 minutes each). I just let it run
overnight, so it wouldn't matter if it takes half that time. I did some
previous runs at 30000 iterations, and a diagram (like the one attached
earlier) does not show noticeable improvements beyond about 10000, or even
small worsening by the end, so the 14000 limit seems reasonable. And the
GA algorithms are said to be prone to overfitting, so it's probably prudent
not to go too far.



> RCVD_IN_XBL is still surprisingly low -- I bet there's some additive
> behaviour overlapping between XBL and PBL, though.
> RCVD_IN_SBL is _very_ low in set 3 too, bizarre!
> otherwise I can't see any issues....

| Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
| rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
| number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has
| been almost completely devoid of FP's in our weekly masschecks. I am
| confident that PSBL performs safer than measured during the rescore masscheck

Ok, I suggest we collect some manual fixes like the ones suggested here
(with specific score suggestions), and wrap it up.



bugzilla-daemon at bugzilla

Nov 9, 2009, 3:40 PM

Post #131 of 165 (2008 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz <antispam [at] khopis> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4564|0                           |1
        is obsolete|                            |

--- Comment #153 from Adam Katz <antispam [at] khopis> 2009-11-09 15:40:31 UTC ---
Created an attachment (id=4568)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4568)
Checker for rules that match more ham than spam

Collected selections from several more runs of my script. I took the last
three days' worth of masschecks plus the run last week, hand-picked rules with
a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat
offenders. This is the list, with each rule's worst S/O of any run:

S/O   RANK  HAM%    SPAM%   Score attachment 4565        Rule
.002  .14   1.2650  0.0024  0.001  0.001  0.131  0.700   TVD_RCVD_SPACE_BRACKET
.002  .23   0.4472  0.0008  0.000  2.099  0.001  1.711   MISSING_MIME_HB_SEP
.019  .22   0.2529  0.0049  1.482  0.855  2.399  2.399   FUZZY_CPILL
.019  .29   0.2809  0.0056  0.001  1.699  1.498  1.699   X_IP
.046  .22   0.4010  0.0193  2.385  0.345  0.998  2.503   FRT_SOMA2
.077  .25   0.2643  0.0221  0.551  1.026  1.033  1.250   CTYPE_001C_B
.092  .21   0.8712  0.0878  0.699  0.332  0.480  0.800   MIME_BASE64_BLANKS
.095  .31   0.2735  0.0286  2.200  2.199  0.540  2.199   WEIRD_QUOTING
.178  .28   0.4948  0.1069  0      0.973  0      2.385   SPF_HELO_FAIL
.195  .29   0.8975  0.2173  1.799  0.579  0.901  0.882   HTML_IMAGE_RATIO_06
.241  .34   1.4248  0.4529  1.0                          EXTRA_MPART_TYPE

I don't think it wise to release with these scores quite so high. I propose we
score them all 0.1 or 0.001 so as to not hold up the release and bookmark the
issue (likely a bug in the GA, probably best registered as its own bugzilla
bug) for dealing with later.
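
For readers who want to reproduce this kind of check, here is a minimal Python sketch. It is not the attached script: the function names, the choice of `max(scores)` as "the" score, and the exact cut-offs are assumptions. The S/O formula used, SPAM% / (SPAM% + HAM%), reproduces the S/O values in the tables of this thread.

```python
def s_o(spam_pct, ham_pct):
    """Spam/overall ratio from per-corpus hit percentages."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def suspicious(rules, min_score=1.0, max_so=0.25):
    """Return (name, S/O) for rules scored high but hitting mostly ham.

    rules: iterable of (name, ham_pct, spam_pct, scores).
    Using the highest of a rule's scores is an illustrative choice.
    """
    flagged = []
    for name, ham_pct, spam_pct, scores in rules:
        ratio = s_o(spam_pct, ham_pct)
        if max(scores) >= min_score and ratio <= max_so:
            flagged.append((name, round(ratio, 3)))
    return flagged


# Rows taken from the masscheck tables in this thread.
rows = [
    ("FUZZY_CPILL", 0.2529, 0.0049, (1.482, 0.855, 2.399, 2.399)),
    ("MIME_HTML_ONLY", 3.6172, 51.2001, (2.199, 1.105, 1.199, 0.723)),
]
flagged = suspicious(rows)
```

With these two rows only FUZZY_CPILL is flagged; MIME_HTML_ONLY has a high score but also a high S/O, so it passes.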


Additionally, I've updated my script to do the reverse - seek out negatively
scored rules that hit more spam than ham. This doesn't currently find anything
beyond SPF_PASS (due to having >=1% spam hits, while it was previously found
for having ham>spam), but it does prevent listing SPF_HELO_PASS and
theoretically will help find poorly-written ham rules in the future.



bugzilla-daemon at bugzilla

Nov 11, 2009, 11:38 AM

Post #132 of 165 (1962 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #154 from Warren Togami <wtogami [at] redhat> 2009-11-11 11:38:13 UTC ---
(In reply to comment #152)
>
> | Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
> | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> | number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has
> | been almost completely devoid of FP's in our weekly masschecks. I am
> | confident that PSBL performs safer than measured during the rescore masscheck
>
> Ok, I suggest we collect some manual fixes like the ones suggested here
> (with specific score suggestions), and wrap it up.

Let's just go ahead with committing as jm suggested in Comment #150 and make
the manual adjustments after that in separate commits each with explanations.

RCVD_IN_PSBL I suggest 2.7 for both network sets.

Adam Katz in Comment #153 makes a good argument for reducing those rules to
informational. Any comments on that?



bugzilla-daemon at bugzilla

Nov 11, 2009, 2:13 PM

Post #133 of 165 (1964 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #155 from Justin Mason <jm [at] jmason> 2009-11-11 14:13:16 UTC ---
(In reply to comment #154)
> (In reply to comment #152)
> >
> > | Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
> > | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> > | number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has
> > | been almost completely devoid of FP's in our weekly masschecks. I am
> > | confident that PSBL performs safer than measured during the rescore masscheck
> >
> > Ok, I suggest we collect some manual fixes like the ones suggested here
> > (with specific score suggestions), and wrap it up.
>
> Let's just go ahead with committing as jm suggested in Comment #150 and make
> the manual adjustments after that in separate commits each with explanations.
>
> RCVD_IN_PSBL I suggest 2.7 for both network sets.
>
> Adam Katz in Comment #153 makes a good argument for reducing those rules to
> informational. Any comments on that?

+1 to all ;)



bugzilla-daemon at bugzilla

Nov 11, 2009, 3:42 PM

Post #134 of 165 (1955 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #156 from Warren Togami <wtogami [at] redhat> 2009-11-11 15:42:49 UTC ---
I might have to eat my words. Applying these new scores did not improve my own
statistics.

ORIGINAL SCORES
./fp-fn-statistics -s 3 (wt-* 20091107 weekly logs)

# SUMMARY for threshold 5.0:
# Correctly non-spam: 29677 99.82%
# Correctly spam: 21106 90.42%
# False positives: 54 0.18%
# False negatives: 2235 9.58%
# TCR(l=50): 4.729686 SpamRecall: 90.425% SpamPrec: 99.745%

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c146
GA SCORES
./fp-fn-statistics -s 3 (wt-* 20091107 weekly logs)

# SUMMARY for threshold 5.0:
# Correctly non-spam: 29624 99.64%
# Correctly spam: 21039 90.14%
# False positives: 107 0.36%
# False negatives: 2302 9.86%
# TCR(l=50): 3.050314 SpamRecall: 90.138% SpamPrec: 99.494%
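
The TCR, SpamRecall and SpamPrec figures in the two summaries above can be reproduced from the raw counts. The Python sketch below is not the fp-fn-statistics code; the function name `summarize` is made up, and the formulas are simply the ones that match the printed numbers (TCR = spam total / (lambda * FP + FN), with lambda = 50).

```python
def summarize(ham_ok, spam_ok, fp, fn, lam=50):
    """Recompute summary figures from the four confusion-matrix counts."""
    n_spam = spam_ok + fn          # all spam messages
    n_ham = ham_ok + fp            # all non-spam messages
    return {
        "tcr": n_spam / (lam * fp + fn),       # a false positive costs lam times a false negative
        "recall": spam_ok / n_spam,            # SpamRecall
        "precision": spam_ok / (spam_ok + fp), # SpamPrec
        "fp_rate": fp / n_ham,
    }


original = summarize(ham_ok=29677, spam_ok=21106, fp=54, fn=2235)
ga = summarize(ham_ok=29624, spam_ok=21039, fp=107, fn=2302)
```

Because each false positive is weighted 50 times as heavily as a false negative, the doubling of FPs from 54 to 107 dominates the drop in TCR, even though the FN count barely moves.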

(In reply to comment #153)
> Created an attachment (id=4568)
> --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4568)
> Checker for rules that match more ham than spam
>
> Collected selections from several more runs of my script. I took the last
> three days' worth of masschecks plus the run last week, hand-picked rules with
> a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat
> offenders. This is the list, with each rule's worst S/O of any run:
>
> S/O   RANK  HAM%    SPAM%   Score attachment 4565        Rule
> .195  .29   0.8975  0.2173  1.799  0.579  0.901  0.882   HTML_IMAGE_RATIO_06

score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

Is it logical to zero out HTML_IMAGE_RATIO_06 when these others have scores?
It feels like either our corpus sample was not large and varied enough, or
we are doing something else wrong. These particular rules had much lower
scores from the 3.2.0 GA.

> S/O   RANK  HAM%    SPAM%   Score attachment 4565        Rule
> .241  .34   1.4248  0.4529  1.0                          EXTRA_MPART_TYPE

I suppose this is the clearest case of a rule we should zero out.



bugzilla-daemon at bugzilla

Nov 12, 2009, 10:08 AM

Post #135 of 165 (1943 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #157 from Warren Togami <wtogami [at] redhat> 2009-11-12 10:07:55 UTC ---
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP (Bug #5920 appears not fixed as claimed.)
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
EXTRA_MPART_TYPE

It appears to be correct to zero out these rules, or at least make them
informational.

spamassassin-3.2.5
score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001

attachment 4565
resulting 50_scores.cf from garescorer runs - V5
score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

The old scores showed a more linear relationship, with a sharp drop-off between
_04 and _06. Our masscheck results indicate _02 and _04 hit on more spam than
ham, but _06 and _08 are pretty worthless. I think we should zero out _06 and
_08 while reducing the scores of _02 and _04.



bugzilla-daemon at bugzilla

Nov 12, 2009, 4:20 PM

Post #136 of 165 (1924 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #158 from Adam Katz <antispam [at] khopis> 2009-11-12 16:20:15 UTC ---
(In reply to comment #157)
> spamassassin-3.2.5
> score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
> score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
> score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
> score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001
>
> attachment 4565
> resulting 50_scores.cf from garescorer runs - V5
> score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
> score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
> score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
> score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021
>
> The old scores showed a more linear relationship, with a sharp drop-off
> between _04 and _06. Our masscheck results indicate _02 and _04 hit on
> more spam than ham, but _06 and _08 are pretty worthless. I think we
> should zero out _06 and _08 while reducing the scores of _02 and _04.

I didn't mention _08 because it wasn't a remarkable enough margin of HAM > SPAM
(my script only reports if HAM% + 0.05 > SPAM%) and my hand-sampling utilized
S/O ratios under .250 while this rule is .320. Still, it has the problem:

SPAM%   HAM%    S/O    RANK  SCORE  NAME                 DateRev
0.2709  0.5491  0.330  0.34  0.20   HTML_IMAGE_RATIO_08  20091111-r834803-n
0.2717  0.5492  0.331  0.34  0.20   HTML_IMAGE_RATIO_08  20091110-r834389-n
0.2672  0.5493  0.327  0.34  0.20   HTML_IMAGE_RATIO_08  20091109-r833997-n
0.2075  0.4995  0.294  0.34  0.20   HTML_IMAGE_RATIO_08  20091104-r832683-n
0.2548  0.5476  0.318  0.34  0.20   HTML_IMAGE_RATIO_08  20091028-r830464-n

Here are the results from the 20091111-r834803-n set, pruning only rules
scoring under 0.2 (all hits from my last report are present and asterisked):

S/O   RANK  HAM%    SPAM%    Score in attachment 4565     Rule
.014  .15   0.6328   0.0093  0.001  0.001  0.131  0.700   TVD_RCVD_SPACE_BRACKET*
.015  .24   0.1927   0.0029  0.000  2.099  0.001  1.711   MISSING_MIME_HB_SEP*
.019  .22   0.2528   0.0049  1.482  0.855  2.399  2.399   FUZZY_CPILL*
.043  .29   0.1298   0.0059  0.001  1.699  1.498  1.699   X_IP*
.075  .35   0.0603   0.0049  0.000  0.001  0.308  0.001   HTML_NONELEMENT_30_40
.092  .21   0.8123   0.0825  0.699  0.332  0.480  0.800   MIME_BASE64_BLANKS*
.106  .25   0.2483   0.0293  0.551  1.026  1.033  1.250   CTYPE_001C_B*
.123  .33   0.0837   0.0117  0.001  0.648  0.836  1.293   TVD_FW_GRAPHIC_NAME_LONG
.123  .28   0.1632   0.0229  0.001  2.499  0.392  0.164   DRUGS_MUSCLE(*)
.130  .25   0.3663   0.0547  2.385  0.345  0.998  2.503   FRT_SOMA2*
.155  .29   0.1736   0.0317  0.001  0.001  0.001  1.741   MIME_BASE64_TEXT
.188  .27   0.4622   0.1069  0      0.973  0      2.385   SPF_HELO_FAIL*
.214  .31   0.1449   0.0395  2.200  2.199  0.540  2.199   WEIRD_QUOTING*
.239  .30   0.8321   0.2612  1.799  0.579  0.901  0.882   HTML_IMAGE_RATIO_06*
.254  .34   1.3070   0.4442  1.0                          EXTRA_MPART_TYPE*
.330  .34   0.5491   0.2709  1.410  0.351  0.874  0.021   HTML_IMAGE_RATIO_08
.363  .38   1.0856   0.6194  2.600  2.070  1.233  3.405   DATE_IN_PAST_96_XX
.368  .36   0.3029   0.1767  0.001  0.791  0.001  0.008   UPPERCASE_50_75
.381  .37   0.6473   0.3983  0.354  0.001  0.725  0.428   MIME_HTML_MOSTLY
.660  .51   1.8514   3.5893  0.518  1.625  1.197  1.506   SUBJ_ALL_CAPS
.905  .58   1.0822  10.2987  0      1.246  0      1.347   RCVD_IN_BL_SPAMCOP_NET
.934  .56   3.6172  51.2001  2.199  1.105  1.199  0.723   MIME_HTML_ONLY
.957  .52   2.2200  50.3063  2.399  1.274  1.228  0.793   RDNS_NONE

DRUGS_MUSCLE met all the requirements I set for my last report, but I removed
it because it had almost no hits anyway, and it scored very very low except on
net+no-bayes, so I was assuming it had some justification there somehow.



bugzilla-daemon at bugzilla

Nov 16, 2009, 4:28 PM

Post #137 of 165 (1730 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #159 from Justin Mason <jm [at] jmason> 2009-11-16 16:27:51 UTC ---
will we go ahead and check in those scores, anyway? that would allow another
beta (soon).

re: HTML_IMAGE_RATIO_* -- it's very common for that kind of "multi-valued" set
of rules to wind up with nonintuitive scoring. This happens from either low
hitrates or hitting alongside other (better) rules.



bugzilla-daemon at bugzilla

Nov 16, 2009, 6:28 PM

Post #138 of 165 (1731 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #160 from Warren Togami <wtogami [at] redhat> 2009-11-16 18:28:03 UTC ---
(In reply to comment #142)
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail. I wonder whether such
> samples are rightfully in the spam* corpora - I'd say yes, but,
> as they say, spam is about consent, not content, and people receiving
> mail from freelotto.com most likely did register once, not realizing
> what they are dealing with. So there was a consent, at least initially.
> It is also about fraud and advertising, so, should one leave such
> mail samples in the spam corpus or not?

Perhaps we should explicitly exclude known sketchy senders like freelotto.com
from HABEAS_ACCREDITED_SOI. This would let us monitor for clear violators
more easily, without being distracted by the common FPs. Exclusion in this
case only brings the listed sender back to neutral, which is pretty clearly a
good idea.

Any objections? Otherwise I'll file a separate bug for this.



bugzilla-daemon at bugzilla

Nov 16, 2009, 7:27 PM

Post #139 of 165 (1743 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #161 from Warren Togami <wtogami [at] redhat> 2009-11-16 19:27:50 UTC ---
-score RDNS_NONE 0.1
-score RDNS_DYNAMIC 0.1
+# score RDNS_NONE 0 1.1 0 0.7
+# score RDNS_DYNAMIC 0 0.5 0 0.5

These are supposed to be informational rules according to the comment. Is this
supposed to become commented out? Doesn't commented out mean 1 point?



bugzilla-daemon at bugzilla

Nov 16, 2009, 9:28 PM

Post #140 of 165 (1725 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #162 from Warren Togami <wtogami [at] redhat> 2009-11-16 21:28:44 UTC ---
fp-fn-statistics across the entire "rescore" logs.

Set 3 Before
===========
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703647 99.90%
# Correctly spam: 2559525 98.28%
# False positives: 719 0.10%
# False negatives: 44795 1.72%
# TCR(l=50): 32.253638 SpamRecall: 98.280% SpamPrec: 99.972%

Set 3 Raw Rescoring from Comment #146
==================================
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703520 99.88%
# Correctly spam: 2548134 97.84%
# False positives: 846 0.12%
# False negatives: 56186 2.16%
# TCR(l=50): 26.443555 SpamRecall: 97.843% SpamPrec: 99.967%

Doesn't look like an improvement.

Set 3 + Rescore + Reductions
==========================
# SUMMARY for threshold 5.0:
# Correctly non-spam: 704002 99.95%
# Correctly spam: 2558896 98.26%
# False positives: 364 0.05%
# False negatives: 45424 1.74%
# TCR(l=50): 40.932981 SpamRecall: 98.256% SpamPrec: 99.986%

Looks like a statistically insignificant improvement over the old scores. I
only hope our corpora were sufficiently varied.

Rules Made Informational
======================
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP (Bug #5920 appears not fixed as claimed.)
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
HTML_IMAGE_RATIO_06
HTML_IMAGE_RATIO_08

Other Changes
============
* EXTRA_MPART_TYPE was left as 1.0 because while it does relatively poorly in
the weekly masscheck, it did far better in the rescore masscheck.
* I am increasing the scores of PSBL *after* the above fp-fn-statistics run
because the old logs do not reflect its current safety level.

I am committing these changes now. I suspect the key to these reductions is
getting rid of the rules that wouldn't have passed our ruleqa auto-promotion
criteria? There might be additional tweaks to make. Please comment here.



bugzilla-daemon at bugzilla

Nov 16, 2009, 10:59 PM

Post #141 of 165 (1723 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #163 from Warren Togami <wtogami [at] redhat> 2009-11-16 22:58:57 UTC ---
http://hudson.zones.apache.org/hudson/job/SpamAssassin-trunk/4344/testReport/
-score MISSING_HB_SEP 2.5
+# score MISSING_HB_SEP 2.5
+score MISSING_HB_SEP 0 # n=0 n=1 n=2

-score X_MESSAGE_INFO 3.499 3.496 3.330 1.597
+score X_MESSAGE_INFO 0 # n=0 n=1 n=2 n=3

It appears that tests here are failing after commit because rules required by
this test were zeroed out. It seems these rules have almost zero hits in
masscheck. What should we do about this?



bugzilla-daemon at bugzilla

Nov 17, 2009, 3:03 AM

Post #142 of 165 (1710 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #164 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-17 03:03:22 UTC ---
> It appears that tests here are failing after commit because rules required by
> this test were zeroed out. It seems these rules have almost zero hits in
> masscheck. What should we do about this?

Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
for the test
Sending t/missing_hb_separator.t
Committed revision 881240.

I hope this is the right approach. An alternative would be to introduce
a file similar to t/data/01_test_rules.cf to hold score overrides, but
with a name like 51_test_rules.cf so that it sorts after 50_scores.cf.
Btw, is the 01_ in the name intentional, or could the existing file
just be renamed to something like 99_test_rules.cf ?
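
To illustrate why the numeric prefix matters, here is a small Python sketch. It assumes, as the paragraph above implies, that the .cf files are read in plain lexical order and that a later score for the same rule overrides an earlier one. The helper names and the dictionary layout are invented for the example.

```python
def load_order(filenames):
    """Return config file names in plain lexical order."""
    return sorted(filenames)


def effective_score(files, rule):
    """files: {filename: {rule: score}}; the last file in load order wins."""
    score = None
    for name in load_order(files):
        score = files[name].get(rule, score)
    return score


# With only 01_test_rules.cf, the zeroed score in 50_scores.cf wins.
files = {
    "50_scores.cf": {"MISSING_HB_SEP": 0.0},
    "01_test_rules.cf": {"MISSING_HB_SEP": 2.5},
}
before = effective_score(files, "MISSING_HB_SEP")

# A 51_ file sorts after 50_scores.cf, so its override takes effect.
files["51_test_rules.cf"] = {"MISSING_HB_SEP": 2.5}
after = effective_score(files, "MISSING_HB_SEP")
```

Under these assumptions an override placed in 01_test_rules.cf is silently undone by 50_scores.cf, which is why a 51_ or 99_ prefix would be needed.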



bugzilla-daemon at bugzilla

Nov 17, 2009, 3:18 AM

Post #143 of 165 (1709 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #165 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-17 03:18:15 UTC ---
(In reply to comment #161)
> -score RDNS_NONE 0.1
> -score RDNS_DYNAMIC 0.1
> +# score RDNS_NONE 0 1.1 0 0.7
> +# score RDNS_DYNAMIC 0 0.5 0 0.5

> Doesn't commented out mean 1 point?

It would mean 1 point, if there were no other score lines for these two rules:
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE 2.399 1.274 1.228 0.793
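
The point can be shown with a simplified Python model of score lines. This is a sketch, not SpamAssassin's parser: it only handles `score NAME v` and `score NAME v0 v1 v2 v3` (one value per score set 0..3), treats everything after `#` as a comment, and falls back to 1.0 for a rule with no score line at all. The function names are invented.

```python
DEFAULT_SCORE = 1.0  # simplified: score of a rule with no score line


def parse_scores(cf_text):
    """Map rule name -> list of four per-set scores; later lines win."""
    scores = {}
    for raw in cf_text.splitlines():
        parts = raw.split("#", 1)[0].split()   # strip comments
        if len(parts) < 3 or parts[0] != "score":
            continue
        values = [float(v) for v in parts[2:]]
        if len(values) == 1:
            values *= 4                        # one value applies to all sets
        if len(values) == 4:
            scores[parts[1]] = values
    return scores


def effective_score(scores, rule, score_set):
    return scores.get(rule, [DEFAULT_SCORE] * 4)[score_set]


CF = """\
# score RDNS_NONE 0 1.1 0 0.7
score RDNS_NONE 2.399 1.274 1.228 0.793
score MISSING_HB_SEP 2.5
"""
scores = parse_scores(CF)
```

Here the commented-out RDNS_NONE line is ignored, but the rule still has the other score line, so in set 3 it scores 0.793 rather than the 1.0 default.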

> These are supposed to be informational rules according to the comment.
> Is this supposed to become commented out?

See comments 116, 120, 124, 137 and 139.
I left it mutable, I think it still makes sense - it's kind of a poor man's
Botnet plugin.



bugzilla-daemon at bugzilla

Nov 17, 2009, 7:41 AM

Post #144 of 165 (1696 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #166 from Justin Mason <jm [at] jmason> 2009-11-17 07:41:11 UTC ---
(In reply to comment #164)
> > It appears that tests here are failing after commit because rules required by
> > this test were zeroed out. It seems these rules have almost zero hits in
> > masscheck. What should we do about this?
>
> Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
> for the test
> Sending t/missing_hb_separator.t
> Committed revision 881240.
>
> I hope this is the right approach. Alternative would be to introduce
> a file similar to t/data/01_test_rules.cf to hold score overrides, but
> with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> Btw, is the 01_ in the name intentional, or could the existing file
> just be renamed to something like 99_test_rules.cf ?

X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
mutable; I'd say lock to 2.5.

btw it is to be expected that with less mutability the scores become slightly
less optimal for the rescoring corpus; this always happens. If scores are
allowed to wander without locking down the "unsafe" rules, the GA will overfit
to the training data and produce great FP/FN figures, but scores that are risky
for "real world" usage.
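
As a toy illustration of what "locking" means for an optimizer (this is not the SpamAssassin GA, and mutate is a hypothetical helper), the Python sketch below perturbs only the mutable scores and leaves locked ones where they were set by hand. The starting values are taken from this thread; the step size and seed are arbitrary.

```python
import random


def mutate(scores, locked, rng, step=0.5):
    """Return a copy of `scores` with every rule not in `locked` perturbed."""
    out = {}
    for rule, value in scores.items():
        if rule in locked:
            out[rule] = value  # locked: the optimizer may not move it
        else:
            out[rule] = round(value + rng.uniform(-step, step), 3)
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = {"MISSING_HB_SEP": 2.5, "RDNS_DYNAMIC": 0.982, "RDNS_NONE": 0.793}
locked = {"MISSING_HB_SEP"}

for _ in range(100):
    scores = mutate(scores, locked, rng)

print(scores["MISSING_HB_SEP"])  # unchanged, however many iterations run
```

The fewer rules the optimizer is free to move, the less it can tailor the scores to the rescoring corpus, which is the trade-off described above.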



bugzilla-daemon at bugzilla

Nov 17, 2009, 7:56 AM

Post #145 of 165 (1693 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

AXB <alex.uribl [at] gmail> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alex.uribl [at] gmail

--- Comment #167 from AXB <alex.uribl [at] gmail> 2009-11-17 07:56:17 UTC ---
(In reply to comment #166)
> X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
> mutable; I'd say lock to 2.5.
>
> btw it is to be expected that with less mutability the scores become slightly
> less optimal for the rescoring corpus; this always happens. If scores are
> allowed to wander without locking down the "unsafe" rules, the GA will overfit
> to the training data and produce great FP/FN figures, but scores that are risky
> for "real world" usage.

locally, I've lowered the MISSING_HB_SEP score to 0.5

lottsa funky ERP stuff seems to have a talent to FP on it.
it's great for metas but usually triggers scores close to FP with the usual
suspects & their very ugly HTML formatting.
(sorry, cannot supply samples)

I'd say 2.5 is sorta high

Axb



bugzilla-daemon at bugzilla

Nov 20, 2009, 3:10 PM

Post #146 of 165 (1480 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #168 from Justin Mason <jm [at] jmason> 2009-11-20 15:10:05 UTC ---
(In reply to comment #167)
> locally, I've lowered the MISSING_HB_SEP score to 0.5
>
> lottsa funky ERP stuff seems to have a talent to FP on it.
> it's great for metas but usually triggers scores close to FP with the usual
> suspects & their very ugly HTML formatting.
> (sorry, cannot supply samples)
>
> I'd say 2.5 is sorta high

ok -- I was under the impression it was FP-free. 0.5 works for me in that
case.



bugzilla-daemon at bugzilla

Nov 23, 2009, 8:08 PM

Post #147 of 165 (1274 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #169 from Warren Togami <wtogami [at] redhat> 2009-11-23 20:08:06 UTC ---
spamassassin/trunk/rulesrc/10_force_active.cf

It seems this file needs to be updated after the rescoring. Should all the
rules in 50_scores.cf be listed in 10_force_active.cf?

Even the rules that are zeroed out in 50_scores.cf?



bugzilla-daemon at bugzilla

Nov 25, 2009, 3:39 PM

Post #148 of 165 (1132 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #170 from Warren Togami <wtogami [at] redhat> 2009-11-25 15:39:17 UTC ---
Created an attachment (id=4579)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4579)
patch for 10_force_active.cf

Nobody responded to the previous comment. I didn't know how this file was
generated before. I took 50_scores.cf and took all rule names that were not
commented out for this patch. Is this correct?
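
The extraction described above could look roughly like the Python sketch below. It is an assumption about the procedure, not the script that originally produced the file: it collects the rule name from every score line that is not commented out and returns the names sorted. The sample text is made up from rules mentioned in this thread, active_rule_names is a hypothetical helper, and the output format of 10_force_active.cf is deliberately not reproduced.

```python
def active_rule_names(cf_text):
    """Sorted, de-duplicated rule names from score lines that are not commented out."""
    names = set()
    for raw in cf_text.splitlines():
        # Drop comments first, so "# score ..." lines contribute nothing.
        parts = raw.split("#", 1)[0].split()
        if len(parts) >= 3 and parts[0] == "score":
            names.add(parts[1])
    return sorted(names)


# Made-up sample in the style of 50_scores.cf.
sample = """\
# score RDNS_NONE 0 1.1 0 0.7
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE 2.399 1.274 1.228 0.793
# score X_MESSAGE_INFO 0
score MISSING_HB_SEP 0.5
"""
for name in active_rule_names(sample):
    print(name)
```

Note that this keeps rules whose active score is zero; whether those belong in the list is exactly the open question raised in the previous comment.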



bugzilla-daemon at bugzilla

Nov 26, 2009, 11:16 AM

Post #149 of 165 (1096 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #171 from Mark Martinec <Mark.Martinec [at] ijs> 2009-11-26 11:16:06 UTC ---
>> spamassassin/trunk/rulesrc/10_force_active.cf
>> It seems this file needs to be updated after the rescoring.
>> Should all the rules in 50_scores.cf be listed in 10_force_active.cf?
>> Even the rules that are zeroed out in 50_scores.cf?
>
> Nobody responded to the previous comment.
> I didn't know how this file was generated before.

No idea, sorry. I haven't been around that long.

> I took 50_scores.cf and took all rule names that were not
> commented out for this patch. Is this correct?

Probably.


Btw, the:
prove xt/10_rule_test_suite.t
is failing for several rules. Can someone more familiar with rules
please check where the reported problems lie?



bugzilla-daemon at bugzilla

Nov 26, 2009, 5:24 PM

Post #150 of 165 (1085 views)
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #172 from Daryl C. W. O'Shea <spamassassin [at] dostech> 2009-11-26 17:24:49 UTC ---
Warren,

The file was originally used to list all *rules from sandboxes* that had scores
assigned by the GA so that they didn't get auto-demoted leaving a score line
but no rule.

I don't think its use has changed, but I'm not completely up-to-date on the
re-org of the rules source structure.

jm might have a script to generate the file... although it's been a long time.

