
Mailing List Archive: SpamAssassin: devel

[Bug 6155] generate new scores for 3.3.0 release

 

 



bugzilla-daemon at bugzilla

Oct 8, 2009, 10:37 AM

Post #51 of 165 (1945 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #86 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-08 10:37:23 PDT ---
> I guess we have no choice but to drop wt-en6 from the rescore GA.
> Should I drop it from nightly masscheck as well?

I can imagine such a problem could also affect other users, especially
those not running SpamAssassin close to their MTA. I guess we can keep
the wt-en6 corpus (and similar, if identified), but keep in mind that FP
hits on DKIM_ADSP_DISCARD (and possibly on some other rule if identified)
should be disregarded. I already removed the "DKIM_ADSP_DISCARD" hit
from my copy of wt-en6 log.
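
Removing one rule's hits from a masscheck log could be scripted roughly as follows. This is a minimal Python sketch, not the procedure actually used in this ticket. It assumes each log line has the layout `<flag> <score> <path> <comma-separated hits> [extra fields]`; the function name `strip_rule` and the sample line are made up for illustration.

```python
import re

# Assumed layout of one log line (an assumption, check it against real logs):
#   <flag> <score> <path> <RULE_A,RULE_B,...> [extra fields]
_LINE = re.compile(r"^(\S+\s+\S+\s+\S+\s+)(\S+)(.*)$", re.DOTALL)


def strip_rule(line, rule):
    """Return `line` with `rule` removed from its comma-separated hits field."""
    if line.startswith("#"):
        return line  # comment lines are passed through unchanged
    match = _LINE.match(line)
    if match is None:
        return line  # not in the assumed layout, leave it alone
    head, hits, tail = match.groups()
    # Exact name comparison, so T_DKIM_ADSP_DISCARD would not be touched.
    kept = [name for name in hits.split(",") if name != rule]
    return head + ",".join(kept) + tail


# Hypothetical sample line, for illustration only.
example = ".  1 /ham/msg1 DKIM_ADSP_DISCARD,DKIM_SIGNED,BAYES_00 time=1254873600\n"
print(strip_rule(example, "DKIM_ADSP_DISCARD"), end="")
```

The sketch leaves the logged score untouched, and a line whose only hit is the removed rule ends up with an empty hits field, so such lines would need separate handling.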

If it turns out the undesired mail modifications are more common
in submitted corpora, we could perhaps re-run the GA on a subset
of logs known not to be suffering from the problem, and just fetch
the DKIM_* scores from results as obtained from this run.

The release notes could then say that one should lower the DKIM_ADSP_*
scores on installations where it is known that mail is not reaching
SpamAssassin in its pristine form (as received by the MTA).

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


bugzilla-daemon at bugzilla

Oct 8, 2009, 1:51 PM

Post #52 of 165 (1942 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #87 from Warren Togami <wtogami [at] redhat> 2009-10-08 13:51:31 PDT ---
(In reply to comment #86)
> The release notes could then say that one should lower the DKIM_ADSP_*
> scores on installations where it is known that mail is not reaching
> SpamAssassin in its pristine form (as received by the MTA).

This case or old ham where the sender subsequently changed their DKIM policy is
only an issue for masscheck, not production scanning. Lowering the DKIM scores
makes no sense then?



bugzilla-daemon at bugzilla

Oct 9, 2009, 6:23 AM

Post #53 of 165 (1935 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #88 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-09 06:23:06 PDT ---
> > The release notes could then say that one should lower the DKIM_ADSP_*
> > scores on installations where it is known that mail is not reaching
> > SpamAssassin in its pristine form (as received by the MTA).
>
> This case or old ham where the sender subsequently changed their DKIM policy
> is only an issue for masscheck, not production scanning.

True for the case of old ham where the sender subsequently changed their DKIM
policy, or for the case of expired signatures - these are only an issue with
masscheck.

...but not the case of wt-en6, where mail is transformed by its path through
webmail. This is an issue both for masschecks, as well as for production runs.

> Lowering the DKIM scores makes no sense then?

If one knows that mail reaching SpamAssassin will be modified by his mail path,
then one must disable rules targeting mail forgery and depending on a pristine
mail, such as the DKIM_ADSP_DISCARD rule. Otherwise the rule would generate
FP score points for legitimate mail from domains publishing ADSP (explicitly
or through overrides).



bugzilla-daemon at bugzilla

Oct 9, 2009, 6:38 AM

Post #54 of 165 (1938 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #89 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-09 06:38:09 PDT ---
Created an attachment (id=4550)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550)
resulting 50_scores.cf from garescorer runs

Ok, here it is at last: the auto-generated 50_scores.cf from garescorer runs
on all four sets, with no hand-tweaking of results (yet) ... to give us
something to digest and comment on, which can serve as a first approximation.
Some values are surprising or plain wrong, I'll comment on some later.

I used the submitted logs (tweaked as per Comment 78), with all the recent
updates to them as posted so far in this ticket. I left the BAYES scores
fully floating. I fixed at zero the DCC_REPUT_* scores and JM_SOUGHT_FRAUD_*,
as was discussed previously (as can be seen by the end of the attached file).
Eventually these will need to be set to some manually determined score.



bugzilla-daemon at bugzilla

Oct 9, 2009, 6:49 AM

Post #55 of 165 (1934 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #90 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-09 06:49:27 PDT ---
To assess the quality and repeatability of results, here are the summaries
for all four score sets. Each pair consists of a normal run on 90% of the log
entries and a test run on the remaining 10%.

The most interesting figures are the FP and FN percentages, e.g. 0.028% and
0.961% in this clipping:
# False positives: 65 0.011% (0.028% of nonspam, 10580 weighted)
# False negatives: 3411 0.578% (0.961% of spam, 12054 weighted)
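
For readers who want to see where the two percentages on each line come from, here is a small Python sketch using the counts from the score set 3 "scores" summary in this message. The first figure is relative to all entries, the second to the ham (or spam) corpus only. The function name `rates` is an illustrative choice, not part of any SpamAssassin tool.

```python
def rates(ham_ok, spam_ok, fp, fn):
    """Compute FP/FN percentages relative to all mail and to each corpus."""
    total = ham_ok + spam_ok + fp + fn
    return {
        "fp_pct_of_all": 100.0 * fp / total,
        "fp_pct_of_ham": 100.0 * fp / (ham_ok + fp),
        "fn_pct_of_all": 100.0 * fn / total,
        "fn_pct_of_spam": 100.0 * fn / (spam_ok + fn),
    }


# Counts taken from the score set 3 summary (TOTAL: 589696).
result = rates(ham_ok=234630, spam_ok=351590, fp=65, fn=3411)
for name, value in result.items():
    print(f"{name}: {value:.3f}%")
# fp_pct_of_all: 0.011%
# fp_pct_of_ham: 0.028%
# fn_pct_of_all: 0.578%
# fn_pct_of_spam: 0.961%
```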


==========================================
gen-set0-5-5.0-25000-ga
SCORESET 0: (no net, no bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 45335 98.03%
# Correctly spam: 39320 81.61%
# False positives: 913 1.97%
# False negatives: 8860 18.39%
# TCR(l=50): 0.883875 SpamRecall: 81.611% SpamPrec: 97.731%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 365397 48.193% (98.401% of non-spam corpus)
# Correctly spam: 314466 41.476% (81.286% of spam corpus)
# False positives: 5936 0.783% (1.599% of nonspam, 173347 weighted)
# False negatives: 72396 9.548% (18.714% of spam, 226867 weighted)
# Average score for spam: 10.0 nonspam: 1.4
# Average for false-pos: 5.6 false-neg: 3.1
# TOTAL: 758195 100.00%

==========================================
gen-set1-10-5.0-30000-ga
SCORESET 1: (net, no bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 46183 99.86%
# Correctly spam: 46648 96.82%
# False positives: 65 0.14%
# False negatives: 1532 3.18%
# TCR(l=50): 10.075282 SpamRecall: 96.820% SpamPrec: 99.861%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 370804 48.906% (99.858% of non-spam corpus)
# Correctly spam: 374579 49.404% (96.825% of spam corpus)
# False positives: 529 0.070% (0.142% of nonspam, 31804 weighted)
# False negatives: 12283 1.620% (3.175% of spam, 39385 weighted)
# Average score for spam: 17.4 nonspam: 0.4
# Average for false-pos: 5.8 false-neg: 3.2
# TOTAL: 758195 100.00%


==========================================
gen-set2-10-5.0-30000-ga
SCORESET 2: (no net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 29308 99.78%
# Correctly spam: 42344 95.69%
# False positives: 64 0.22%
# False negatives: 1907 4.31%
# TCR(l=50): 8.664774 SpamRecall: 95.690% SpamPrec: 99.849%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234375 39.745% (99.864% of non-spam corpus)
# Correctly spam: 339736 57.612% (95.700% of spam corpus)
# False positives: 320 0.054% (0.136% of nonspam, 26164 weighted)
# False negatives: 15265 2.589% (4.300% of spam, 58794 weighted)
# Average score for spam: 10.4 nonspam: 0.6
# Average for false-pos: 5.4 false-neg: 3.9
# TOTAL: 589696 100.00%


==========================================
gen-set3-20-5.0-20000-ga
SCORESET 3: (net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 29342 99.90%
# Correctly spam: 43843 99.08%
# False positives: 30 0.10%
# False negatives: 408 0.92%
# TCR(l=50): 23.192348 SpamRecall: 99.078% SpamPrec: 99.932%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234630 39.788% (99.972% of non-spam corpus)
# Correctly spam: 351590 59.622% (99.039% of spam corpus)
# False positives: 65 0.011% (0.028% of nonspam, 10580 weighted)
# False negatives: 3411 0.578% (0.961% of spam, 12054 weighted)
# Average score for spam: 18.5 nonspam: -0.1
# Average for false-pos: 5.4 false-neg: 3.5
# TOTAL: 589696 100.00%
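
The TCR, SpamRecall and SpamPrec figures in the test summaries can be reproduced from the four counts. The Python sketch below uses the usual total cost ratio definition, spam count divided by (lambda * FP + FN) with lambda = 50. That this is the formula the scoring scripts use is an assumption, though it matches the numbers reported above. The function name `summary_metrics` is illustrative.

```python
def summary_metrics(ham_ok, spam_ok, fp, fn, lam=50):
    """Return (TCR, spam recall, spam precision) from classification counts.

    ham_ok is accepted for symmetry with the summary lines; none of the
    three metrics depends on it.
    """
    n_spam = spam_ok + fn
    tcr = n_spam / (lam * fp + fn)        # total cost ratio, one FP costs lam FNs
    recall = spam_ok / n_spam             # share of spam that was caught
    precision = spam_ok / (spam_ok + fp)  # share of flagged mail that is spam
    return tcr, recall, precision


# Score set 3, test run (10%): 29342 / 43843 / 30 / 408
tcr, recall, precision = summary_metrics(29342, 43843, 30, 408)
print(f"TCR(l=50): {tcr:.6f} SpamRecall: {100 * recall:.3f}% SpamPrec: {100 * precision:.3f}%")
# TCR(l=50): 23.192348 SpamRecall: 99.078% SpamPrec: 99.932%
```

With lambda = 50 a single false positive outweighs fifty missed spams, which is why score set 0, with 913 FPs in the test run, ends up with a TCR below 1.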



bugzilla-daemon at bugzilla

Oct 9, 2009, 6:53 AM

Post #56 of 165 (1937 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #91 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-09 06:53:51 PDT ---
As can be seen from above, the scoreset 0 (no net tests, no bayes) is pretty
much useless nowadays. The scoresets 1 and 2 come close, i.e. net tests are
worth about as much as bayes. Of course the combination of all (set3) is an
outstanding winner.



bugzilla-daemon at bugzilla

Oct 9, 2009, 8:22 PM

Post #57 of 165 (1927 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #92 from Warren Togami <wtogami [at] redhat> 2009-10-09 20:22:24 UTC ---
(In reply to comment #89)
> Created an attachment (id=4550)
> --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550)
> resulting 50_scores.cf from garescorer runs
>
> Ok, here it is as last, the auto-generated 50_scores.cf from garescorer runs
> on all four sets, with no hand-tweaking of results (yet) ... to give us
> something to digest and comment on, and can serve as the first approximation.
> Some values are surprising or plain wrong, I'll comment on some later.

Bug #6156 RCVD_IN_PSBL
We should manually adjust this score to somewhere between 2.0 and 2.5, for
these reasons:

* The rescore masschecks were run with deep parsing. We have subsequently
changed it to lastexternal, which should be much safer. Even with deep parsing
it proved to be very good.
* At the time of the rescore masschecks, PSBL's recent whitelist filtering of
gmail, yahoo, rr.com and several other major ISPs had not yet timed out
legitimate MTAs. Safety should be improved further now.



bugzilla-daemon at bugzilla

Oct 11, 2009, 12:01 AM

Post #58 of 165 (1909 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #93 from Warren Togami <wtogami [at] redhat> 2009-10-11 00:01:01 UTC ---
Bad news. Please remove the binnocenti logs from the rescore masschecks.
Working with him we discovered 50+ additional spam messages in his ham folders,
and there are certainly more. Furthermore, his ham contains lots of automated
low-quality sources like Bugzilla, trac, cron and log monitoring daemons that
should probably be removed from ham corpora. It seems the incorrect FPs and
bias introduced by this corpus can be large enough to throw off scoring.

Did you also remove wt-en6 after we discovered that copying mail from a Yahoo
account corrupts the messages?



bugzilla-daemon at bugzilla

Oct 11, 2009, 2:19 AM

Post #59 of 165 (1914 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Matthias Leisi <matthias [at] leisi> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |matthias [at] leisi

--- Comment #94 from Matthias Leisi <matthias [at] leisi> 2009-10-11 02:19:21 UTC ---
(In reply to comment #56)
> Here is a set of rules in 50_scores.cf that I ended up as fixed (immutable)
> for the GA run (score set 3). Most of these are already documented and labeled
> as such, but it doesn't hurt to post it here as a double-check.

I suspect that RCVD_IN_DNSWL_* should be immutable as well; in generated
scores, there are counter-intuitive scores assigned (expected _HI < _MED <
_LOW, observed _MED << _HI < _LOW).

https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf has the
following outside the "gen:mutable" section:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8

The DNSWL stats posted by Warren to the users list seem to indicate that this
should be the correct ordering (at least based on safety):

| SPAM%    HAM%     RANK  RULE
| 0.0016%  4.2489%  0.91  RCVD_IN_DNSWL_HI
| 0.0281%  6.9639%  0.90  RCVD_IN_DNSWL_MED
| 0.1147%  3.9169%  0.81  RCVD_IN_DNSWL_LOW



bugzilla-daemon at bugzilla

Oct 11, 2009, 7:03 AM

Post #60 of 165 (1921 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #95 from Warren Togami <wtogami [at] redhat> 2009-10-11 07:03:21 UTC ---
(In reply to comment #94)
> The DNSWL stats posted by Warren to the users list seem to indicate that this
> should be the correct ordering (at least based on safety):
>
> | SPAM% HAM% RANK RULE
> | 0.0016% 4.2489% 0.91 RCVD_IN_DNSWL_HI
> | 0.0281% 6.9639% 0.90 RCVD_IN_DNSWL_MED
> | 0.1147% 3.9169% 0.81 RCVD_IN_DNSWL_LOW

These were yesterday's weekly results, not the rescore masscheck. Weekly
results are a smaller sample size and lower confidence.

http://ruleqa.spamassassin.org/20090930-r808953-n

SPAM%    HAM%      RANK  RULE
0.0002%   0.3651%  0.75  RCVD_IN_DNSWL_HI
0.0288%  18.6970%  0.79  RCVD_IN_DNSWL_MED
0.0753%   8.1433%  0.68  RCVD_IN_DNSWL_LOW

This was the rescore masscheck.



bugzilla-daemon at bugzilla

Oct 14, 2009, 4:21 PM

Post #61 of 165 (1805 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec <Mark.Martinec [at] ijs> changed:

What |Removed |Added
----------------------------------------------------------------------------
Attachment #4550|0 |1
is obsolete| |

--- Comment #96 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-14 16:21:44 UTC ---
Created an attachment (id=4553)
--> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553)
resulting 50_scores.cf from garescorer runs - V2

Here is now a 50_scores.cf from my second attempt after cleaning some
logs: removed binnocenti and wt-en6 logs as per Comment 93, removed
DKIM_ADSP_DISCARD hits from ham-bayes-net-bluestreak.log. I have also
limited the log entries to fewer months following the RescoreMassCheck
(wiki): -m 6 for spam, and -m 25 for ham (after 25th month there is a
large gap in data till the next peak, too far in the past).

This leaves us with the following number of entries in merged logs:
score set 1 (no data from score set 3), provides data for set0 and set1:
360070 ham-full-set1.log
472682 spam-full-set1.log
score set 3, provides data for set2 and set3:
210603 ham-full-set3.log
442709 spam-full-set3.log

For DCC_ rules, I took the DCC_CHECK value of 1.1 from a preliminary run
which had all the DCC_REPUT_* scores fixed at 0, then for the next run
I fixed the DCC_CHECK, but left the DCC_REPUT_* scores floating. This
should cope with both types of sites: those with a commercial license
that do receive reputation results from DCC servers, and those who don't.



bugzilla-daemon at bugzilla

Oct 14, 2009, 4:29 PM

Post #62 of 165 (1819 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #97 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-14 16:29:29 UTC ---
gen-set0-5-5.0-10000-ga
test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam: 35461 98.50%
# Correctly spam: 38357 81.35%
# False positives: 541 1.50%
# False negatives: 8794 18.65%
# TCR(l=50): 1.315450 SpamRecall: 81.349% SpamPrec: 98.609%
scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 283119 42.494% (98.304% of non-spam corpus)
# Correctly spam: 306367 45.984% (80.997% of spam corpus)
# False positives: 4886 0.733% (1.696% of nonspam, 179777 weighted)
# False negatives: 71879 10.789% (19.003% of spam, 231331 weighted)
# Average score for spam: 10.4 nonspam: 1.7
# Average for false-pos: 5.6 false-neg: 3.2
# TOTAL: 666251 100.00%

gen-set1-10-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 35942 99.83%
# Correctly spam: 45983 97.52%
# False positives: 60 0.17%
# False negatives: 1168 2.48%
# TCR(l=50): 11.312620 SpamRecall: 97.523% SpamPrec: 99.870%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287639 43.173% (99.873% of non-spam corpus)
# Correctly spam: 368783 55.352% (97.498% of spam corpus)
# False positives: 366 0.055% (0.127% of nonspam, 27040 weighted)
# False negatives: 9463 1.420% (2.502% of spam, 29645 weighted)
# Average score for spam: 20.3 nonspam: 0.2
# Average for false-pos: 5.6 false-neg: 3.1
# TOTAL: 666251 100.00%

gen-set2-10-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 35949 99.85%
# Correctly spam: 44538 94.46%
# False positives: 53 0.15%
# False negatives: 2613 5.54%
# TCR(l=50): 8.958959 SpamRecall: 94.458% SpamPrec: 99.881%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287557 43.160% (99.844% of non-spam corpus)
# Correctly spam: 357656 53.682% (94.556% of spam corpus)
# False positives: 448 0.067% (0.156% of nonspam, 33456 weighted)
# False negatives: 20590 3.090% (5.444% of spam, 73371 weighted)
# Average score for spam: 12.3 nonspam: 0.8
# Average for false-pos: 5.7 false-neg: 3.6
# TOTAL: 666251 100.00%

gen-set3-20-5.0-10000-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 21173 99.92%
# Correctly spam: 43749 99.08%
# False positives: 17 0.08%
# False negatives: 404 0.92%
# TCR(l=50): 35.209729 SpamRecall: 99.085% SpamPrec: 99.961%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168159 32.186% (99.976% of non-spam corpus)
# Correctly spam: 350875 67.159% (99.046% of spam corpus)
# False positives: 40 0.008% (0.024% of nonspam, 9039 weighted)
# False negatives: 3379 0.647% (0.954% of spam, 11476 weighted)
# Average score for spam: 19.3 nonspam: -0.8
# Average for false-pos: 5.4 false-neg: 3.4
# TOTAL: 522453 100.00%

===========
In summary, the essential data:

score set 0 (no net, no bayes):
# False positives: 4886 0.733% (1.696% of nonspam, 179777 weighted)
# False negatives: 71879 10.789% (19.003% of spam, 231331 weighted)

score set 1 (net, no bayes):
# False positives: 366 0.055% (0.127% of nonspam, 27040 weighted)
# False negatives: 9463 1.420% (2.502% of spam, 29645 weighted)

score set 2 (no net, bayes):
# False positives: 448 0.067% (0.156% of nonspam, 33456 weighted)
# False negatives: 20590 3.090% (5.444% of spam, 73371 weighted)

score set 3 (net, bayes):
# False positives: 40 0.008% (0.024% of nonspam, 9039 weighted)
# False negatives: 3379 0.647% (0.954% of spam, 11476 weighted)



bugzilla-daemon at bugzilla

Oct 14, 2009, 4:48 PM

Post #63 of 165 (1811 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #98 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-14 16:48:26 UTC ---
The RCVD_IN_DNSWL_* scores are again unusual:
score RCVD_IN_DNSWL_HI   0 -0.466 0 -0.001
score RCVD_IN_DNSWL_LOW  0 -0.292 0 -0.760
score RCVD_IN_DNSWL_MED  0 -1.703 0 -0.727

probably because of their low frequency, especially the _HI rule:
OVERALL  SPAM%    HAM%     S/O    RANK  SCORE  NAME
  0.184  0.0007   0.5707  0.001   0.76  -1.00  RCVD_IN_DNSWL_HI
  7.408  0.1096  22.7509  0.005   0.67  -1.00  RCVD_IN_DNSWL_MED
  2.553  0.1816   7.5365  0.024   0.59  -1.00  RCVD_IN_DNSWL_LOW

and resulting zero ranges (tmp/ranges.data):
0.000 0.000 0 RCVD_IN_DNSWL_HI
0.000 0.000 0 RCVD_IN_DNSWL_MED
0.000 0.000 0 RCVD_IN_DNSWL_LOW

Don't know what a clean solution is, apart from fixing their scores
manually.
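
To make the "unusual" part concrete, the generated values can be checked against the ordering expected in Comment 94 (_HI < _MED < _LOW, i.e. _HI most negative). The Python sketch below is illustrative only; the helper name `ordering_violations` is made up, and the values are the score set 1 and score set 3 columns of the three score lines above.

```python
def ordering_violations(scores, expected_order):
    """Return adjacent (a, b) pairs where scores[a] < scores[b] does not hold."""
    bad = []
    for a, b in zip(expected_order, expected_order[1:]):
        if not scores[a] < scores[b]:
            bad.append((a, b))
    return bad


ORDER = ["RCVD_IN_DNSWL_HI", "RCVD_IN_DNSWL_MED", "RCVD_IN_DNSWL_LOW"]

# Second and fourth columns of the generated score lines (sets 1 and 3).
set1 = {"RCVD_IN_DNSWL_HI": -0.466, "RCVD_IN_DNSWL_MED": -1.703, "RCVD_IN_DNSWL_LOW": -0.292}
set3 = {"RCVD_IN_DNSWL_HI": -0.001, "RCVD_IN_DNSWL_MED": -0.727, "RCVD_IN_DNSWL_LOW": -0.760}

print("set1:", ordering_violations(set1, ORDER))
print("set3:", ordering_violations(set3, ORDER))
```

In score set 1 only the _HI/_MED pair is out of order, while in score set 3 both adjacent pairs are, which supports fixing the scores by hand.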



bugzilla-daemon at bugzilla

Oct 14, 2009, 9:59 PM

Post #64 of 165 (1803 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #99 from Warren Togami <wtogami [at] redhat> 2009-10-14 21:58:58 UTC ---
I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL,
RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration
on my server. My users delivering mail directly to other users on my server
from their home ISP or mobile phone were lacking "authenticated user" within
the Received header causing many hits on these and unknown other rules.
Roughly 150-170 of my FPs on these three rules should not count against those
rules. Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL hits should have been
AllTrusted instead. Is this enough to throw off the GA scoring?



bugzilla-daemon at bugzilla

Oct 15, 2009, 11:56 AM

Post #65 of 165 (1789 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #100 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-15 11:56:23 UTC ---
Btw, I added a "Target Milestone" 3.3.1, so that a triage on 3.3.0 bugs
may be more selective, choosing between Future/Undefined/3.3.1



bugzilla-daemon at bugzilla

Oct 19, 2009, 7:54 AM

Post #66 of 165 (1695 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #101 from Justin Mason <jm [at] jmason> 2009-10-19 07:53:59 UTC ---
(In reply to comment #98)
> The RCVD_IN_DNSWL_* scores are again unusual:
> score RCVD_IN_DNSWL_HI 0 -0.466 0 -0.001
> score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
> score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727
>
> probably because of their low frequency, especially the _HI rule:
> OVERALL SPAM% HAM% S/O RANK SCORE NAME
> 0.184 0.0007 0.5707 0.001 0.76 -1.00 RCVD_IN_DNSWL_HI
> 7.408 0.1096 22.7509 0.005 0.67 -1.00 RCVD_IN_DNSWL_MED
> 2.553 0.1816 7.5365 0.024 0.59 -1.00 RCVD_IN_DNSWL_LOW
>
> and resulting zero ranges (tmp/ranges.data):
> 0.000 0.000 0 RCVD_IN_DNSWL_HI
> 0.000 0.000 0 RCVD_IN_DNSWL_MED
> 0.000 0.000 0 RCVD_IN_DNSWL_LOW
>
> Don't know what a clean solution is, apart from fixing their scores
> manually.

feel free to fix them; it's hard for the GA to be mostly right about network
rules. tbh I'm surprised the ranges were zeroed (for _MED at least).



bugzilla-daemon at bugzilla

Oct 19, 2009, 7:55 AM

Post #67 of 165 (1697 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #102 from Justin Mason <jm [at] jmason> 2009-10-19 07:55:57 UTC ---
(In reply to comment #99)
> I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL,
> RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration
> on my server. My users delivering mail directly to other users on my server
> from their home ISP or mobile phone were lacking "authenticated user" within
> the Received header causing many hits on these and unknown other rules.
> Roughly ~150-170 of my FP's on these three rules should not count against those
> rules. Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL should have been
> AllTrusted instead. Is this enough to throw off the GA scoring?

if you want, feel free to sed the log files to fix this, or just remove the
lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo.



bugzilla-daemon at bugzilla

Oct 19, 2009, 10:31 AM

Post #68 of 165 (1694 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #103 from Warren Togami <wtogami [at] redhat> 2009-10-19 10:31:26 UTC ---
> if you want, feel free to sed the log files to fix this, or just remove the
> lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo.

Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log.

I also zeroed out *wt-en6.log because they were found to be too corrupted to
trust the results.



bugzilla-daemon at bugzilla

Oct 19, 2009, 11:28 AM

Post #69 of 165 (1691 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #104 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-19 11:28:49 UTC ---
(In reply to comment #103)
> Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log.
> I also zeroed out *wt-en6.log because they were found to be too corrupted to
> trust the results.

Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
them in the 'submit' directory using existing names, otherwise in a few weeks'
time we'll all forget which file came from where - after all, the 'submit'
directory is the official source for rescoring runs.



bugzilla-daemon at bugzilla

Oct 19, 2009, 12:22 PM

Post #70 of 165 (1686 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #105 from Karsten Bräckelmann <guenther [at] rudersport> 2009-10-19 12:21:56 UTC ---
Argh, late to the show, sorry. :-/ From the second GA re-score run, attachment
4553 (aligned for readability):

score KB_RATWARE_MSGID 4.099 3.315 4.095 1.475

This is awesome! :) Though unrelated, so let me move on to the issue.


score KB_RATWARE_OUTLOOK_08   1.100 3.232 0.776 0.025
score KB_RATWARE_OUTLOOK_12   2.734 2.826 1.654 0.041
score KB_RATWARE_OUTLOOK_16   1.725 3.331 2.532 0.887
score KB_RATWARE_OUTLOOK_MID  2.259 2.485 3.121 0.001

This is also awesome -- kind of. But frankly, it also is a total mess. They are
essentially the same, just slightly differing in strictness or fuzziness. They
are almost *exactly* overlapping -- *all* four of them (see ruleqa).

These rules are really redundant, and there should be only one instead. FWIW,
that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
This rule seems to be missing entirely, though. :(

Looking at the scores, I don't think simply adding them would do.

Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0!
(Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
not...



bugzilla-daemon at bugzilla

Oct 19, 2009, 12:35 PM

Post #71 of 165 (1687 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #106 from Warren Togami <wtogami [at] redhat> 2009-10-19 12:35:30 UTC ---
> Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
> them in the 'submit' directory using existing names, otherwise in few weeks
> time we'll all forget which file came from where - after all, the 'submit'
> directory is the official source for rescoring runs.

Fixed in 'submit'.



bugzilla-daemon at bugzilla

Oct 19, 2009, 2:26 PM

Post #72 of 165 (1684 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #107 from Justin Mason <jm [at] jmason> 2009-10-19 14:26:25 UTC ---
(In reply to comment #105)
> score KB_RATWARE_OUTLOOK_08 1.100 3.232 0.776 0.025
> score KB_RATWARE_OUTLOOK_12 2.734 2.826 1.654 0.041
> score KB_RATWARE_OUTLOOK_16 1.725 3.331 2.532 0.887
> score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001
>
> This is also awesome -- kind of. But frankly, it also is a total mess. They are
> essentially the same, just slightly differing in strictness or fuzziness. They
> are almost *exactly* overlapping -- *all* four of them (see ruleqa).
>
> These rules are really redundant, and there should be only one instead. FWIW,
> that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
> This rule seems to be missing entirely, though. :(
>
> Looking at the scores, I don't think simply adding them would do.
>
> Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0!
> (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
> not...

it looks like they overlap a lot with some other rules. But yes, if they were
just 1 rule, it probably would have gotten a better single score.

I'm not sure if it's too late to fix this or not. :(



bugzilla-daemon at bugzilla

Oct 19, 2009, 2:49 PM

Post #73 of 165 (1689 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #108 from Karsten Bräckelmann <guenther [at] rudersport> 2009-10-19 14:49:17 UTC ---
(In reply to comment #107)
> it looks like they overlap a lot with some other rules. But yes, if they were
> just 1 rule, it probably would have gotten a better single score.
>
> I'm not sure if it's too late to fix this or not. :(

Frankly, pretty much any one of them could be used and all other variants simply be
dropped for the next re-score run. Keeping all of them is just a waste of
cycles.

The important questions are, where is KB_RATWARE_BOUNDARY, which was
specifically pushed right before the deadline to supersede these?

And of course, why do the scores drop that drastically with score-set 3, if
there is *no* FP? Regardless of the spam already scoring above 5, there is no
FP reason to lower the score.



bugzilla-daemon at bugzilla

Oct 19, 2009, 3:37 PM

Post #74 of 165 (1701 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #109 from Karsten Bräckelmann <guenther [at] rudersport> 2009-10-19 15:37:16 UTC ---
(In reply to comment #108)
> The important questions are, where is KB_RATWARE_BOUNDARY, which was
> specifically pushed right before the deadline to supersede these?

Argh! It is in freqs.full, attachment 4541. However, it appears we've been
using inconsistent rule-sets, with most contributors using one outdated
rule-set or the other. :-(

10.830 14.1437 0.1901 0.987 0.67 0.00 T_KB_RATWARE_BOUNDARY
0.025 0.0327 0.0000 1.000 0.65 1.00 KB_RATWARE_BOUNDARY
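
The S/O column in these freqs lines appears to be the spam share of a rule's hits, computed from the SPAM% and HAM% columns. The Python sketch below rests on that reading, which is an assumption, although it reproduces the S/O values of both lines above. The function name `s_over_o` is illustrative.

```python
def s_over_o(spam_pct, ham_pct):
    """Spam share of a rule's hits, from its SPAM% and HAM% hit rates.

    Returns None when the rule hit nothing at all.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


# SPAM% and HAM% columns of the two freqs lines above.
for name, spam_pct, ham_pct in [
    ("T_KB_RATWARE_BOUNDARY", 14.1437, 0.1901),
    ("KB_RATWARE_BOUNDARY", 0.0327, 0.0000),
]:
    print(f"{s_over_o(spam_pct, ham_pct):.3f} {name}")
# 0.987 T_KB_RATWARE_BOUNDARY
# 1.000 KB_RATWARE_BOUNDARY
```

An S/O of 1.000 on the non-T_ rule only says that its few hits were all spam; with 0.0327% of spam it was evidently present in very few contributors' rule sets.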



bugzilla-daemon at bugzilla

Oct 20, 2009, 3:47 AM

Post #75 of 165 (1651 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #110 from Justin Mason <jm [at] jmason> 2009-10-20 03:46:49 UTC ---
(In reply to comment #109)
> (In reply to comment #108)
> > The important questions are, where is KB_RATWARE_BOUNDARY, which was
> > specifically pushed right before the deadline to supersede these?
>
> Argh! It is in freqs.full, attachment 4541 [details]. However, it appears we've been
> using inconsistent rule-sets, with most contributors using one outdated
> rule-set or the other. :-(
>
> 10.830 14.1437 0.1901 0.987 0.67 0.00 T_KB_RATWARE_BOUNDARY
> 0.025 0.0327 0.0000 1.000 0.65 1.00 KB_RATWARE_BOUNDARY

mysterious:

: exit=[130] uid=jm Tue Oct 20 10:40:30 GMT 2009; cd /export/home/corpus-rsync/corpus/submit
: 6...; grep KB_RATWARE_BOUNDARY *.log | grep -v T_KB_RATWARE_BOUNDARY
: exit=[0 1] uid=jm Tue Oct 20 10:43:41 GMT 2009; cd /export/home/corpus-rsync/corpus/submit
I can't find any non-T_ hits in the submit logs. Mark?
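
The same search can be expressed as an exact rule-name match, sketched here in Python. It differs from the grep pipeline in one corner case: `grep -v` drops every line containing the T_ name, so a line carrying both names would in principle be missed. The function name `has_exact_hit` and the sample lines are made up for illustration.

```python
import re


def has_exact_hit(line, rule):
    """True if `rule` occurs in `line` as a whole rule name.

    A hit on T_<rule>, or on any longer name containing <rule>, does not count.
    """
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(rule) + r"(?![A-Za-z0-9_])"
    return re.search(pattern, line) is not None


# Hypothetical log lines, for illustration only.
lines = [
    "Y 12 /spam/a T_KB_RATWARE_BOUNDARY,BAYES_99",
    "Y 14 /spam/b KB_RATWARE_BOUNDARY,BAYES_99",
    "Y 15 /spam/c KB_RATWARE_BOUNDARY,T_KB_RATWARE_BOUNDARY",
]
for line in lines:
    print(has_exact_hit(line, "KB_RATWARE_BOUNDARY"), line)
```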

