Mailing List Archive: SpamAssassin: devel

[Bug 6155] generate new scores for 3.3.0 release

 

 



bugzilla-daemon at bugzilla

Sep 30, 2009, 9:03 AM

Post #26 of 165 (2125 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

John Hardin <jhardin [at] impsec> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |jhardin [at] impsec

--- Comment #61 from John Hardin <jhardin [at] impsec> 2009-09-30 09:03:42 PDT ---
(In reply to comment #60)
>
> I've also copied the current set of logs to ruleqa ...
>
> : 60...; wc -l submit/spam-*.log
> 1418 submit/spam-bayes-net-bb-guenther_fraud.log
> 1846 submit/spam-bayes-net-bb-jhardin.log
> 2200 submit/spam-bayes-net-bb-kmcgrail.log
>
> : 61...; wc -l submit/ham-*.log
> 9 submit/ham-bayes-net-bb-guenther_fraud.log
> 4307 submit/ham-bayes-net-bb-jhardin.log
> 6 submit/ham-bayes-net-bb-kmcgrail.log

There should also be jhardin_fraud logs, should there not? I _am_ submitting
daily corpora updates for sought_fraud, and those should be included just as
guenther's are...

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


bugzilla-daemon at bugzilla

Sep 30, 2009, 12:11 PM

Post #27 of 165 (2119 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #62 from Karsten Bräckelmann <guenther [at] rudersport> 2009-09-30 12:11:23 PDT ---
(In reply to comment #60)
> 9 submit/ham-bayes-net-bb-guenther_fraud.log
^^^
Please do *not* include my fraud ham corpus. It exclusively contains fake,
artificial messages to exclude some German [1] from the fraud spam corpus. No
real ham there.

My spam corpus of course is fine to include.

[1] Short, broken German paragraphs along the lines of "you may write in
German, too", in an otherwise entirely English spam.



bugzilla-daemon at bugzilla

Sep 30, 2009, 2:35 PM

Post #28 of 165 (2124 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #63 from John Hardin <jhardin [at] impsec> 2009-09-30 14:35:22 PDT ---
(In reply to comment #62)
> (In reply to comment #60)
> > 9 submit/ham-bayes-net-bb-guenther_fraud.log
> ^^^
> Please do *not* include my fraud ham corpus. It exclusively contains fake,
> artificial messages to exclude some German [1] from the fraud spam corpus.

Same goes for my fraud ham corpus, except s/German/English/ (primarily free
mail adverts and legal disclaimers).



bugzilla-daemon at bugzilla

Sep 30, 2009, 2:44 PM

Post #29 of 165 (2138 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #64 from Warren Togami <wtogami [at] redhat> 2009-09-30 14:44:35 PDT ---
http://ruleqa.spamassassin.org/20090930-r808953-n/RCVD_IN_PSBL/detail
It looks like all the ham is visible in the ruleqa, but only 86390 spam?



bugzilla-daemon at bugzilla

Sep 30, 2009, 3:16 PM

Post #30 of 165 (2140 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #65 from Justin Mason <jm [at] jmason> 2009-09-30 15:16:17 PDT ---
yep, that's not right :( I've deleted the files, let's see if the backend
rebuilds them correctly using all logs this time.



bugzilla-daemon at bugzilla

Sep 30, 2009, 6:53 PM

Post #31 of 165 (2120 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #66 from Daryl C. W. O'Shea <spamassassin [at] dostech> 2009-09-30 18:53:50 PDT ---
(In reply to comment #60)
> we could probably skip some of the spam.

If you feel that it's detrimental to include that much, sure. I'd start with
dropping from your corpus and mine. I've got spam up to 60 days old in my
corpus. I'd include everyone else's spam and thin ours out, rather than use a
straight drop-by-date method.

If it's solely a processing time concern, I'd say it's a non-issue as the GA
doesn't take that long to run. I know the nightly ones (about half as much
mail) take around 30 minutes on the ancient machine I've got it running on.



bugzilla-daemon at bugzilla

Sep 30, 2009, 7:04 PM

Post #32 of 165 (2120 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #67 from Warren Togami <wtogami [at] redhat> 2009-09-30 19:04:34 PDT ---
http://ruleqa.spamassassin.org/20090930-r808953-n
Was that re-run? The same total number of spam: 86390



bugzilla-daemon at bugzilla

Oct 1, 2009, 5:50 AM

Post #33 of 165 (2095 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #68 from Justin Mason <jm [at] jmason> 2009-10-01 05:50:15 PDT ---
(In reply to comment #67)
> http://ruleqa.spamassassin.org/20090930-r808953-n
> Was that re-run? The same total number of spam: 86390

it took a little time, but it appears to have corrected itself now. I think
there's a race condition to do with the way logs are rsynced from
spamassassin.zones to spamassassin2.zones. :(



bugzilla-daemon at bugzilla

Oct 5, 2009, 8:00 PM

Post #34 of 165 (2054 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #69 from Warren Togami <wtogami [at] redhat> 2009-10-05 20:00:00 PDT ---
Hey Mark, is the GA run happening while jm is away?



bugzilla-daemon at bugzilla

Oct 6, 2009, 3:46 AM

Post #35 of 165 (2040 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #70 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-06 03:46:36 PDT ---
> Hey Mark, is the GA run happening while jm is away?

Yes, it is underway just now. I needed to figure out how to set up the
mpich2 message-passing environment, but I think I have it working now.

I will be asking contributors to check some apparent FP and FN in their
logs soon...



bugzilla-daemon at bugzilla

Oct 6, 2009, 7:08 AM

Post #36 of 165 (2033 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #71 from Warren Togami <wtogami [at] redhat> 2009-10-06 07:08:46 PDT ---
> I will be asking contributors to check some apparent FP and FN in their
> logs soon...

The longer you wait, the more of the log IDs will no longer match the mailboxes.

BTW, did you do the things written in Comment #38?

So scoring PSBL might be more complicated than this.

* RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule. It
is valuable in measuring PSBL in masschecks.
* It seems that PSBL is not set to allow reuse?
* PSBL as measured in the rescore masscheck was deep parsing, while we
subsequently agreed to change it to lastexternal.

What should we do?



bugzilla-daemon at bugzilla

Oct 6, 2009, 12:33 PM

Post #37 of 165 (2037 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #72 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-06 12:33:09 PDT ---
> The longer you wait, the more of the log IDs will no longer match the mailboxes.

The messages whose results are submitted to rescoring are supposed to be
preserved, at least until the rescoring runs are done.

> BTW, did you do the things written in Comment #38?

Not yet, will do in my next iteration. It takes a couple of hours.
The JM_SOUGHT results I kept on purpose for now, wondering what their
scores would be. On the next round I can just force them to zero;
I believe this is equivalent to removing them from the logs.
In the first round I got:
score JM_SOUGHT_FRAUD_1 2.105
score JM_SOUGHT_FRAUD_2 2.318
score JM_SOUGHT_FRAUD_3 3.270

> So scoring PSBL might be more complicated than this.
>
> * RCVD_IN_PSBL_2WEEKS was never meant to be published as a run-time rule. It
> is valuable in measuring PSBL in masschecks.
> * It seems that PSBL is not set to allow reuse?
> * PSBL as measured in the rescore masscheck was deep parsing, while we
> subsequently agreed to change it to lastexternal.

I have now applied the translations from Comment #38 to RCVD_IN_PSBL*; they
will go into the next approximation.

> What should we do?

There seem to be some other rules in the works, so I'd say let's just finish
up whatever was frozen with the call for rescoring results, publish that as
beta-1, then examine what we got, polish it, and do another rescoring run
before the final release. It's not too bad to just fix some scores manually;
we're doing that for BAYES, SPF, etc. as well.

==========

Here is the first homework: the following were reported as false positives
in my last completed attempt. Please check whether these are really ham
messages (I already checked my two entries, and they are):

ham-bayes-net-hege.log
/data/sa/h/3/36f18b49dd8ce2ce70586c67eeb780fd
/data/sa/h/0/0270ee166042abd0aa94cbdda855400c
/data/sa/h/9/9eb11730050002add51ecdc6ed25343d
/data/sa/h/5/5dfa06864bb3021674768e8af372a6c9
/data/sa/h/4/4214ade1e7e177f0453c5f1cc98c8b42

ham-bayes-net-bluestreak.log
../../aaa_ham/2009-07_HAM_721117.0
../../aaa_ham/2009-06_HAM_602375.0
../../aaa_ham/2009-06_HAM_609153.0
../../aaa_ham/2009-06_HAM_623012.0
../../aaa_ham/2009-06_HAM_622736.0
../../aaa_ham/2009-08_HAM_814010.0

ham-bayes-net-dos.log
/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S
/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S
/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

ham-bayes-net-jm.log
/local/cor/recent/ham/priv.radish.jmason.org.200808310000.mbox.160968
/local/cor/recent/ham/priv.wall.200809081400.mbox.1677188
/local/cor/recent/ham/priv.20050914/126599

ham-bayes-net-mmartinec.log
ham/uYUQM2RmF9I0
ham/p+KSEyzZTPOw



bugzilla-daemon at bugzilla

Oct 6, 2009, 3:57 PM

Post #38 of 165 (2030 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #73 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-06 15:57:29 PDT ---
> Please check if these are really ham messages

and four more from the second run:

../../aaa_ham/2009-07_HAM_704334.0
../../aaa_ham/2009-08_HAM_810051.0
/local/cor/recent/ham/priv.20050914/137533
/home/dos/SA-corpus/ham/dos/Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.cyan.dostech.net,S=26243:2,S


Also, I find the scores on URIBL_(AB|JP|WS)_SURBL to be rather low compared
to my experience (e.g. one FP out of 39,000 on URIBL_WS_SURBL in my
ham-bayes-net-mmartinec.log), so my guess is that several of the
following hits could be false positives on these rules:

grep -c 'URIBL_WS_SURBL' ham-bayes-net-jm.log
178

grep -c 'URIBL_AB_SURBL' ham-bayes-net-jm.log
42

grep -c 'URIBL_JP_SURBL' ham-bayes-net-jm.log
29

grep -c 'URIBL_JP_SURBL' ham-bayes-net-bluestreak.log
28

egrep -c 'URIBL_(AB|JP|WS)_SURBL' ham-bayes-net-hege.log
7

grep -c 'URIBL_WS_SURBL' ham-bayes-net-dos.log
4
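
The same tallies can be produced in one pass over a log. This is only a minimal sketch: it assumes that each mass-check log line names the rules that hit somewhere on the line, and the helper name `count_rule_hits` and the sample lines are made up for illustration. Like `grep -c`, it counts matching lines rather than individual hits.

```python
import re

def count_rule_hits(lines, pattern):
    """Count lines that mention a rule matching `pattern`, like `grep -c`."""
    rule_re = re.compile(pattern)
    return sum(1 for line in lines if rule_re.search(line))

# Made-up sample lines, only loosely shaped like mass-check log entries.
sample = [
    ". 1 ham/msg1 URIBL_WS_SURBL,BAYES_00 time=1254000000,set=3",
    ". 0 ham/msg2 BAYES_00 time=1254000100,set=3",
    ". 2 ham/msg3 URIBL_AB_SURBL,URIBL_JP_SURBL time=1254000200,set=3",
]

print(count_rule_hits(sample, r"URIBL_(AB|JP|WS)_SURBL"))
```

A line that hits two of the SURBL rules is counted once, which matches what the `egrep -c` call above reports.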



bugzilla-daemon at bugzilla

Oct 6, 2009, 6:06 PM

Post #39 of 165 (2022 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #74 from Daryl C. W. O'Shea <spamassassin [at] dostech> 2009-10-06 18:06:41 PDT ---
/home/dos/SA-corpus/ham/dos/Domains/1195543943.M277151P27837V0000000000000302I00154082_16.cyan.dostech.net\,S\=6338\:2\,S

...is an abuse report that contains an abused domain. I'd rm it from the logs.
I have from my corpus.

/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1195695047.P9700Q22.dilbert.dostech.net:2,S

...is ham. A user recommends somebody locally who I guess has spammed their
domain. I've left this in my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1204046401.M43776P15497V0000000000000302I0000C20E_0.cyan.dostech.net,S=5621:2,S

...abuse report. I'd rm it from the logs. I have from my corpus.

/home/dos/SA-corpus/ham/dos/infra-list/1253117012.M352778P19949V0000000000000302I008D1494_70.cyan.dostech.net,S=2683:2,

...abuse report. I'd rm it from the logs. I have from my corpus.



bugzilla-daemon at bugzilla

Oct 6, 2009, 6:30 PM

Post #40 of 165 (2017 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #75 from Warren Togami <wtogami [at] redhat> 2009-10-06 18:30:19 PDT ---
Might we consider assigning different confidence weights to ham corpora?

For example, my ham corpora are relatively small in number, but I have strong
confidence that they are thoroughly cleaned. Furthermore they are extremely
varied in sources and likely to be different from other masscheck participants.
I have also filtered out all discussion mailing lists and automated report
sources.

For example, I would assign the following weights to my ham corpora:
wt-en1: x2.5
wt-en2: x2
wt-en3: x1.5
wt-en5: x2
wt-en6: x1
wt-jp1: x2.5
wt-jp2: x1.5

Anyhow, just an idea. Not sure if this is helpful.



bugzilla-daemon at bugzilla

Oct 6, 2009, 11:20 PM

Post #41 of 165 (2021 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Henrik Krohns <hege [at] hege> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |hege [at] hege

--- Comment #76 from Henrik Krohns <hege [at] hege> 2009-10-06 23:20:34 PDT ---
I cleaned up my few FPs and some other stuff, new logs sent..

Talking about weights, does anyone have an academic answer on how results are
affected when some corpora are uniqued (at least mine is) and some are not?



bugzilla-daemon at bugzilla

Oct 7, 2009, 7:13 AM

Post #42 of 165 (2018 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #77 from Warren Togami <wtogami [at] redhat> 2009-10-07 07:13:03 PDT ---
Never mind about the weights idea.



bugzilla-daemon at bugzilla

Oct 7, 2009, 9:56 AM

Post #43 of 165 (2016 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #78 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-07 09:56:41 PDT ---
> I cleaned up my few FPs and some other stuff, new logs sent..

Thanks to Daryl and Henrik. I'm still waiting for bluestreak, but meanwhile
I am running garescorer on what I have (including the recent updates).

Btw, Daryl, you haven't commented on:

/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1196258008.P18803Q16.dilbert.dostech.net:2,S

/home/dos/SA-corpus/ham/leah/INBOX-Inbox-2007/1199765108.P20983Q90.dilbert.dostech.net:2,RS

/home/dos/SA-corpus/ham/dos/Inbox-2008/1221834769.M749008P21562V0000000000000302I00414902_237.cyan.dostech.net,S=26243:2,S


> Talking about weights, does anyone have an academic answer on how results are
> affected when some corpora are uniqued (at least mine is) and some are not?

Don't know. I removed exact duplicates on mail body from my corpus, although
due to 'personalized' spam which is becoming prevalent nowadays thanks to the
free CPU resources on botnets, there are still plenty of very similar yet
different messages left in the corpus. I did some manual removal on these,
but it is very impractical to be thorough.
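
For readers who want to try the exact-duplicate step, here is a minimal sketch. It is not the procedure actually used above: it assumes each message is available as raw bytes, treats everything after the first empty line as the body, and the function names are hypothetical.

```python
import hashlib

def body_of(raw):
    """Return the part after the first empty line, with line endings normalized."""
    text = raw.replace(b"\r\n", b"\n")
    _, _, body = text.partition(b"\n\n")
    return body

def drop_exact_body_duplicates(messages):
    """Keep the first message for each distinct body; later copies are dropped."""
    seen = set()
    kept = []
    for raw in messages:
        digest = hashlib.sha256(body_of(raw)).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(raw)
    return kept

# Made-up messages: the second repeats the first body, the third is 'personalized'.
msgs = [
    b"Subject: a\n\nBuy now\n",
    b"Subject: b\n\nBuy now\n",
    b"Subject: c\n\nBuy now, Bob\n",
]
print(len(drop_exact_body_duplicates(msgs)))
```

The 'personalized' third message survives, which is exactly the limitation described above: near-duplicates need fuzzy matching or manual review.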


> Might we consider assigning different confidence weights to ham corpora?
>
> For example, my ham corpora are relatively small in number, but I have strong
> confidence that they are thoroughly cleaned. Furthermore they are extremely
> varied in sources and likely to be different from other masscheck participants.
> I have also filtered out all discussion mailing lists and automated report

I do recognize that corpora are quite different in several aspects, although
I don't know how one can weight them more fairly and incorporate it into
the current procedure.

Let me just document here what I'm doing now with a local copy of all
submitted logs.

Due to a significant disproportion in the size of spam-bayes-net-dos.log
and spam-bayes-net-jm.log compared to the rest, I'm taking a random sample
of each of these files, restricted to scoreset 3 and age below 6 months,
decimated to 150,000 entries each (I initially used 100,000, but have now
bumped it up).
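
The selection can be sketched roughly as follows. This is an illustration rather than the script actually used: it assumes every log line carries `time=<epoch seconds>` and `set=<n>` key/value pairs, and the names `field`, `select_sample` and `SIX_MONTHS` are hypothetical.

```python
import random
import re

SIX_MONTHS = 183 * 24 * 3600  # rough cutoff, in seconds

def field(line, key):
    """Return the value of `key=` on the line, or None if it is absent."""
    m = re.search(r"(?:^|[\s,])" + re.escape(key) + r"=([^,\s]+)", line)
    return m.group(1) if m else None

def select_sample(lines, now, limit, scoreset="3", seed=0):
    """Keep entries of one scoreset younger than six months, then sample down."""
    eligible = []
    for line in lines:
        ts = field(line, "time")
        if field(line, "set") != scoreset or ts is None or not ts.isdigit():
            continue
        if now - int(ts) < SIX_MONTHS:
            eligible.append(line)
    if len(eligible) <= limit:
        return eligible
    # A fixed seed keeps the sample reproducible between runs.
    return random.Random(seed).sample(eligible, limit)
```

Calling `select_sample(lines, now, limit=150000)` on one of the large logs would return at most 150,000 entries; smaller logs pass through unchanged apart from the scoreset and age filter.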

There are some spam log entries older than 6 months on other spam logs, but
not too many (mostly on the 'hege' collection), but as it seems these are
mainly hand-selected fraud samples, I'm keeping these regardless of age.

Due to shortage of ham, I'm keeping it all regardless of age. This mainly
goes for JM's ham collection, which contains some (smaller) share of
older ham; the remaining collections are fairly recent.

There are no scoreset 0 and 2 entries in any of the logs. So for the
scoreset 2 and 3 runs I'm using a selection from the logs with 'set=3'.
For the scoreset 0 and 1 runs I'm using all entries (set=1 and set=3).

This all amounts to the following 'wc -l' counts:

463957 ham-full-set1.log
483402 spam-full-set1.log

293637 ham-full-set3.log
443635 spam-full-set3.log

This seems reasonably fair and balanced to me.



bugzilla-daemon at bugzilla

Oct 7, 2009, 10:30 AM

Post #44 of 165 (2029 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #79 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-07 10:30:25 PDT ---
> There are some spam log entries older than 6 months on other spam logs, but
> not too many (mostly on the 'hege' collection), but as it seems these are
> mainly hand-selected fraud samples, I'm keeping these regardless of age.

Oops, wrong id: s/hege/jhardin/



bugzilla-daemon at bugzilla

Oct 7, 2009, 10:59 AM

Post #45 of 165 (2017 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #80 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-07 10:59:07 PDT ---
The following also looks fishy:

grep -c DKIM_ADSP_DISCARD ham*.log

ham-bayes-net-bb-fredt.log 21
ham-bayes-net-bb-jhardin.log 22
ham-bayes-net-bluestreak.log 36
ham-bayes-net-hege.log 43
ham-bayes-net-wt-en6.log 35
ham-bayes-net-mmartinec.log 1
ham-bayes-net-dos.log 25
ham-bayes-net-jm.log 65

(the one entry in my collection is due to the author posting
through a mailing list, despite the fact that his domain publishes
a 'discardable' policy; so, a sender's mistake)



bugzilla-daemon at bugzilla

Oct 7, 2009, 12:37 PM

Post #46 of 165 (1999 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #81 from Warren Togami <wtogami [at] redhat> 2009-10-07 12:37:09 PDT ---
(In reply to comment #80)
> The following also looks fishy:
>
> grep -c DKIM_ADSP_DISCARD ham*.log
>
> ham-bayes-net-wt-en6.log 35
>
> (the one entry in my collection is due to the author posting
> through a mailing list, despite the fact that his domain publishes
> a 'discardable' policy; so, a sender's mistake)

These are all legitimate looking paypal mail delivered to a Yahoo account from
mid-2008 through recently.

What is DKIM_ADSP_DISCARD supposed to mean?



bugzilla-daemon at bugzilla

Oct 7, 2009, 3:56 PM

Post #47 of 165 (1992 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #82 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-07 15:56:36 PDT ---
(In reply to comment #81)
> > The following also looks fishy:
> > grep -c DKIM_ADSP_DISCARD ham*.log
> > ham-bayes-net-wt-en6.log 35
>
> These are all legitimate looking paypal mail delivered to a Yahoo account
> from mid-2008 through recently.

I'm not sure since when paypal has been signing their mail. They were certainly
signing it with DomainKeys signatures in 2006, and with DKIM in 2008.
So for very old ham mail from paypal (or ebay) it is quite possible the
signature is missing or somehow broken or unverifiable, but this shouldn't
be the case for current mail from these domains.

> What is DKIM_ADSP_DISCARD supposed to mean?

It means two things:
- that the message does not have a valid author's domain DKIM or DomainKeys
signature (e.g. there is no signature at all, or that the signature does
not match the mail contents, or that it does not match the domain name
in the From header field);
- and that the domain claims that any mail claiming to be from that domain
and failing on signature verification, should be discarded. This claim
is made by publishing a DNS record (RFC 5617), or through 'adsp_override'
configuration directive in SpamAssassin's .cf file.
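
The two conditions can be combined as in the sketch below. It is not SpamAssassin's implementation and ignores 'adsp_override'; it assumes the TXT record published at `_adsp._domainkey.<domain>` (RFC 5617) has already been fetched, and the function names are hypothetical.

```python
def adsp_policy(txt_record):
    """Extract the `dkim=` tag value from an ADSP TXT record, default 'unknown'."""
    for tag in txt_record.split(";"):
        name, _, value = tag.partition("=")
        if name.strip().lower() == "dkim":
            return value.strip().lower()
    return "unknown"

def adsp_discard_applies(has_valid_author_signature, txt_record):
    """True when there is no valid author-domain signature AND the policy is discardable."""
    return (not has_valid_author_signature) and adsp_policy(txt_record) == "discardable"

print(adsp_discard_applies(False, "dkim=discardable"))  # unsigned, strict policy
print(adsp_discard_applies(True, "dkim=discardable"))   # validly signed mail is fine
```

Note that genuine mail whose signature was broken in transit falls into the first case just as forged mail does, which is why such hits in a ham corpus look suspicious.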

So, if your mail samples are younger than a year, they do have a
DKIM-Signature in the header, and they appear to be genuine, the only
explanation for a failed signature verification is that the message got
somehow corrupted or transformed on its way to SpamAssassin in such a way
that the signature no longer matches the mail contents, or that SA could
not fetch the domain's public key, perhaps due to DNS resolver failing
or some firewall trouble.

Depending on where and how SpamAssassin is called from your mail
delivery system, and how you collected your samples (e.g. from an MTA,
from a mailbox, from some kind of a quarantine), there are different
possible reasons for mail corruption. For example, saving a mail message
source from some MUA (e.g. kmail) can rewrite/reformat some header fields.
Running some virus scanner in the mail path may add its verdict to the
mail body. Fetching it from some POP3 server or even from a webmail service
poses its own challenges to mail integrity. In some cases even a
'friendly' MTA thinks it is doing a favour by rewriting some header
fields, perhaps in the belief that they would look 'prettier'.

One way to find out is to trace the path the mail takes through
your infrastructure (firewall, MTA, virus scanners, mailbox server)
before it reaches SpamAssassin, and to carefully examine one or two
such mail samples. If you like, you may mail me some samples,
preferably as a gzip or tar.gz attachment, to make sure they won't get
transformed in transit.



bugzilla-daemon at bugzilla

Oct 8, 2009, 1:02 AM

Post #48 of 165 (2015 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #83 from Henrik Krohns <hege [at] hege> 2009-10-08 01:02:43 PDT ---
Cleaned up my DKIM_ADSP_DISCARD hits (old 2005 ebay mails removed) and some
other old stuff, logs sent..



bugzilla-daemon at bugzilla

Oct 8, 2009, 6:50 AM

Post #49 of 165 (1961 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #84 from Mark Martinec <Mark.Martinec [at] ijs> 2009-10-08 06:50:37 PDT ---
> These are all legitimate looking paypal mail delivered to a Yahoo account from
> mid-2008 through recently.

Thanks Warren for your out-of-band mail. Apart from some general comments
from my previous posting, there is a real problem regarding your method of
fetching mail for a Yahoo account. You are using FetchYahoo to download
these messages from the Yahoo webmail interface. FetchYahoo has to jump
through hoops to be able to retrieve a message as close to its original form
as possible, but there are some real obstacles there. Glancing at its source
code, it has to pull attachments separately and splice them back together
into a message, necessarily reinventing the MIME boundaries. This is enough
to render DomainKeys and DKIM signatures invalid. Apart from this, it also
converts QP and base64 encoded messages into UTF-8 binary, which again is
a sufficient reason for signature breakage. Moreover, it has to repair some
damage to header field folding and empty lines, which are broken either due to
bugs in Yahoo HTML rendering (indicated by comments in the FetchYahoo code),
or because details are simply lost in the conversion to HTML and back to mail.

This method of fetching mail is bound to cause trouble. It may quite easily
cause some other low-level SpamAssassin rules to misfire or to fail to trigger,
not just the signature verification failures.
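
To see why decoding a quoted-printable body is enough to break a DKIM signature, consider the sketch below. It covers only the body hash under DKIM's 'simple' body canonicalization (DomainKeys works differently), the sample body is made up, and the re-fetch is merely imitated by decoding it with Python's `quopri` module.

```python
import base64
import hashlib
import quopri

def simple_body_hash(body):
    """DKIM 'simple' body canonicalization, then SHA-256, base64-encoded."""
    # Drop trailing empty lines, then make sure the body ends with one CRLF.
    while body.endswith(b"\r\n\r\n"):
        body = body[:-2]
    if not body.endswith(b"\r\n"):
        body += b"\r\n"
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")

# The body as the signer saw it, still quoted-printable encoded (made-up text).
as_sent = b"Caf=C3=A9 receipt attached\r\n"
# The same body after a fetch tool decoded it to raw UTF-8 bytes.
as_refetched = quopri.decodestring(as_sent)

print(simple_body_hash(as_sent) == simple_body_hash(as_refetched))
```

The signer hashed the encoded bytes, so any tool that stores the decoded form hands the verifier a different byte sequence, and the comparison of body hashes fails even though the visible text is unchanged.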



bugzilla-daemon at bugzilla

Oct 8, 2009, 10:16 AM

Post #50 of 165 (1990 views)
Permalink
[Bug 6155] generate new scores for 3.3.0 release [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #85 from Warren Togami <wtogami [at] redhat> 2009-10-08 10:15:55 PDT ---
I guess we have no choice but to drop wt-en6 from the rescore GA.

Should I drop it from nightly masscheck as well?

