jarif at iki
Jul 2, 2012, 4:37 PM
Post #21 of 21
Re: "jarif" corpus on Spamassassin masschecks
On 3.7.2012 2:24, darxus [at] chaosreigns wrote:
> On 07/02, RW wrote:
>> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
>> John Hardin wrote:
>>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>> That says to not include any _spams_ received via those channels, not
>>> to discard them _in toto_.
>> It actually says:
>> DO NOT include such mail in either ham or spam folder. Just delete it.
>> Why? We don't want to count these as spam, causing false marks against
>> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
>> as ham either, because spam URL's or spam text would throw off the
>> statistics if they show up in the ham folder. Simply delete them
> Jari had been deleting non-spam from facebook. As John said, that wiki
> page says to not include *spam* from places like facebook. Legit mail
> from facebook, which Jari had been deleting, has value when appropriately
> reported as non-spam.
The more or less final version of my script now deletes only 2 HAMs from
the whole corpus.
bin/delete-unwanted-mail.sh: removing unwanted HAM mail from corpus
bin/delete-unwanted-mail.sh: removing unwanted SPAM mail from corpus
Those were not really bad ham, but they contained ^List-Id AND
^Received:.*MAILER-DAEMON in an attachment. I won't bother doing
anything about those; they are rare examples of HAM. They were sent by
ezmail from Debian because something was wrong on my server and they
tried to send a list post to me.
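
For anyone who does want to avoid that kind of false match, here is a minimal sketch of one way to test only the top-level headers, so the same lines inside an attachment or body are ignored. This is not what bin/delete-unwanted-mail.sh does; the function name is_unwanted and the exact criteria (a List-Id header plus a Received header mentioning MAILER-DAEMON) are assumptions taken from the two patterns mentioned above.

```python
import re
from email import message_from_string

# Pattern assumed from the post: a Received header that mentions MAILER-DAEMON.
MAILER_DAEMON = re.compile(r"MAILER-DAEMON")


def is_unwanted(raw_message: str) -> bool:
    """Return True only if the TOP-LEVEL headers match both patterns.

    Hypothetical helper: header lines that merely appear in the body or in
    an attached message are not consulted.
    """
    msg = message_from_string(raw_message)

    # msg.get() and msg.get_all() look at the outermost header block only.
    has_list_id = msg.get("List-Id") is not None
    received = msg.get_all("Received", [])
    has_daemon_hop = any(MAILER_DAEMON.search(str(value)) for value in received)

    return has_list_id and has_daemon_hop
```

A line-based match over the whole file cannot tell a real header from a header quoted inside a bounce, which is why the sketch parses the message first.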
Only two deleted.
Among the lucky, you are the chosen one.