
KMcGrail at PCCC
Aug 8, 2012, 8:34 AM
Post #11 of 29
Re: Script found that is aborting from insufficient ham
On 8/8/2012 10:17 AM, John Hardin wrote:
> On Wed, 8 Aug 2012, Kevin A. McGrail wrote:
>
>> On 8/7/2012 10:14 AM, John Hardin wrote:
>>> On Tue, 7 Aug 2012, Kevin A. McGrail wrote:
>>>
>>> > Anyone else seeing missing corpora?
>>> > > Is this possibly a problem where corpora are not being included?
>>>
>>> My uploaded corpora are not _missing_, but the number of messages
>>> reported for them in the corpora report on the masscheck results
>>> pages are far lower than what is being uploaded. I've started rsync
>>> back down to verify and it's apparently not a matter of the upload
>>> failing. And I do filter by date before uploading so it's not a
>>> matter of my counting ten thousand messages from 2002.
>>
>> Can you point me out the masscheck page that you are seeing the
>> difference on?
>
> On any masscheck report, it's listed in two places:
>
> (1) in the "set 0, broken down by contributor" you can hover over the
> hits for spam and ham for every corpus/result set and see the hits and
> total messages used to calculate the percentage
>
> (2) at the bottom if you expand the "Corpus quality" report and see a
> more detailed breakdown of the corpus/results contents
>
> Here are my corpora counts at my end (by the number of '^From\s'):
>
> fraud/spam: 5613
> fraud/ham: 0
> public/spam: 7173
> public/ham: 6069
>
> Here are the numbers from the Corpus Quality report:
>
> bb-jhardin_fraud   Spam messages   Ham messages
> TOTAL:             17 (0%)         1 (0%)
>
> bb-jhardin         Spam messages   Ham messages
> TOTAL:             100 (0%)        235 (0%)
>
> I don't know where the single message in the fraud/ham corpus is from,
> I may have uploaded a single dummy and forgotten about it.
>
> You can see the other corpora are either being counted/parsed
> incorrectly or are being filtered somehow.
>
> Strangely enough, the count for the public/spam corpus is different
> between the "set 0" count and the "Corpus quality" report: 67 vs. 100.

Thanks. Can you confirm the exact URL you are visiting for this report?
I want to remove all assumptions from the mix.

> I'm not uploading logs, I'm uploading the message corpora for
> centralized masschecks.

Are you sure? Are you uploading anything other than the logs? I show
masscheck logs like these because you aren't actually uploading the
emails (which is correct, I believe):

-rw-r--r-- 1 rsync rsync  391675 Aug  8 09:15 ham-bb-jhardin.log
-rw-r--r-- 1 rsync rsync  391679 Aug  7 09:16 ham-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync    1145 Aug  8 09:17 ham-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync    1145 Aug  7 09:19 ham-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync rsync  419449 Aug  4 09:07 ham-net-bb-jhardin.log
-rw-r--r-- 1 rsync rsync  420618 Jul 28 09:06 ham-net-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync    1220 Aug  4 09:09 ham-net-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync    1220 Jul 28 09:08 ham-net-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync root  4639820 Oct  1  2009 ham-rescore-bb-jhardin.log
-rw-r--r-- 1 rsync rsync  222982 Aug  8 09:15 spam-bb-jhardin.log
-rw-r--r-- 1 rsync rsync  226858 Aug  7 09:16 spam-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync   67181 Aug  8 09:17 spam-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync   67181 Aug  7 09:19 spam-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync rsync  226058 Aug  4 09:07 spam-net-bb-jhardin.log
-rw-r--r-- 1 rsync rsync  232278 Jul 28 09:06 spam-net-bb-jhardin.log~
-rw-r--r-- 1 rsync rsync   37934 Aug  4 09:09 spam-net-bb-jhardin_fraud.log
-rw-r--r-- 1 rsync rsync   25983 Jul 28 09:08 spam-net-bb-jhardin_fraud.log~
-rw-r--r-- 1 rsync root  2491637 Oct  1  2009 spam-rescore-bb-jhardin.log

>
>> Can you remind me of the issue so I can respond intelligently?
>
> When I run masschecks locally against an up-to-date repo, it is not
> setting the message boundary RE properly and gets scads of
> uninitialized variable errors trying to parse the corpus mailbox
> files. Last I looked, I added some warn() output and it was setting
> the default RE properly but then appeared to be resetting it later
> somewhere.
>

Sorry about that. I've reopened the bug. I thought that was resolved by
the conf changes Mark Martinec made, so I dropped it.

Regards,
KAM
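For anyone who wants to reproduce the local counts quoted in this thread (the number of lines matching '^From\s' in each mailbox file), here is a minimal sketch. It is not part of the masscheck tooling; the function name count_mbox_messages and the sample path in the usage comment are hypothetical, and it assumes each corpus is a single mbox-style file.

```python
import re

# Same pattern as quoted in the thread: "From" at the start of a line,
# followed by whitespace. Compiled against bytes so that odd encodings
# in spam do not raise decoding errors.
FROM_LINE = re.compile(rb"^From\s")


def count_mbox_messages(path):
    """Count lines matching ^From\\s in one mailbox file (hypothetical helper)."""
    count = 0
    with open(path, "rb") as fh:
        for line in fh:
            if FROM_LINE.match(line):
                count += 1
    return count


# Usage (the path is an assumption, substitute your own corpus file):
#   print(count_mbox_messages("public/spam.mbox"))
```

A count obtained this way can be slightly high if a message body contains an unescaped line starting with "From ", so treat it as an upper bound when comparing against the Corpus Quality report.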