Mailing List Archive: SpamAssassin: devel

[Bug 6703] sa-learn doesn't work with kmail 2 mbox format

bugzilla-daemon at bugzilla

May 16, 2012, 3:36 PM

Post #1 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

Mark Martinec <Mark.Martinec [at] ijs> changed:

What |Removed |Added
----------------------------------------------------------------------------
Target Milestone|Undefined |3.4.0
Summary|sa-learn doesn't work with |sa-learn doesn't work with
|--mbox |kmail 2 mbox format

--
You are receiving this mail because:
You are the assignee for the bug.


bugzilla-daemon at bugzilla

May 17, 2012, 12:39 PM

Post #2 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #18 from Thomas Arend <thomas [at] arend-rhb> ---
Thanks to the hint about ArchiveIterator.pm, I found two regexes that check the
From_ line in Mail::SpamAssassin::ArchiveIterator and replaced them with the
following regex (I am running version 3.3.1):

/From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

This replacement works fine for me and is compatible with both mbox date
formats. My pattern checks the date more strictly than the old pattern, which
may be an advantage or a disadvantage.

diff (orig) (new)
399c399
< last if (substr($_,0,5) eq "From " && @msg && /^From \S+ ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/);
---
> last if (substr($_,0,5) eq "From " && @msg && /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ );
912c912,913
< /^From \S+ ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}/) {
---
> /From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/ ) {
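
For readers who want to try the pattern outside SpamAssassin, here is a minimal Python sketch of the same expression (with the leading ^ that the follow-up comment adds). It is not the Perl code from ArchiveIterator.pm; the example.com addresses and the name STRICT_FROM are made up for illustration.

```python
import re

DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Python rendering of the proposed pattern: weekday, then either the
# "Sat, 19 May 2012 00:10:44 +0200" form or the "Sat May 19 00:16:57 2012" form.
STRICT_FROM = re.compile(
    r"^From \S+ ?(" + DAYS + r")"
    r"(, \d\d (" + MONTHS + r") \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}"
    r"| (" + MONTHS + r") [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})"
)

# Hypothetical sample lines (addresses are placeholders).
rfc_style = "From user@example.com Sat, 19 May 2012 00:10:44 +0200"
asctime_style = "From user@example.com Sat May 19 00:16:57 2012"
body_text = "From the start, this was odd."

for line in (rfc_style, asctime_style, body_text):
    print(bool(STRICT_FROM.match(line)), line)
```

Both date layouts are accepted, while an ordinary body line that merely starts with "From " is not.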

bugzilla-daemon at bugzilla

May 17, 2012, 12:50 PM

Post #3 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #19 from Thomas Arend <thomas [at] arend-rhb> ---
Oops, there was a small copy error. There must be a ^ before "From ":

/^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

bugzilla-daemon at bugzilla

May 17, 2012, 1:24 PM

Post #4 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #20 from Kevin A. McGrail <kmcgrail [at] pccc> ---
(In reply to comment #19)
> Upps there was a small copy error. There must be a ^ before "From "
>
> /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/ )

This regex scares me because of the localization issue. For example, Lun for
Monday in Spanish.
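
To make the concern concrete, here is a small Python sketch (not SpamAssassin code) that runs the English-only pattern against a From_ line carrying a Spanish weekday abbreviation. The sample line is hypothetical; nothing in this thread confirms that any mail client actually writes localized names there.

```python
import re

DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Python rendering of the pattern with hard-coded English names.
ENGLISH_ONLY = re.compile(
    r"^From \S+ ?(" + DAYS + r")"
    r"(, \d\d (" + MONTHS + r") \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}"
    r"| (" + MONTHS + r") [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})"
)

# Hypothetical lines: one English, one with "Lun" in place of "Mon".
english = "From user@example.com Mon May 21 09:15:00 2012"
spanish = "From user@example.com Lun May 21 09:15:00 2012"

print(bool(ENGLISH_ONLY.match(english)))  # separator recognised
print(bool(ENGLISH_ONLY.match(spanish)))  # separator missed
```

If such a line existed, it would not be treated as a message boundary, so the message would be folded into the previous one.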

bugzilla-daemon at bugzilla

May 17, 2012, 1:26 PM

Post #5 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #21 from Kevin A. McGrail <kmcgrail [at] pccc> ---
(In reply to comment #7)
> However, I can't find the ticket but I made some small tweaks recently-ish
> to trunk to help another mbox format issue. I wonder if trunk can parse
> this new format?

That was bug 6413, and no: it was for CommuniGate, not kmail. It is not relevant
here, except that it touches on the same lack of standardization across mbox
formats.

bugzilla-daemon at bugzilla

May 17, 2012, 2:38 PM

Post #6 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

John Hardin <jhardin [at] impsec> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |jhardin [at] impsec

--- Comment #22 from John Hardin <jhardin [at] impsec> ---
(In reply to comment #20)
> (In reply to comment #19)
> > Upps there was a small copy error. There must be a ^ before "From "
> >
> > /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> > 0-2]\d:\d\d:\d\d \d{4})/ )
>
> This regex scares me because of the localization issue. For example, Lun
> for Monday in Spanish.

Agreed.

How about:

/^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

I'm assuming [:upper:] and [:lower:] will match accented characters properly. I
haven't tested that assumption.
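
A rough way to experiment with this idea is the Python sketch below. Python's re module has no POSIX bracket classes, so [A-Z] and [a-z] stand in for [[:upper:]] and [[:lower:]]; that substitution is an assumption of this sketch and says nothing about how Perl treats accented characters. The sample lines are hypothetical.

```python
import re

# ASCII approximation of "one upper-case letter followed by two lower-case".
ABBR = r"[A-Z][a-z]{2}"

GENERIC_FROM = re.compile(
    r"^From \S+ ?" + ABBR +
    r"(?:, \d\d " + ABBR + r" \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}"
    r"| " + ABBR + r" [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})"
)

# Hypothetical lines with English, Spanish and accented abbreviations.
english = "From user@example.com Mon May 21 09:15:00 2012"
spanish = "From user@example.com Lun May 21 09:15:00 2012"
accented = "From user@example.com S\u00e1b May 19 00:16:57 2012"

for line in (english, spanish, accented):
    print(bool(GENERIC_FROM.match(line)), line)
```

With the ASCII stand-in, "Mon" and "Lun" are both accepted but "Sáb" is not, which is exactly the case the untested assumption would need to cover.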

bugzilla-daemon at bugzilla

May 19, 2012, 7:41 PM

Post #7 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #23 from Kevin A. McGrail <kmcgrail [at] pccc> ---
Created attachment 5070
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5070&action=edit
Patch to add options for defining the ArchiveIterator From regex as a Conf file
option

> How about:
>
> /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/
>
> I'm assuming [:upper:] and [:lower:] will match accented characters
> properly. I haven't tested that assumption.

I don't know enough about foreign languages to know for sure the format is
always leading caps, etc. So I went ahead and wrote the patch to move this to a
configurable option.

It appears to work when tested against the previously attached mbox containing
3 ham messages.

"Learned tokens from 3 message(s) (3 message(s) examined)"

Thoughts?

KAM
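
The general idea of a caller-supplied separator regex can be sketched in a few lines of Python. This is not the attached patch and does not show the option name it adds; split_mbox, DEFAULT_FROM_RE and the sample data are hypothetical names used only for illustration. The default below is the old pattern quoted in the diff earlier in this thread.

```python
import re

# Old separator pattern from the diff (asctime-style dates only).
DEFAULT_FROM_RE = r"^From \S+ ?\S\S\S \S\S\S .\d .\d:\d\d:\d\d \d{4}"


def split_mbox(text, from_re=DEFAULT_FROM_RE):
    """Split mbox text into messages at lines matching from_re."""
    sep = re.compile(from_re)
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        # Start a new message only if one is already being collected.
        if sep.match(line) and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages


# Hypothetical two-message mbox using the comma-style date layout.
kmail2_mbox = (
    "From a@example.com Sat, 19 May 2012 00:10:44 +0200\n"
    "Subject: one\n\nbody one\n"
    "From b@example.com Sat, 19 May 2012 00:11:02 +0200\n"
    "Subject: two\n\nbody two\n"
)
custom_re = r"^From \S+ ?\S\S\S, \d\d \S\S\S \d{4} "

print(len(split_mbox(kmail2_mbox)))             # default separator
print(len(split_mbox(kmail2_mbox, custom_re)))  # user-supplied separator
```

With the default pattern the whole file is read as a single message; with the user-supplied pattern both messages are found.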

bugzilla-daemon at bugzilla

May 20, 2012, 4:54 AM

Post #8 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #24 from Thomas Arend <thomas [at] arend-rhb> ---


(In reply to comment #22)
> (In reply to comment #20)
> > (In reply to comment #19)
> > > Upps there was a small copy error. There must be a ^ before "From "
> > >
> > > /^From \S+ ?(Mon|Tue|Wed|Thu|Fri|Sat|Sun)(, \d\d
> > > (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4} [0-2]\d:\d\d:\d\d
> > > [+-]\d{4}| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ 1-3]\d [
> > > 0-2]\d:\d\d:\d\d \d{4})/ )
> >
> > This regex scares me because of the localization issue. For example, Lun
> > for Monday in Spanish.
>
> Agreed.
>
> How about:
>
> /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/
>
> I'm assuming [:upper:] and [:lower:] will match accented characters
> properly. I haven't tested that assumption.

Are you sure that we have a localization issue in the header fields? I use the
German versions of Outlook Express, Thunderbird, Kmail and Evolution. The time
stamps in the header are not localized.

bugzilla-daemon at bugzilla

May 20, 2012, 5:09 AM

Post #9 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #25 from Thomas Arend <thomas [at] arend-rhb> ---
It's getting worse with the kmail folks. They sometimes put two From_ lines in
the mbox file.

From thomas [at] arend-rhb Sat, 19 May 2012 00:10:44 +0200
From thomas [at] arend-rhb Sat May 19 00: 16:57 2012
[..]

In the second line you can see that they spend an extra space, in case the hour
ever extends to more than 99 minutes in the future.

I will report this to kmail.

Have a nice Sunday!
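
As a quick check of how the stricter pattern from earlier in this thread treats those two lines, here is a Python sketch (not SpamAssassin code). The obfuscated address is replaced by a placeholder example.com address so that \S+ has something to match.

```python
import re

DAYS = "Mon|Tue|Wed|Thu|Fri|Sat|Sun"
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

STRICT_FROM = re.compile(
    r"^From \S+ ?(" + DAYS + r")"
    r"(, \d\d (" + MONTHS + r") \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}"
    r"| (" + MONTHS + r") [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})"
)

# The two From_ lines quoted above, with a placeholder address.
first = "From user@example.com Sat, 19 May 2012 00:10:44 +0200"
second = "From user@example.com Sat May 19 00: 16:57 2012"

print(bool(STRICT_FROM.match(first)))   # well-formed line
print(bool(STRICT_FROM.match(second)))  # line with the stray space
```

The first line is accepted; the second is rejected because of the space between "00:" and "16".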

bugzilla-daemon at bugzilla

May 20, 2012, 5:11 AM

Post #10 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #26 from Thomas Arend <thomas [at] arend-rhb> ---
(In reply to comment #22)

>
> /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> 0-2]\d:\d\d:\d\d \d{4})/

I don't see the need for the sequence "(?:," in the regex. For me it works with
"(,"

bugzilla-daemon at bugzilla

May 20, 2012, 5:16 AM

Post #11 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #27 from Thomas Arend <thomas [at] arend-rhb> ---
Created attachment 5071
--> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5071&action=edit
Test mbox showing two From_ lines

This is an export of four messages which show two From_ lines.

bugzilla-daemon at bugzilla

May 20, 2012, 6:41 AM

Post #12 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #28 from Thomas Arend <thomas [at] arend-rhb> ---
(In reply to comment #27)
> Created attachment 5071 [details]
> Test mbox showing two From_ lines
>
> This is an export of four messages which show two from lines.

I just found that the corrupted From_ line was added when filtering through
formail. Don't ask me why. formail works as expected on the command line.

bugzilla-daemon at bugzilla

May 20, 2012, 9:54 AM

Post #13 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #29 from John Hardin <jhardin [at] impsec> ---
(In reply to comment #26)
> (In reply to comment #22)
>
> > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> > 0-2]\d:\d\d:\d\d \d{4})/
>
> I doin't see the need of the sequence "(?:," in the regex. For me it works
> with "(,"

There is no reason to capture the results of the match for later use. (?:...)
is a non-capturing match, which is slightly more efficient.
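
The difference is easy to see in a short Python sketch, since Python uses the same (?:...) syntax. This only shows what each form records; it makes no claim about relative speed in Perl. The pattern is shortened and the sample line is hypothetical.

```python
import re

line = "From user@example.com Sat, 19 May 2012 00:10:44 +0200"

# Same shortened pattern twice: once with "(", once with "(?:".
capturing = re.compile(
    r"^From \S+ ?[A-Z][a-z]{2}(, \d\d [A-Z][a-z]{2} \d{4}| [A-Z][a-z]{2} [ 1-3]\d)"
)
non_capturing = re.compile(
    r"^From \S+ ?[A-Z][a-z]{2}(?:, \d\d [A-Z][a-z]{2} \d{4}| [A-Z][a-z]{2} [ 1-3]\d)"
)

print(capturing.match(line).groups())      # (', 19 May 2012',)
print(non_capturing.match(line).groups())  # ()
```

Both patterns match the same lines; the capturing form additionally stores the matched date text, which nothing in the separator check uses.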

To allow for variations in whitespace, perhaps:

/^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/
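
A Python sketch of the effect of \s{1,5}, again using [A-Z] and [a-z] as stand-ins for the POSIX classes (an assumption of the sketch, since Python's re module lacks them). The sample lines and names are hypothetical.

```python
import re

DATE = (
    r"[A-Z][a-z]{2}(?:, \d\d [A-Z][a-z]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}"
    r"| [A-Z][a-z]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})"
)

single_space = re.compile(r"^From \S+ ?" + DATE)            # earlier proposal
flexible = re.compile(r"^From\s{1,5}\S+\s{1,5}" + DATE)     # whitespace-tolerant

# Hypothetical lines with one and two spaces before the weekday.
one_space = "From user@example.com Sat May 19 00:16:57 2012"
two_spaces = "From user@example.com  Sat May 19 00:16:57 2012"

for name, rx in (("single_space", single_space), ("flexible", flexible)):
    print(name, bool(rx.match(one_space)), bool(rx.match(two_spaces)))
```

Only the \s{1,5} variant accepts the line with two spaces after the address. Note that it requires at least one whitespace character there, whereas " ?" also allowed none.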

bugzilla-daemon at bugzilla

May 20, 2012, 10:09 AM

Post #14 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #30 from Kevin A. McGrail <kmcgrail [at] pccc> ---
(In reply to comment #29)
> (In reply to comment #26)
> > (In reply to comment #22)
> >
> > > /^From \S+ ?[[:upper:]][[:lower:]]{2}(?:, \d\d [[:upper:]][[:lower:]]{2}
> > > \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}| [[:upper:]][[:lower:]]{2} [ 1-3]\d [
> > > 0-2]\d:\d\d:\d\d \d{4})/
> >
> > I doin't see the need of the sequence "(?:," in the regex. For me it works
> > with "(,"
>
> There is no reason to capture the results of the match for later use.
> (?:...) is a non-capturing match, which is slightly more efficient.
>
> To allow for variations in whitespace, perhaps:
>
> /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d
> [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}|
> [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

Well, more important than the specific kmail 2 mbox regexes: does the patch I
wrote, which lets someone use /^Mickey Mouse$/ as their mbox separator regular
expression, work?

Regards,
KAM

bugzilla-daemon at bugzilla

May 20, 2012, 12:33 PM

Post #15 of 15
[Bug 6703] sa-learn doesn't work with kmail 2 mbox format [In reply to]

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6703

--- Comment #31 from Thomas Arend <thomas [at] arend-rhb> ---
(In reply to comment #29)
> To allow for variations in whitespace, perhaps:
>
> /^From\s{1,5}\S+\s{1,5}[[:upper:]][[:lower:]]{2}(?:, \d\d
> [[:upper:]][[:lower:]]{2} \d{4} [0-2]\d:\d\d:\d\d [+-]\d{4}|
> [[:upper:]][[:lower:]]{2} [ 1-3]\d [ 0-2]\d:\d\d:\d\d \d{4})/

I have never seen more than one space after the "From", but I have seen one or
two spaces after the e-mail address, before the weekday.
