
Mailing List Archive: SpamAssassin: users

__DRUG_MUSCLE1 false-positives

 

 



dfs at roaringpenguin

May 17, 2012, 8:18 AM

Post #1 of 7 (462 views)
__DRUG_MUSCLE1 false-positives

Hi,

We have a Swedish customer who is seeing lots of DRUG_MUSCLE FP's. It
turns out that __DRUG_MUSCLE1 is triggering on the common Swedish
phrase "som är".

I looked at the regex and it seems that Perl treats "är" as having a
word boundary in the \b sense between the "ä" and the "r".

Maybe rewrite as follows (untested):

body __DRUG_MUSCLE1 /(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m[_\W]{0,3}[a4\xE0-\xE6@][_\W]{0,3}(?!\w)/i
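
A rough way to check the proposed pattern outside SpamAssassin is sketched below in Python rather than Perl. It uses byte patterns so that, as under the "C" locale, only ASCII letters, digits and underscore count as word characters. The shipped rule is not quoted in this thread, so `ASSUMED_ORIGINAL` is only a guess that differs from the proposal by ending in \b; the sample strings are made up for illustration.

```python
import re

# Proposed pattern from the post, copied verbatim (as a bytes pattern).
PROPOSED = rb"(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m[_\W]{0,3}[a4\xE0-\xE6@][_\W]{0,3}(?!\w)"

# ASSUMPTION: the real rule is not shown in the thread. This variant only
# replaces the trailing (?!\w) with \b to illustrate the reported behaviour.
ASSUMED_ORIGINAL = rb"(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m[_\W]{0,3}[a4\xE0-\xE6@][_\W]{0,3}\b"

# "som är" as Latin-1 bytes: the "ä" is the single byte 0xE4.
swedish = "som är".encode("latin-1")
spammy = b"buy soma now"  # made-up sample that the rule is meant to catch


def hits(pattern: bytes, data: bytes) -> bool:
    """Return True if the case-insensitive pattern matches anywhere in data."""
    return re.search(pattern, data, re.IGNORECASE) is not None


print(hits(ASSUMED_ORIGINAL, swedish))  # True: \b fires between 0xE4 and "r"
print(hits(PROPOSED, swedish))          # False: "r" is a word character
print(hits(PROPOSED, spammy))           # True: the intended target still matches
```

The difference comes from the last assertion in the pattern: \b only needs a word/non-word transition, which the byte 0xE4 followed by "r" provides, while (?!\w) refuses any match that is directly followed by a word character.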

Regards,

David.


Jason_Haar at trimble

May 17, 2012, 12:26 PM

Post #2 of 7 (447 views)
Re: __DRUG_MUSCLE1 false-positives

On 18/05/12 03:18, David F. Skoll wrote:
>
> I looked at the regex and it seems that Perl treats "är" as having a
> word boundary in the \b sense between the "ä" and the "r".
A bit OT, but is it because your perl is running under "C" locale
instead of se? i.e. would the word boundary definition change under
different localization contexts? Doesn't help solve the problem for you,
but it certainly flags a potential issue with a tonne of the rules in SA...
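
As a small illustration of how the definition of a "word character" moves the boundary, here is a Python sketch (Python's `re` module, not Perl, so it is only an analogy for what SA sees). `re.ASCII` restricts \w to ASCII, roughly like matching under the "C" locale, while the default for `str` patterns is Unicode-aware.

```python
import re

phrase = "som är"

# ASCII-only word characters: "ä" is not \w, so a boundary appears inside "är".
ascii_words = re.findall(r"\w+", phrase, flags=re.ASCII)

# Unicode-aware matching (the default for str patterns in Python 3).
unicode_words = re.findall(r"\w+", phrase)

print(ascii_words)    # ['som', 'r']
print(unicode_words)  # ['som', 'är']
```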


--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1


darxus at chaosreigns

May 17, 2012, 12:54 PM

Post #3 of 7 (445 views)
Re: __DRUG_MUSCLE1 false-positives

On 05/18, Jason Haar wrote:
> A bit OT, but is it because your perl is running under "C" locale
> instead of se? i.e. would the word boundary definition change under
> different localization contexts? Doesn't help solve the problem for you,
> but it certainly flags a potential issue with a tonne of the rules in SA...

Locale handling is a known problem in SA:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

--
"Life is either a daring adventure or it is nothing at all."
- Helen Keller
http://www.ChaosReigns.com


dfs at roaringpenguin

May 17, 2012, 1:35 PM

Post #4 of 7 (447 views)
Re: __DRUG_MUSCLE1 false-positives

On Fri, 18 May 2012 07:26:56 +1200
Jason Haar <Jason_Haar [at] trimble> wrote:

> > I looked at the regex and it seems that Perl treats "är" as having a
> > word boundary in the \b sense between the "ä" and the "r".
> A bit OT, but is it because your perl is running under "C" locale
> instead of se?

Ah... could be. Hmm, ok. Maybe I'll suggest to the customer to run
under the "se" locale.

On Thu, 17 May 2012 15:54:43 -0400
darxus [at] chaosreigns wrote:

> Locale handling is a known problem in SA:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

Ugh. I agree with forcing everything to UTF-8, but that's a lot
of work. Definitely worth doing, though.

Regards,

David.


Jason_Haar at trimble

May 17, 2012, 1:37 PM

Post #5 of 7 (444 views)
Re: __DRUG_MUSCLE1 false-positives

On 18/05/12 07:54, darxus [at] chaosreigns wrote:
> Locale handling is a known problem in SA:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

bug opened in 2004 :-(

I'm no linguist but this is probably an extremely hard problem to solve.
An email can have mixtures of languages, so in a perfect world we should
be able to change locale per word (or per char? - eeek!). This also
bleeds into the issues surrounding how "ok_locales" doesn't work (as
desired) in the modern UTF world too. ie SA would need to "know" what
locales an email contains (which helps ok_locales) so that it can then
dynamically change word boundary definitions/etc for rules. Yuck.

Perhaps this should be just classified as a bug in perl and forgotten
about ;-) [does python,etc handle this any better?]

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1


dfs at roaringpenguin

May 17, 2012, 1:58 PM

Post #6 of 7 (446 views)
Re: __DRUG_MUSCLE1 false-positives

On Fri, 18 May 2012 08:37:07 +1200
Jason Haar <Jason_Haar [at] trimble> wrote:

> I'm no linguist but this is probably an extremely hard problem to
> solve. An email can have mixtures of languages, so in a perfect world
> we should be able to change locale per word (or per char? - eeek!).

The only sane solution is to re-encode everything in UTF-8. (You can
remember the original character set for the purpose of "ok_locales",
but because UTF-8 is becoming more common, ok_locales is becoming less
useful.)

Of course, the re-encoding could lose some valuable information that
might be useful for rules :( so you may want a separate class of rules
that operate on the original pristine message.
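
A minimal sketch of that idea, in Python for illustration only: decode the body using its declared charset, keep the pristine bytes and the charset name alongside the decoded text, and re-encode to UTF-8 on demand. The names `NormalizedBody` and `normalize_body` are hypothetical, and the Latin-1 fallback is an assumption, not something SA or any product mentioned here is known to do.

```python
from dataclasses import dataclass


@dataclass
class NormalizedBody:
    original_bytes: bytes   # pristine payload, for rules that need the raw message
    original_charset: str   # remembered for ok_locales-style checks
    text: str               # decoded text, to be matched with Unicode semantics

    def utf8(self) -> bytes:
        """Return the body re-encoded as UTF-8."""
        return self.text.encode("utf-8")


def normalize_body(raw: bytes, declared_charset: str) -> NormalizedBody:
    """Decode raw using the declared charset, keeping the original around."""
    try:
        text = raw.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        # ASSUMPTION: fall back to Latin-1, which can decode any byte string.
        text = raw.decode("latin-1")
    return NormalizedBody(raw, declared_charset, text)


body = normalize_body(b"som \xe4r", "iso-8859-1")
print(body.text)    # som är
print(body.utf8())  # b'som \xc3\xa4r'
```

Rules written against `text` see "är" as one word, while a separate class of rules could still inspect `original_bytes`.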

> Perhaps this should be just classified as a bug in perl and forgotten
> about ;-)

No, I don't think so. In our commercial software, we actually went to
the trouble of converting everything to UTF-8. It helps a lot,
especially for Bayes.

Regards,

David.


uhlar at fantomas

May 18, 2012, 4:45 AM

Post #7 of 7 (441 views)
Re: __DRUG_MUSCLE1 false-positives

>On 18/05/12 03:18, David F. Skoll wrote:
>> I looked at the regex and it seems that Perl treats "är" as having a
>> word boundary in the \b sense between the "ä" and the "r".

On 18.05.12 07:26, Jason Haar wrote:
>A bit OT, but is it because your perl is running under "C" locale
>instead of se? i.e. would the word boundary definition change under
>different localization contexts? Doesn't help solve the problem for you,
>but it certainly flags a potential issue with a tonne of the rules in SA...

SA would need to switch to the correct locale before processing the
e-mail to avoid this error. The correct locale could be
different for different users and even for different mails.

I'm not sure if this is a way to go, although there may be single cases
where it helps.

I'm more in favor of advanced processing, watching different languages
and/or comparing matching strings for words in different languages,
e.g. FRT_SOMA misfiring for word "somar" (donkey), FRT_PENIS1 for
"penize" (money), FUZZY_CREDIT for "kredit" (credit) etc.
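
One way to picture the "compare matching strings against words in other languages" idea is the Python sketch below. It is only an illustration: `TOY_RULE` is a made-up stand-in, not the real FRT_SOMA rule, and `BENIGN_WORDS` just holds the three examples above. A hit is suppressed when the full word around the match is a known legitimate word.

```python
import re

# Hypothetical allowlist, seeded with the examples from this post.
BENIGN_WORDS = {"somar", "penize", "kredit"}

# Toy stand-in for a fuzzy drug-name rule; NOT the actual SA rule.
TOY_RULE = re.compile(r"s[o0]m[a4]", re.IGNORECASE)


def enclosing_word(text: str, match: re.Match) -> str:
    """Grow the match outwards to the full alphanumeric word around it."""
    start, end = match.span()
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[start:end]


def rule_fires(text: str) -> bool:
    """Return True only if some hit is not part of a known benign word."""
    for match in TOY_RULE.finditer(text):
        if enclosing_word(text, match).lower() not in BENIGN_WORDS:
            return True
    return False


print(rule_fires("kupil som somar"))  # False: the hit sits inside "somar"
print(rule_fires("buy soma now"))     # True: "soma" stands alone
```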

--
Matus UHLAR - fantomas, uhlar [at] fantomas ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Remember half the people you know are below average.
