
Mailing List Archive: SpamAssassin: users

dealing with subjects forged with accented letters

 

 



msz at astrouw

Feb 5, 2004, 1:37 AM

Post #1 of 6 (266 views)
dealing with subjects forged with accented letters

Hi,

Is there a simple way to deal with SPAM subjects written using accented letters?
The simplest would be piping the Subject line through a simple "tr"-like filter
before applying SA checks, but can it be done?

regards, Michal.


jens-sender-8130a1 at spamfreemail

Feb 5, 2004, 7:43 AM

Post #2 of 6 (268 views)
Re: dealing with subjects forged with accented letters [In reply to]

Michal Szymanski wrote:

> Hi,
>
> Is there a simple way to deal with SPAM subjects written using accented
> letters? The simplest would be piping the Subject line through a simple
> "tr"-like filter before applying SA checks, but can it be done?

http://sandgnat.com/cmos/cmos.jsp


--
Jens Benecke (jens at spamfreemail.de)
http://www.hitchhikers.de - free Europe-wide ride-sharing service
http://www.spamfreemail.de - 100% clean mailboxes, guaranteed!
http://www.rb-hosting.de - PHP from 9€ - SSH from 19€ - inexpensive traffic


Robert at Menschel

Feb 6, 2004, 12:23 AM

Post #3 of 6 (265 views)
Re: dealing with subjects forged with accented letters [In reply to]

Hello Michal,

Thursday, February 5, 2004, 12:37:08 AM, you wrote:

MS> Is there a simple way to deal with SPAM subjects written using accented letters?
MS> The simplest would be piping the Subject line through a simple "tr"-like filter
MS> before applying SA checks, but can it be done?

I use the following (we get foreign email, but since we only understand
English, we expect all subject headings to be in English):

header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
describe RM_sl_ForeignChar Subject contains foreign character apparently embedded within a word
score RM_sl_ForeignChar 3.000 # 413s/0h of 97268 corpus (79437s/17831h) 01/24/04
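
To see what this pattern does and does not catch, here is a minimal sketch that applies the same expression with Python's re module. It is only an illustration: SpamAssassin evaluates the rule as a Perl regex against the header as it receives it, so results on encoded headers may differ. The helper name hits_rule and the sample subjects are assumptions made up for this example.

```python
import re

# Same shape as the rule above: one of the four umlauted vowels (ä ë ö ü)
# with a word character on each side. Written with \u escapes so the
# pattern does not depend on how this file is encoded.
FOREIGN_CHAR = re.compile(r"\w[\u00e4\u00eb\u00f6\u00fc]\w")


def hits_rule(subject: str) -> bool:
    """Return True if the subject would trigger the pattern (hypothetical helper)."""
    return FOREIGN_CHAR.search(subject) is not None


if __name__ == "__main__":
    for subject in ("Cheap Viägra today", "Grüße aus Berlin", "Plain English subject"):
        print(subject, "->", hits_rule(subject))
```

The second sample is an ordinary German greeting and it matches too, which is why the rule only suits sites that expect English-only subjects.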

Bob Menschel


msz at astrouw

Feb 6, 2004, 7:44 AM

Post #4 of 6 (264 views)
Re: dealing with subjects forged with accented letters [In reply to]

On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
>
> I use the following (we get foreign email, but since we only understand
> English, we expect all subject headings to be in English):
>
> header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> ...

Hi Robert,

Unfortunately, a solution that simple will not work for me. We get emails in
Polish and occasionally also in Spanish or German (not to mention
English, of course, but those are no problem :), so we cannot just
spam them all. What we need is to filter the Subject lines (changing
all "äëöü" to "aeou") and *then* apply SA rules to them.
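
A minimal sketch of such a "tr"-like filter, written as standalone Python rather than as anything SpamAssassin provides; how to hook it in front of the SA checks is not shown. The function name fold_accents and the EXTRA_FOLDS table are assumptions for illustration. It decomposes each character with Unicode normalization and drops the combining marks, with a small hand-made table for letters that have no decomposition.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping.
# This table is an illustrative assumption, not a complete list.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ß": "ss", "ø": "o", "Ø": "O"})


def fold_accents(subject: str) -> str:
    """Return a copy of the subject with accented letters folded to base letters."""
    decomposed = unicodedata.normalize("NFKD", subject.translate(EXTRA_FOLDS))
    # After NFKD, an accented letter is a base letter followed by combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_accents("Chëap mëds önline"))  # Cheap meds online
```

Legitimate Polish, Spanish or German subjects get folded as well, so the folded copy is probably best used only as input to the matching rules while the original header is left untouched.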

Michal.

--
Michal Szymanski (msz [at] astrouw)
Warsaw University Observatory, Warszawa, POLAND


lwilton at earthlink

Feb 6, 2004, 7:58 AM

Post #5 of 6 (265 views)
Re: dealing with subjects forged with accented letters [In reply to]

Probably should also replace the obvious numeric and special characters like zer0, thr33, f|ve, $even, etc. while you are at it.

Loren

I have to wonder if it is worth the processor time, though. It might be faster to simply build a thesaurus of creative misspellings and analyze the sentence that results from the substitutions. I expect that is essentially what the Bayes stuff does.
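
As a rough sketch of the substitution idea, assuming a single fixed replacement per character: the table below covers the four examples from this post plus a few guesses, and both the table and the function name deobfuscate are made up for illustration.

```python
# One fixed replacement per look-alike character. The entries for
# "0", "3", "|" and "$" cover the examples above; the rest are guesses.
LOOKALIKES = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "$": "s",
    "@": "a",
    "|": "i",
})


def deobfuscate(text: str) -> str:
    """Lower-case the text and replace look-alike characters with letters."""
    return text.lower().translate(LOOKALIKES)


if __name__ == "__main__":
    print(deobfuscate("zer0, thr33, f|ve, $even"))  # zero, three, five, seven
```

A table like this also rewrites genuine digits and symbols (a real price or phone number in the subject, say), so the translated text is only useful as a second copy to match against.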


-----Original Message-----
From: Michal Szymanski <msz [at] astrouw>
Sent: Feb 6, 2004 6:44 AM
To: Robert Menschel <Robert [at] Menschel>
Cc: spamassassin-users [at] incubator
Subject: Re: dealing with subjects forged with accented letters

On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
>
> I use the following (we get foreign email, but since we only understand
> English, we expect all subject headings to be in English):
>
> header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> ...

Hi Robert,

Unfortunately, a solution that simple will not work for me. We get emails in
Polish and occasionally also in Spanish or German (not to mention
English, of course, but those are no problem :), so we cannot just
spam them all. What we need is to filter the Subject lines (changing
all "äëöü" to "aeou") and *then* apply SA rules to them.

Michal.

--
Michal Szymanski (msz [at] astrouw)
Warsaw University Observatory, Warszawa, POLAND


cmt-spamassassin at someone

Feb 6, 2004, 6:50 PM

Post #6 of 6 (267 views)
Re: dealing with subjects forged with accented letters [In reply to]

On Fri, 2004-02-06 at 08:58, Loren Wilton wrote:
> Probably should also replace the obvious numeric and special characters like zer0, thr33, f|ve, $even, etc. while you are at it.
<snip>
> On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
> >
> > I use the following (we get foreign email, but since we only understand
> > English, we expect all subject headings to be in English):
> >
> > header RM_sl_ForeignChar Subject =~ /\w[äëöü]\w/
> > ...
<snip>
> all "äëöü" to "aeou" and *then* apply SA rules to them.

If you're interested in doing these transformations, you might want to
have a look-see at CMOScript. I've been attacking this sort of problem
from the other side; not "translating" characters in advance, but
matching the untranslated word against a Regexp with the translations
inside it. The idea is similar though, and I have a list of
translations you might want to use as a starting point (eg: 'b' => ['b',
'8', '\\xDF']). The list is by no means authoritative, or complete, but
it should be a good place to start. Grab the obfu.pl from
http://sandgnat.com/cmos/.
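
To make the "translations inside the regexp" idea concrete, here is a minimal Python sketch. It is not the code from obfu.pl or CMOScript: only the 'b' entry comes from the list mentioned above, and the other table entries and the function name obfu_pattern are assumptions for illustration.

```python
import re

# Tiny illustrative table. Only the "b" entry is taken from the post;
# the other entries are assumptions for this example.
OBFU = {
    "b": ["b", "8", "\xdf"],
    "i": ["i", "1", "|", "!"],
    "l": ["l", "1", "|"],
    "m": ["m", "rn"],  # a variant may be longer than one character
}


def obfu_pattern(word: str) -> "re.Pattern[str]":
    """Build a regex matching the word with any listed variant for each letter."""
    groups = []
    for ch in word.lower():
        variants = OBFU.get(ch, [ch])
        groups.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile("".join(groups), re.IGNORECASE)


if __name__ == "__main__":
    print(bool(obfu_pattern("bill").search("pay your 8!|| now")))  # True
    print(bool(obfu_pattern("limb").search("1irn8")))              # True
```

Each letter becomes an alternation group, so the pattern grows with every variant in the table.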

Also, I haven't done much to update CMOScript lately, but my plan has
been to move towards a pre-translator methodology once the SA 2.70/3.0
plugins interface is released. Pre-transforming should help reduce
processing time (CMOScript regexps are HUGE) and should allow for more
re-use.

There are disadvantages to the pre-translate method, however. One such
example is the character "|" which could be either an obfu "I" or an
obfu "L". How would you choose to translate that character? The same
goes for "*", "I", "l". Another possible disadvantage is that it's not
as easy to translate obfu character sequences such as: "m" => "rn" or
"N" => "|\|". I haven't yet come up with a good way to do
pre-transformation and still match these obfu types in a clean manner.
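
One possible way around the ambiguity, sketched here only as an idea and not as something CMOScript does, is to have the pre-translator emit every candidate reading and test each one. The "|" entry follows the example above; the "1" entry and the function name candidates are assumptions.

```python
from itertools import product

# Characters with more than one plausible reading. The "|" entry follows
# the example above; the "1" entry is an assumption.
AMBIGUOUS = {
    "|": ["i", "l"],
    "1": ["i", "l"],
}


def candidates(token: str) -> list:
    """Return every possible translation of the token, one per combination."""
    options = [AMBIGUOUS.get(ch, [ch]) for ch in token.lower()]
    return ["".join(combo) for combo in product(*options)]


if __name__ == "__main__":
    print(candidates("p|l|s"))  # ['pilis', 'pills', 'pllis', 'pllls']
```

The number of candidates doubles with each ambiguous character, so this only stays cheap for short tokens, and it does nothing for multi-character sequences such as "rn" for "m".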

OK this was probably way off-topic and more discussion than you were
looking for. Oh well.

--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
