Mailing List Archive: SpamAssassin: devel

[Bug 2908] New: Use bayes translation to decrease effectiveness of intentional misspellings

 

 

bugzilla-daemon at bugzilla

Jan 8, 2004, 4:35 AM

http://bugzilla.spamassassin.org/show_bug.cgi?id=2908

Summary: Use bayes translation to decrease effectiveness of
intentional misspellings
Product: Spamassassin
Version: 2.61
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P5
Component: spamassassin
AssignedTo: spamassassin-dev [at] incubator
ReportedBy: cmt-spamassassin [at] someone


The latest crop of spam I receive contains deliberate misspellings of
spam-sign words such as generic, viagra, paris, and hilton. Simple examples of
the permutations I receive are geenric, vvvaigraa, ppariis, and hilllton. To
counteract this, I have written a simple modification to sub tokenize_line in
Bayes.pm.

Pseudocode:

For each non-header token:
  Strip the sk: prefix from the token if it was added previously
  Remove all non-alpha characters
  Force the token to lowercase (I have no idea if this is a good idea)
  Sort the characters in the string (bananas => aaabnns)
  Prepend sk: to the string if we stripped it
  Add the new token to the bayes token list
  Strip any repeated characters (aaabnns => abns)
  Add the new token to the bayes token list
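
The steps above can be sketched in Python roughly as follows. This is only an
illustration, not the actual Perl change to tokenize_line in Bayes.pm: the
function name translate_token is hypothetical, "alpha" is assumed to mean ASCII
letters, and the sk: prefix is assumed to be restored on both derived tokens.

```python
import re

SK_PREFIX = "sk:"  # prefix named in the pseudocode


def translate_token(token):
    """Return (sorted_token, deduped_token) for one body token, or None.

    Hypothetical helper; assumes "alpha" means ASCII letters a-z.
    """
    # Strip the sk: prefix if present and remember to restore it later.
    had_prefix = token.startswith(SK_PREFIX)
    if had_prefix:
        token = token[len(SK_PREFIX):]

    # Lowercase, then drop everything that is not a letter.
    letters = re.sub(r"[^a-z]", "", token.lower())
    if not letters:
        return None  # nothing left to translate (e.g. a purely numeric token)

    # Sort the characters: bananas => aaabnns
    sorted_chars = "".join(sorted(letters))
    # Strip repeated characters: aaabnns => abns
    deduped = "".join(sorted(set(letters)))

    prefix = SK_PREFIX if had_prefix else ""
    return prefix + sorted_chars, prefix + deduped


if __name__ == "__main__":
    for word in ["viagra", "vvvaigraa", "hilton", "hilllton"]:
        print(word, translate_token(word))
```

Both derived tokens would then be added to the Bayes token list. A correctly
spelled word and a misspelling that uses the same set of letters end up sharing
the second, de-duplicated token.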

With this change, the words translate as follows:

generic, viagra, paris, hilton
debug: BAYES TRANSLATE: generic: ceeginr, ceginr
debug: BAYES TRANSLATE: viagra: aagirv, agirv
debug: BAYES TRANSLATE: paris: aiprs, aiprs
debug: BAYES TRANSLATE: hilton: hilnot, hilnot

geenric vvvaigraa ppariis hilllton
debug: BAYES TRANSLATE: geenric: ceeginr, ceginr
debug: BAYES TRANSLATE: vvvaigraa: aaagirvvv, agirv
debug: BAYES TRANSLATE: ppariis: aiipprs, aiprs
debug: BAYES TRANSLATE: hilllton: hilllnot, hilnot

In my bayes database, agirv, aiprs, and hilnot all score very high. ceginr
scores neutrally.




marc at perkel

Jan 8, 2004, 7:06 AM

Re: [Bug 2908] New: Use bayes translation to decrease effectiveness of intentional misspellings

Here's something I'm doing to catch misspellings.

I have a list of about 100 words that are commonly misspelled on purpose. I
first remove all the words that are correctly spelled according to this
list. Then I translate look-alike characters (@ to a, 0 to o, 1 to i, etc.)
and remove all punctuation and space characters. Finally, I check for the
listed words again after spell correcting them, and if there's a match, it's
spam.
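
A rough Python sketch of that pipeline, under several assumptions: the
two-entry WATCH_WORDS set stands in for the real list of about 100 words, the
character map covers only the three substitutions mentioned, and the
spell-correcting step is left out because the post does not describe it. The
function name looks_obfuscated is hypothetical.

```python
import re

# Hypothetical stand-in for the list of about 100 words.
WATCH_WORDS = {"viagra", "generic"}

# Only the three look-alike substitutions named in the post.
LOOKALIKES = str.maketrans({"@": "a", "0": "o", "1": "i"})


def looks_obfuscated(text, words=WATCH_WORDS):
    """Return True if a watched word appears only in disguised form."""
    lowered = text.lower()

    # 1. Remove occurrences that are already spelled correctly.
    for word in words:
        lowered = re.sub(r"\b" + re.escape(word) + r"\b", " ", lowered)

    # 2. Translate look-alike characters.
    mapped = lowered.translate(LOOKALIKES)

    # 3. Remove punctuation and whitespace.
    squeezed = re.sub(r"[^a-z0-9]", "", mapped)

    # 4. Check for the watched words again.
    return any(word in squeezed for word in words)


if __name__ == "__main__":
    print(looks_obfuscated("buy v.1.@.g.r.a now"))
    print(looks_obfuscated("viagra is a drug"))
```

Because step 3 joins neighbouring words together, a plain substring check like
this can also match across word boundaries, so a real filter would need to
weigh that risk of false positives.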
