[spambayes-dev] NEWTRICKS

Skip Montanaro skip at pobox.com
Fri Dec 26 13:36:09 EST 2003


    Dave> I keep getting quite a few spams which fit the descriptions below
    Dave> (from NEWTRICKS.txt):

    Dave>   - Punctuation sometimes gets inserted in otherwise spammy words
    Dave>     or phrases, e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X
    Dave>     -emgffj".  It might be helpful to try stripping punctuation.
    Dave>     (Idea from Paul Sorenson)

    Dave>   - Similarly, some letters get replaced by numbers, e.g.:
    Dave>     "V1agra" instead of "Viagra".  Mapping numbers to suitable
    Dave>     letters might help in some situations.

    Dave> Since "this file is for ideas that have or have not yet been
    Dave> tried", I'd love to know what constitutes "trying".  Is there some
    Dave> official testing procedure or corpus we can test against?  I'd
    Dave> like to know whether any change I make is worth proposing.  Of
    Dave> course I can try it on my own databases of Ham and Spam first...

I tried the first idea (eliding punctuation from words).  In testing it
turned out not to be all that useful, I think for a couple of reasons:

    * Such messages contain plenty of other spammy clues, enough to kick
      them into spam range.  Most of this stuff winds up scoring at 0.95 or
      above for me.  If these messages don't score as spam for you, train on
      a few and see how they score then.

    * Training databases full of old-ish mail won't contain many messages
      of this sort, so enabling punctuation removal won't change the test
      results very much.
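
For anyone who wants to experiment with either idea on their own ham and
spam, here is a minimal sketch of the two normalizations.  It is standalone
Python, not code from the SpamBayes tokenizer: the function names and the
digit-to-letter table are assumptions made up for illustration, and the table
in particular would need tuning against real mail.

```python
import string

# Hypothetical helpers for experimentation; not part of SpamBayes.

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Assumed digit-to-letter substitutions (illustrative, not exhaustive).
_DIGIT_TABLE = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"}
)


def strip_punctuation(word):
    """Remove punctuation embedded in a word, e.g. 'sel)ection'."""
    return word.translate(_PUNCT_TABLE)


def map_digits(word):
    """Map digits to look-alike letters, but only in words that mix
    letters and digits, so plain numbers such as '2003' are left alone."""
    has_alpha = any(c.isalpha() for c in word)
    has_digit = any(c.isdigit() for c in word)
    if has_alpha and has_digit:
        return word.translate(_DIGIT_TABLE)
    return word


def extra_tokens(text):
    """Yield a lowercased, normalized form for each whitespace-separated
    word that the normalization actually changed."""
    for word in text.split():
        cleaned = map_digits(strip_punctuation(word))
        if cleaned and cleaned != word:
            yield cleaned.lower()
```

One way to use it would be to feed the output of extra_tokens() to the
classifier in addition to the usual tokens rather than in place of them, so
that the obfuscated spelling itself remains available as a clue.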

Skip


