[spambayes-dev] Several new tokenizing gimmicks checked in

skip at pobox.com
Sun Aug 6 19:25:47 CEST 2006


With the current crop of pump & dump spams I decided to break down and
actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would
help.  From a readability standpoint it does a miserable job of extracting
text from an image, but SpamBayes seems to love what it does generate.  This
morning I thought, "what the hell", and checked in all the current new
tricks I've been working on/with:

    * IP address lookup and more extensive tokenization.  This is from Matt
      Cowles.  I added persistence beyond the current run.  Unfortunately,
      the dbm persistence is untested (though it should probably work
      okay), while the zodb persistence still has problems (it writes the
      file the first time, but doesn't update it on successive runs).
      Maybe someone can look at those issues.  This seems to work very well
      for those spams where the only useful clue is a URL whose domain name
      changes each time.  As far as I can tell, they pretty much all point
      to the same IP address.  Enabled using the x-lookup_ip and
      lookup_ip_cache options.  Requires installation of PyDNS.

    * Note image size.  This was my first stab at trying to get some
      information out of an image.  Seems to work pretty well.  Enabled
      using the x-image_size option.

    * Note short runs of too-short words.  Text spammers (as opposed to
      image spammers) seem to like to use this technique:

          X j A m N j A d X h
          M k E z R d I p D u I m A c
          C o I d A t L j I v S j
 
      to hide their tokens from spam filters.  Enabled using the
      x-short_runs option.  Based on my current database, I'm skeptical
      this will add much to what we already have.

    * Try OCR on images.  The latest technique we've all encountered seems
      to be the pump and dump stock scams where the entire come-on is
      embedded in one or more GIF images.  I wrote a small ImageStripper
      module which handles these.  It grabs the image parts, converts them
      to netpbm format, concatenates them left-to-right, then submits the
      result to ocrad.  This is just a proof-of-concept.  It requires ocrad
      and netpbm to be available, so I suspect it currently runs only on
      Unix-like systems.  Enabled using the x-crack_images and
      max_image_size options.
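
To make the IP lookup idea concrete, here is a rough sketch, not the
checked-in code.  It uses the standard library's socket module instead of
PyDNS, and the names resolve_ips and ip_tokens, the "url-ip:" token format,
and the plain dict cache are all assumptions made for illustration.  The
point is only that two URLs with different throwaway domain names produce
the same tokens when they resolve to the same address.

```python
import socket
from urllib.parse import urlparse


def resolve_ips(host):
    """Return the IPv4 addresses for host, or an empty list on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def ip_tokens(url, resolve=resolve_ips, cache=None):
    """Yield tokens for the addresses a URL's host resolves to.

    cache is any dict-like object mapping host -> list of addresses.
    A dbm- or shelve-backed mapping would make it persist across runs.
    """
    host = urlparse(url).hostname
    if not host:
        return
    if cache is None:
        cache = {}
    if host not in cache:
        cache[host] = resolve(host)
    for ip in cache[host]:
        quads = ip.split(".")
        # One token per prefix length, so nearby addresses share clues.
        for n in range(1, 5):
            yield "url-ip:%s/%d" % (".".join(quads[:n]), 8 * n)
```

Passing a fake resolve function makes it easy to try this without touching
the network, and passing the same cache object to successive calls avoids
repeating lookups for a host already seen.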

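The short-run check from the list above can be sketched in a few lines.
This is an illustration under assumptions, not the code I checked in: the
function names, the thresholds (words of at most two characters, runs of at
least three words), and the "short-run:" token format are all made up here.

```python
def short_run_lengths(text, max_len=2, min_run=3):
    """Return the lengths of runs of consecutive too-short words."""
    runs = []
    run = 0
    for word in text.split():
        if len(word) <= max_len:
            run += 1
        else:
            if run >= min_run:
                runs.append(run)
            run = 0
    # Flush a run that reaches the end of the text.
    if run >= min_run:
        runs.append(run)
    return runs


def short_run_tokens(text):
    """Turn each run into a token, bucketed by power of two."""
    # n.bit_length() - 1 is floor(log2(n)) for n >= 1, so runs of
    # 8 through 15 words all map to the same token.
    return ["short-run:2**%d" % (n.bit_length() - 1)
            for n in short_run_lengths(text)]
```

Fed one line at a time, the first sample line above is a single run of ten
one-letter words.  Bucketing the run length keeps the number of distinct
tokens small, so the classifier sees the same clue for runs of similar size.
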
I added these extensions using multiple checkins, so if we decide to back
one or more of them out it shouldn't be a major PITA.

Skip

