[spambayes-dev] Several new tokenizing gimmicks checked in
skip at pobox.com
Sun Aug 6 19:25:47 CEST 2006
With the current crop of pump & dump spams, I decided to break down and
actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would
help. It does a miserable job of extracting readable text from an image,
but SpamBayes seems to love what it does generate. This morning I thought,
"what the hell", and checked in all the new tricks I've been working
on/with:
* IP address lookup and more extensive tokenization. This is from Matt
Cowles. I added a lookup cache that persists beyond the current run.
Unfortunately, the dbm persistence is untested (though it should
probably work okay), while the zodb persistence still has problems (it
writes the file the first time, but doesn't update it on successive
runs). Maybe someone can look at those issues. This seems to work very
well for those spams where the only useful clue is a URL whose domain
name changes each time. As far as I can tell, those domains pretty much
all resolve to the same IP address. Enabled using the x-lookup_ip and
lookup_ip_cache options. Requires installation of PyDNS.
* Note image size. This was my first stab at trying to get some
information out of an image. Seems to work pretty well. Enabled
using the x-image_size option.
* Note short runs of too-short words. Text spammers (as opposed to
image spammers) seem to like to use this technique to hide their tokens
from spam filters:

    X j A m N j A d X h
    M k E z R d I p D u I m A c
    C o I d A t L j I v S j

Enabled using the x-short_runs option. Based on my current database I'm
skeptical this will add much over what we already have.
* Try OCR on images. The latest technique we've all encountered seems
to be the pump and dump stock scams where the entire come-on is
embedded in one or more GIF images. I wrote a small ImageStripper
module which handles these. It grabs the image parts, converts them
to netpbm format, concatenates them left-to-right, then submits the
result to ocrad. This is just a proof-of-concept. It requires ocrad
and netpbm to be available, so I suspect it will currently run only
on Unix-like systems. Enabled using the x-crack_images and
max_image_size options.
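
To illustrate the IP lookup idea, here is a minimal sketch, not the checked-in code. It assumes the standard library's socket module in place of PyDNS, keeps its cache in memory only (no dbm or zodb persistence), and the names CachingResolver, ip_tokens and the "url-ip:" token prefix are made up for the example.

```python
import socket


class CachingResolver:
    """Resolve hostnames to IPv4 addresses, remembering earlier answers."""

    def __init__(self, resolve=None):
        # 'resolve' may be swapped out (for tests, or for a PyDNS-based lookup).
        self._resolve = resolve or self._system_resolve
        self.cache = {}

    @staticmethod
    def _system_resolve(host):
        # Stand-in for a real DNS query; returns [] when the lookup fails.
        try:
            infos = socket.getaddrinfo(host, None, socket.AF_INET)
        except OSError:
            return []
        return sorted({info[4][0] for info in infos})

    def lookup(self, host):
        host = host.lower()
        if host not in self.cache:
            self.cache[host] = self._resolve(host)
        return self.cache[host]


def ip_tokens(host, resolver):
    """Yield one token per address prefix, from the /8 down to the full /32."""
    for ip in resolver.lookup(host):
        octets = ip.split(".")
        for n in range(1, 5):
            yield "url-ip:%s/%d" % (".".join(octets[:n]), n * 8)
```

With a resolver that returns 192.0.2.7, ip_tokens yields url-ip:192/8, url-ip:192.0/16, url-ip:192.0.2/24 and url-ip:192.0.2.7/32, so two throwaway domains on the same host share all four tokens.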
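
The short-runs trick can be sketched roughly as below. This is an illustration under my own assumptions, not the tokenizer code: a "too-short" word is taken to be a single character, a run needs at least three of them in a row, and the "short:" token name is hypothetical.

```python
def short_run_lengths(text, max_word_len=1, min_run=3):
    """Return the lengths of consecutive runs of very short words."""
    runs = []
    current = 0
    for word in text.split():
        if len(word) <= max_word_len:
            current += 1
        else:
            if current >= min_run:
                runs.append(current)
            current = 0
    if current >= min_run:  # a run that reaches the end of the text
        runs.append(current)
    return runs


def short_run_tokens(text):
    """Yield one token per distinct run length found in the text."""
    for length in sorted(set(short_run_lengths(text))):
        yield "short:%d" % length
```

For the first sample line above, "X j A m N j A d X h", this produces the single token short:10, while ordinary prose with the odd "a" or "I" produces nothing.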
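
The OCR pipeline described above might look something like this in outline. It is a sketch, not the ImageStripper module itself: it assumes the giftopnm, pnmcat and ocrad programs are on the PATH, treats max_image_size as a byte limit per image (with an arbitrary default), and the function names are invented for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

REQUIRED_TOOLS = ("giftopnm", "pnmcat", "ocrad")


def tools_available():
    """True only if every external program the pipeline needs is installed."""
    return all(shutil.which(tool) for tool in REQUIRED_TOOLS)


def ocr_gif_parts(gif_blobs, max_image_size=100000):
    """Return OCR text for a list of GIF byte strings, or "" on any failure."""
    blobs = [b for b in gif_blobs if len(b) <= max_image_size]
    if not blobs or not tools_available():
        return ""
    with tempfile.TemporaryDirectory() as tmp:
        pnm_paths = []
        for i, blob in enumerate(blobs):
            # giftopnm reads the GIF on stdin and writes netpbm data to stdout.
            conv = subprocess.run(["giftopnm"], input=blob, capture_output=True)
            if conv.returncode != 0:
                continue  # skip images that fail to convert
            path = Path(tmp) / ("part%d.pnm" % i)
            path.write_bytes(conv.stdout)
            pnm_paths.append(str(path))
        if not pnm_paths:
            return ""
        # Concatenate the converted images left-to-right into one image.
        joined = subprocess.run(["pnmcat", "-lr", *pnm_paths], capture_output=True)
        if joined.returncode != 0:
            return ""
        # ocrad reads the combined image on stdin and prints the text it finds.
        ocr = subprocess.run(["ocrad"], input=joined.stdout, capture_output=True)
        return ocr.stdout.decode("latin-1", "replace")
```

The returned text would then be fed through the normal word tokenizer. On a system without ocrad or netpbm the function simply returns an empty string.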
I added these extensions using multiple checkins, so if we decide to back
one or more of them out it shouldn't be a major PITA.
Skip