[Spambayes] Several new tokenizing gimmicks checked in
Eric Johnson
ejohnson at imagewireless.ca
Sun Aug 6 21:15:18 CEST 2006
Please forgive the obviously dumb question...
How do we get these new things to try them out?
Thanks,
Eric
-----Original Message-----
From: spambayes-bounces+ejohnson=imagewireless.ca at python.org
[mailto:spambayes-bounces+ejohnson=imagewireless.ca at python.org]On Behalf Of
skip at pobox.com
Sent: August 6, 2006 11:26 AM
To: spambayes at python.org; spambayes-dev at python.org
Subject: [Spambayes] Several new tokenizing gimmicks checked in
With the current crop of pump & dump spams I decided to break down and
actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would
help. It does a miserable job from a readability standpoint at extracting
text from an image, but SpamBayes seems to love what it does generate. This
morning I thought, "what the hell", and checked in all the current new
tricks I've been working on/with:

    * IP address lookup and more extensive tokenization.  This is from Matt
      Cowles.  I added persistence beyond the current run.  Unfortunately,
      the dbm persistence is untested (though should probably work okay)
      while the zodb persistence still has problems (writes the file the
      first time, but doesn't update it on successive runs).  Maybe someone
      can look at those issues.  This seems to work very well for those
      spams where the only useful clue is a URL, but with a domain name that
      changes each time.  They seem to pretty much all point to the same IP
      address as far as I can tell.  Enabled using the x-lookup_ip and
      lookup_ip_cache options.  Requires installation of PyDNS.

    * Note image size.  This was my first stab at trying to get some
      information out of an image.  Seems to work pretty well.  Enabled
      using the x-image_size option.

    * Note short runs of too-short words.  Text spammers (as opposed to
      image spammers) seem to like to use this technique:

          X j A m N j A d X h
          M k E z R d I p D u I m A c
          C o I d A t L j I v S j

      to hide their tokens from spam filters.  Enabled using the
      x-short_runs option.  Based on my current database I'm skeptical this
      will add much over what else we already have.

    * Try OCR on images.  The latest technique we've all encountered seems
      to be the pump and dump stock scams where the entire come-on is
      embedded in one or more GIF images.  I wrote a small ImageStripper
      module which handles these.  It grabs the image parts, converts them
      to netpbm format, concatenates them left-to-right, then submits the
      result to ocrad.  This is just a proof-of-concept.  It requires ocrad
      and netpbm to be available.  As such I suspect it will only run
      currently on Unix-like systems.  Enabled using the x-crack_images and
      max_image_size options.

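
To make the "short runs" item above a bit more concrete, here is a minimal sketch of how such a token could be generated. It is not the code that was checked in: the function name short_run_tokens, the min_word_len threshold of 3, the "short:N" token format, and the power-of-two bucketing are all assumptions made for illustration.

```python
import math


def short_run_tokens(text, min_word_len=3):
    """Yield at most one token describing the longest run of short words.

    Hypothetical helper: the name, threshold, and token format are
    illustrative assumptions, not the checked-in SpamBayes code.
    """
    longest = 0
    current = 0
    for word in text.split():
        if len(word) < min_word_len:
            # Extend the current run of too-short words.
            current += 1
            longest = max(longest, current)
        else:
            # A normal-length word ends the run.
            current = 0
    if longest:
        # Bucket by power of two so nearby run lengths share one token.
        yield "short:%d" % int(math.log2(longest))


if __name__ == "__main__":
    print(list(short_run_tokens("X j A m N j A d X h")))
    print(list(short_run_tokens("plain ordinary words")))
```

Bucketing keeps the number of distinct tokens small, so the classifier sees enough examples of each one to learn whether long runs of one- or two-letter words correlate with spam in a given database.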
I added these extensions using multiple checkins, so if we decide to back
one or more of them out it shouldn't be a major PITA.
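
As a rough illustration of the IP lookup idea in the first item of the list above, the sketch below resolves a URL's host name and emits tokens for the address and its enclosing prefixes, so that throwaway domains pointing at the same address share tokens. It is only a sketch under stated assumptions: it uses the standard library's socket.gethostbyname instead of PyDNS, the cache is a plain in-memory dict rather than the dbm or zodb persistence described above, and the function name ip_tokens and the "url-ip:" token format are hypothetical.

```python
import socket


def ip_tokens(hostname, resolve=socket.gethostbyname, cache=None):
    """Return tokens for a host's IPv4 address and its /8, /16, /24 prefixes.

    Hypothetical helper: `resolve` is injectable so the lookup can be
    replaced (for example by a PyDNS-based resolver or a test stub).
    """
    if cache is None:
        cache = {}
    if hostname not in cache:
        try:
            cache[hostname] = resolve(hostname)
        except OSError:
            # Lookup failed (socket.gaierror is an OSError); remember that.
            cache[hostname] = None
    ip = cache[hostname]
    if ip is None:
        return []
    octets = ip.split(".")
    # One token per prefix length: a/8, a.b/16, a.b.c/24, a.b.c.d/32.
    return ["url-ip:%s/%d" % (".".join(octets[:n]), 8 * n)
            for n in range(1, 5)]


if __name__ == "__main__":
    # Stub resolver so the example runs without network access.
    print(ip_tokens("changing-domain.example", resolve=lambda h: "192.0.2.7"))
```

Passing the same dict as cache across messages avoids repeating lookups within a run; persisting it between runs is the part the message above describes as still needing work.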
Skip
_______________________________________________
SpamBayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html