[Spambayes] Latest image spam/OCR update

skip at pobox.com
Thu Aug 10 07:00:04 CEST 2006


I just checked in a couple of significant changes to the OCR stuff.  First, I
added support for conversion of input images using PIL.  That means netpbm
is no longer required.  PIL is faster and more robust than netpbm, and is
platform-independent.  Perhaps someone in Windows-land can take the time to
see if it's possible to build ocrad on Windows.  We could then (in theory,
at least) distribute an ocrad installer alongside the SpamBayes Windows
installer and perform crude, but apparently effective, OCR analysis of
image-based spam.  The second change to the OCR code was the addition of a
simple pickled cache file (controlled by the "crack_image_cache" option).
The conversion to netpbm format is still required; however, the ocrad step is
skipped if the md5 hexdigest of the generated image is present in the cache.
In this case any cached text and tokens are returned.
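
For anyone who wants the gist of the cache without reading the checkin, here
is a rough sketch in present-day Python.  It is not the SpamBayes code: the
class name OCRCache, the run_ocr callable (a stand-in for invoking ocrad) and
the whitespace tokenizing are all hypothetical, made up for illustration.
Only the idea comes from the description above: key a pickled dictionary on
the md5 hexdigest of the converted image and skip OCR on a hit.

```python
import hashlib
import os
import pickle


class OCRCache:
    """Hypothetical pickled cache mapping image digests to (text, tokens)."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        # Load an earlier cache if one exists.  Only unpickle files you
        # wrote yourself; pickle is not safe on untrusted data.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.entries = pickle.load(f)

    def lookup(self, pnm_bytes, run_ocr):
        """Return (text, tokens), calling run_ocr only on a cache miss."""
        key = hashlib.md5(pnm_bytes).hexdigest()
        if key not in self.entries:
            text = run_ocr(pnm_bytes)  # the expensive step (assumed: ocrad)
            tokens = sorted(set(text.split()))  # placeholder tokenizer
            self.entries[key] = (text, tokens)
        return self.entries[key]

    def save(self):
        """Write the cache back to disk as a pickle."""
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)
```

Note that the image still has to be converted before the lookup, since the
digest is taken over the converted bytes; only the OCR step is saved.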

I have no Windows capability, so someone else will have to take the steps
necessary to make this all play on Windows.

There are a few other things that need testing, but I'm out of time.  First,
I arbitrarily set an upper limit of 100kbytes on input images (per image
before converting to netpbm).  I think that allows all images that would
hold spam content, but I'm not sure I have many images in my training
database besides spam.  I don't know if that's a useful cutoff or if there
should even be a cutoff.  Second, I observed that ocrad routinely seemed to
get the letter case wrong (e.g. coming up with "EGLy" instead of "EGLY"), so
I blindly downshift its output.  I have nothing other than that simple
observation to suggest that should be done.  Third, if other people have
training databases, running N-fold cross-validation tests of these new
gimmicks would be beneficial.  It would be nice if others could verify my
results before a new release is made.  Finally, if you're a Python
programmer (or aspire to be one), picking through the new code would be a
good check.
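
To make the first two items concrete for anyone testing them, they boil down
to roughly the following.  This is a sketch, not the checked-in code: the
names MAX_IMAGE_BYTES, should_ocr and normalize_ocr_text are hypothetical,
and reading "100kbytes" as 100 * 1024 bytes is an assumption.

```python
# Assumed interpretation of the 100kbyte limit; the real value may differ.
MAX_IMAGE_BYTES = 100 * 1024


def should_ocr(image_bytes, max_bytes=MAX_IMAGE_BYTES):
    """Return True if the raw (pre-conversion) image is small enough to OCR."""
    return len(image_bytes) <= max_bytes


def normalize_ocr_text(text):
    """Downshift OCR output so case errors like "EGLy" stop mattering."""
    return text.lower()
```

Anyone running cross-validation could try a few values for the limit, or
turn the lowercasing off, and see whether the results change.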

Too bad the summer's nearly over.  We could use a Summer of Code intern...

Skip

