[Spambayes] Latest CVS update, Ocrad for Windows
skip at pobox.com
skip at pobox.com
Mon Aug 14 05:37:20 CEST 2006
I updated the OCR capabilities a bit more today. I added more intelligent
assembly of split images into a single image after noticing that the
spammers don't simply chop up multi-part GIF images horizontally. I also
added a couple extra options (ocrad_scale and ocrad_charset) which control
the image scaling factor (default is 2) and character set (default is
"ascii") Ocrad uses. Scaling the image by a factor of 2 was a pretty
obvious win:
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
4.213 4.213 tied
1.404 0.843 won -39.96%
3.371 2.809 won -16.67%
2.528 2.247 won -11.12%
4.213 3.652 won -13.32%
won 4 times
tied 1 times
lost 0 times
total unique fn went from 56 to 49 won -12.50%
mean fn % went from 3.14606741573 to 2.75280898876 won -12.50%
Scaling by a factor of three was even better in the false negative
department but regressed a bit in the false positive category so I checked
Options.py in with a default scaling factor of 2. A couple things could
stand to be further tested:
* I have no idea how good Ocrad's scaling algorithm is. It's possible
that PIL or NetPBM's scaling code is better. If so, it would make
sense to scale the images before feeding to Ocrad.
* The images I've see so far were all plain English, so I blindly made
ascii the default charset. The other choices were iso-8859-9 and
iso-8859-15. I simply assumed ascii would be the most appropriate
default, but didn't test it.
Finally, I put together a really simpleminded Ocrad-for-Windows release
based upon the ocrad.exe binary that Tony built. Check the Files section of
the SpamBayes project site:
http://sourceforge.net/project/showfiles.php?group_id=61702
and grab ocrad-cygwin.
There are a few caveats:
1. I don't do Windows. (No, really, I don't, strange as that may seem.)
This is no fancy-schmancy point-and-shoot Windows installer. It's
just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe
file and the patch he applied to the source.
2. I don't do Windows. The code I've written so far has been done
entirely on my Mac. I've made no obvious concessions to portability.
That said, I hope portability issues won't be daunting for any early
adopters.
3. I don't do Windows. If you have problems it won't do you any good to
mail me directly. Post about problems on the SpamBayes bug tracker:
http://sourceforge.net/tracker/?group_id=61702&atid=498103
4. If you do Windows you will need PIL to take advantage of the recent
changes:
http://www.pythonware.com/products/pil/
(unless you want to put hair on your chest and build NetPBM on
Windows). Fredrik Lundh provides prebuilt Windows versions of PIL.
Grab the one appropriate for the version of Python you have
installed.
5. If you do Windows (or any other platform for that matter), feedback
to the lists about successes and failures would be helpful.
Cheers,
Skip
More information about the SpamBayes
mailing list