[spambayes-dev] Maybe a little OCR would help...
skip at pobox.com
Fri Aug 4 17:20:36 CEST 2006
This is just one simple little test...
I took two pump & dump messages for HLVK I received overnight. The GIF
image is actually sliced into pieces horizontally, so I wrote a little shell
script to convert the images to netpbm and concatenate them, then sent the
result through ocrad, sorted, uniq'd and downshifted the whole mess, then
checked for words the two had in common. I came up with:
_
__
and
co
company
hlv
hlvc
lnc.
low
new
news
nlv
now!
now!!!
on
the
tnis
wl_
|_
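
For anyone who wants to repeat the comparison, the last steps (downshift,
uniq, find the words in common) might look roughly like the Python below.
This is only a sketch, not the shell script described above: it assumes the
ocrad output for each message has already been captured as a string, and the
function names and sample strings are made up for illustration.

```python
def tokens(ocr_text):
    """Lowercase raw OCR text and return its unique whitespace-separated tokens."""
    return set(ocr_text.lower().split())


def common_tokens(*ocr_texts):
    """Return, in sorted order, the tokens present in every OCR'd message."""
    token_sets = [tokens(text) for text in ocr_texts]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


if __name__ == "__main__":
    # Hypothetical OCR output, not the real text of the two HLVK messages.
    first = "HLVK Company NEWS now!"
    second = "hlvc company news NOW! low"
    print(common_tokens(first, second))
```

Splitting on whitespace leaves the punctuation attached, which is how tokens
such as "now!!!" and "lnc." survive in the list above.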
While that is not a huge increase in the number of tokens and some aren't
going to help, it's still better than what we have today. Time will tell if
the cost is worth it. Perhaps if we generate some further interest in ocrad
it will improve as well.
Skip