[spambayes-dev] Maybe a little OCR would help...

skip at pobox.com
Fri Aug 4 17:20:36 CEST 2006


This is just one simple little test...

I took two pump & dump messages for HLVK I received overnight.  The GIF
image in each is actually sliced into pieces horizontally, so I wrote a
little shell script to convert the slices to netpbm format and concatenate
them, then sent the result through ocrad, sorted, uniq'd and lowercased the
whole mess, then checked for words the two had in common.  I came up with:
    _
    __
    and
    co
    company
    hlv
    hlvc
    lnc.
    low
    new
    news
    nlv
    now!
    now!!!
    on
    the
    tnis
    wl_
    |_

While that is not a huge increase in the number of tokens and some aren't
going to help, it's still better than what we have today.  Time will tell if
the cost is worth it.  Perhaps if we generate some further interest in ocrad
it will improve as well.
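
For anyone who wants to repeat the comparison step, here is a minimal
Python sketch of it.  It assumes the ocrad output for each message has
already been captured as a string; the function names are made up for
illustration and are not part of spambayes or of the shell script
mentioned above.

```python
def ocr_tokens(text):
    """Lowercase raw OCR output and return its unique whitespace-separated tokens."""
    # A set gives the same effect as "sort | uniq" for membership purposes.
    return set(text.lower().split())


def common_tokens(text_a, text_b):
    """Return the sorted tokens that appear in both OCR outputs."""
    return sorted(ocr_tokens(text_a) & ocr_tokens(text_b))


if __name__ == "__main__":
    # Made-up sample strings standing in for real ocrad output.
    first = "HLVC lnc. NEWS now!!! tnis company"
    second = "hlvc Lnc. news NOW!!! the company"
    for token in common_tokens(first, second):
        print(token)
```

Punctuation is deliberately left attached to the tokens, since oddities
like "now!!!" and "lnc." are exactly the sort of thing that showed up in
both messages.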

Skip

