[Spambayes] Image spam

skip at pobox.com
Mon Jun 12 01:07:58 CEST 2006


    >> I found an interesting program that might be exactly what you are
    >> looking for: ocrad. This is GNU software, can accept pbm files or
    >> standard input, and outputs text to standard output. So this is a
    >> commandline ocr program that can be used in a script. Don't worry
    >> about the pbm files, the ocrad manual describes how to convert other
    >> image formats to pbm (jpeg, png, ps, pdf,...)
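
For reference, the pipeline described in the quote (convert the image, then
feed it to ocrad on standard input) could be scripted roughly as follows.
This is only a sketch under assumptions: the function name ocr_image and its
parameters are made up for illustration, giftopnm and ocrad are assumed to be
on the PATH, and the encoding of ocrad's output is a guess.

```python
import subprocess


def ocr_image(image_path, convert_cmd=("giftopnm",), ocr_cmd=("ocrad",)):
    """Convert an image to PNM, pipe it into the OCR program, return its text.

    Raises subprocess.CalledProcessError if either step exits non-zero
    (for example, when the converter rejects the image).
    """
    # Step 1: the converter writes the PNM image to its standard output.
    converted = subprocess.run(
        [*convert_cmd, image_path], capture_output=True, check=True
    )
    # Step 2: the OCR program reads the image from standard input and
    # prints the recognized text to standard output.
    recognized = subprocess.run(
        list(ocr_cmd), input=converted.stdout, capture_output=True, check=True
    )
    # Decode leniently; the encoding of the OCR output is an assumption.
    return recognized.stdout.decode("latin-1")
```

Something like ocr_image("spam.gif") (a hypothetical file name) would then
return the recognized text as a string.  Other image formats would need a
different converter passed in as convert_cmd.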

I gave this a whirl.  The results were not so good for the samples I pulled
out of my current spam database.  The first image yielded:

    The Cr_i_i_er
    A TOTALLY NEW WAY TO EN_OY _EXl
    ThP C_a_|_|_P_ _a_P_ _hP p_a_|_ o__ o_ _P_ _o_ a_
    p_pP_IP__P _ha_._ ||_P _o_hl_p Pl_P
    ComP a_d dl__o_P_ wha_ ma_y a_P _alll_p
    "The blgge__ new_ ln _oy_ _lnce _he vlbra_orl"
    A _o_ally _Pw way _o p_loy _P_ |_ o__P__ _omP
    o_ _hP mo__ _o_Pl a_d _P___al p_pP_IP__P_
    po__lblP all |_ _ompa__ PIPpa__ dP_lp_ \
    _ha_._ b_ll_ _o _hP hlphP__ __a_da_d_
    o_ ___P_p_h a_d _Pllablll _

    To __dP___a_d wha_ |_._ all abo__
    yo_ ha_Plo _PP ll |_ a_llo_ al o__wPb _llP by _

Can you tell what it's about?  Here's the second:

    clall_ _o_ Tab_ a_ low a_ __ TB
    _I___ llkP _PO_la_ Clall_ b__ _OP_lall_ _o_m_la_Pd _hP_P
    olll_ a_P _o_ and dl__ol_ablP _ndP__hP _onO_P ThP
    p_Prt o_ _hl_ |_ mo_P dl_Prt ab_o_O_lon ln_o _hP
    blood___Pam _a_hP_ _han _h_o_Oh _hP __oma_h RP__|_
    - a OowPrf_| la__lnO p_Prt o_ _O _o _S ho___

    _. ._

The third image caused an error message in giftopnm.  Here's the fourth:

    _n_
    Levl_ra
    Amblen
    _all_ _ 75
    Pr_ac
    Vallum s1 21
    Soma
    Vlagra 83 33

A bit better.  Here are the last four (separated by "***"):

    Learn_y.

    ***
    _o_o tEhTIFIEO
    p__lw_

    ***
    ?___?nA.

    ***
    AUTHORII_D

    ***
    DP_|_|7 IT N7_7' ,\ __ V
    _ -
    \

Doesn't look so useful to me.  According to the ocrad README file:

    Caveats.
    For better results the characters should be at least 20 pixels high.
    Merged characters are always a problem. Try to avoid them.
    Very bold or very light (broken) characters are also a problem.
    Always see with your own eyes the pnm file before blaming ocrad for the
    results. Remember the saying, "garbage in, garbage out".

Maybe a more mature OCR program would help, but ocrad seems to have a ways
to go.
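
One rough way to put a number on how garbled a given result is would be to
count the share of underscores in it.  This is a sketch that assumes the
underscores in the output above stand for characters ocrad could not
recognize; the helper name unrecognized_ratio is made up for illustration.

```python
def unrecognized_ratio(text, marker="_"):
    """Return the fraction of non-whitespace characters equal to `marker`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Nothing recognized and nothing flagged: treat as 0.0 by convention.
        return 0.0
    return sum(c == marker for c in chars) / len(chars)


# A few strings taken from the OCR results quoted in this message.
for sample in ["AUTHORII_D", "Soma", "Levl_ra"]:
    print(f"{sample!r}: {unrecognized_ratio(sample):.2f}")
```

For "AUTHORII_D" this gives 0.1 (one underscore out of ten characters).  A
low ratio does not mean the text is right, since misread letters such as
"Vlagra" are not flagged at all.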

Skip

