[Spambayes] Image spam
skip at pobox.com
Mon Jun 12 01:07:58 CEST 2006
>> I found an interesting program that might be exactly what you are
>> looking for: ocrad. This is GNU software, can accept pbm files or
>> standard input, and outputs text to standard output. So this is a
>> commandline ocr program that can be used in a script. Don't worry
>> about the pbm files, the ocrad manual describes how to convert other
>> image formats to pbm (jpeg, png, ps, pdf,...)
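For anyone who wants to repeat the experiment, here is a minimal sketch of the kind of script the quoted text has in mind: convert a GIF with giftopnm, then feed the result to ocrad on standard input. The function names `ocr_commands` and `ocr_gif` are made up for illustration, and the sketch assumes both tools are installed and on the PATH.

```python
import subprocess


def ocr_commands(image_path):
    """Return the two command lines of a giftopnm | ocrad pipeline."""
    # giftopnm writes a PNM image to stdout; ocrad with no file
    # arguments reads its image from stdin.
    return [["giftopnm", image_path], ["ocrad"]]


def ocr_gif(image_path):
    """Run the pipeline and return whatever text ocrad recognized.

    Raises subprocess.CalledProcessError if either tool exits with an
    error, for example when giftopnm rejects a GIF it cannot parse.
    """
    convert_cmd, ocr_cmd = ocr_commands(image_path)
    pnm = subprocess.run(convert_cmd, check=True, capture_output=True).stdout
    result = subprocess.run(ocr_cmd, input=pnm, check=True, capture_output=True)
    # latin-1 maps every byte to a character, so decoding cannot fail
    # on whatever bytes ocrad emits.
    return result.stdout.decode("latin-1")
```

Other image formats would need a different converter in place of giftopnm; the ocrad half of the pipeline stays the same.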
I gave this a whirl. Not so good for the samples I pulled out of my current
spam database. The first image yielded:
The Cr_i_i_er
A TOTALLY NEW WAY TO EN_OY _EXl
ThP C_a_|_|_P_ _a_P_ _hP p_a_|_ o__ o_ _P_ _o_ a_
p_pP_IP__P _ha_._ ||_P _o_hl_p Pl_P
ComP a_d dl__o_P_ wha_ ma_y a_P _alll_p
"The blgge__ new_ ln _oy_ _lnce _he vlbra_orl"
A _o_ally _Pw way _o p_loy _P_ |_ o__P__ _omP
o_ _hP mo__ _o_Pl a_d _P___al p_pP_IP__P_
po__lblP all |_ _ompa__ PIPpa__ dP_lp_ \
_ha_._ b_ll_ _o _hP hlphP__ __a_da_d_
o_ ___P_p_h a_d _Pllablll _
To __dP___a_d wha_ |_._ all abo__
yo_ ha_Plo _PP ll |_ a_llo_ al o__wPb _llP by _
Can you tell what it's about? Here's the second:
clall_ _o_ Tab_ a_ low a_ __ TB
_I___ llkP _PO_la_ Clall_ b__ _OP_lall_ _o_m_la_Pd _hP_P
olll_ a_P _o_ and dl__ol_ablP _ndP__hP _onO_P ThP
p_Prt o_ _hl_ |_ mo_P dl_Prt ab_o_O_lon ln_o _hP
blood___Pam _a_hP_ _han _h_o_Oh _hP __oma_h RP__|_
- a OowPrf_| la__lnO p_Prt o_ _O _o _S ho___
_. ._
The third image caused an error message in giftopnm. Here's the fourth:
_n_
Levl_ra
Amblen
_all_ _ 75
Pr_ac
Vallum s1 21
Soma
Vlagra 83 33
A bit better. Here are the last four (separated by "***"):
Learn_y.
***
_o_o tEhTIFIEO
p__lw_
***
?___?nA.
***
AUTHORII_D
***
DP_|_|7 IT N7_7' ,\ __ V
_ -
\
Doesn't look so useful to me. According to the ocrad README file:
Caveats.
For better results the characters should be at least 20 pixels high.
Merged characters are always a problem. Try to avoid them.
Very bold or very light (broken) characters are also a problem.
Always see with your own eyes the pnm file before blaming ocrad for the
results. Remember the saying, "garbage in, garbage out".
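The 20-pixel caveat suggests enlarging small text before recognition. A rough sketch, assuming the character height in the image is already known or estimated (the `glyph_height` parameter and both function names are hypothetical) and assuming ocrad's `--scale` option behaves as its help text describes; check `ocrad --help` on your version before relying on it:

```python
import math


def scale_factor(glyph_height, min_height=20):
    """Smallest integer factor that brings glyphs up to min_height pixels."""
    if glyph_height <= 0:
        raise ValueError("glyph_height must be positive")
    return max(1, math.ceil(min_height / glyph_height))


def ocrad_command(glyph_height):
    """Build an ocrad command line, upscaling only when needed."""
    factor = scale_factor(glyph_height)
    cmd = ["ocrad"]
    if factor > 1:
        # Assumption: --scale=<n> enlarges the input image n times.
        cmd.append(f"--scale={factor}")
    return cmd
```

Upscaling addresses only the height caveat. Merged or broken characters, which spam images tend to have plenty of, are a separate problem.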
Maybe a more mature OCR program would help, but ocrad seems to have a ways
to go.
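One rough way to put a number on "not so good" is to measure how much of the output consists of the underscore that ocrad appears to print for characters it cannot recognize. This is only a sketch: `unrecognized_ratio` is a made-up helper, and treating "_" as the unrecognized-character marker is an assumption based on the output above.

```python
def unrecognized_ratio(text, placeholder="_"):
    """Fraction of non-whitespace characters equal to the placeholder."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing was recognized at all; report 0.0 rather than divide by zero.
        return 0.0
    return sum(ch == placeholder for ch in visible) / len(visible)


print(unrecognized_ratio("Vlagra 83 33"))  # 0.0
print(unrecognized_ratio("AUTHORII_D"))    # 0.1
```

A low ratio does not mean the text is right. "Vlagra" scores a perfect 0.0 even though the "i" was read as "l", so this catches outright failures, not misreadings.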
Skip