[spambayes-dev] gocr is definitely improving...
skip at pobox.com
Tue Feb 6 03:07:01 CET 2007
I got an image-spam message today (I probably got quite a few, but Gmail
blocks most of them nowadays):
http://www.webfast.com/~skip/thermometer.gif
I ran gocr 0.41 over it and got this output:
> _'__o______ __ ____o______ ___
i__8
_____ 00,__ 0 0_,_>
0 __8 ___E3 __>_E3 __ E3_,__ _____
0,__,_ _ _0______ _ 0 __0
_, ___ _E3____ E3 _ _ _ ____ ____ 'o__0____ ____ 0>,E3
_______ __ _________, _,______ _ 0 __________ ___,,_____,
____ ____',____ ____ ___ ___ _ 0 ___ >__ ____ ___
____ _ ___E3_ ___e__ ___E3___ 0 ______
The latest version is 0.43, so I downloaded and built it (a couple of
slight tweaks were needed). Fed the same image, it spit out:
_ _ _ _ _ _ _
X;niy_nha_ Technology Ltd
qnb oI_ _
p_rce I1.SB lP 1_.6_
hb te: H_ts Il_ghs of I1._B TodJy
.M_ rc _ Fxpected T _ rr _
Ini thc Izst 3 _ eks they ha_e ianded o_er I1.Z
M II_on _n contracts. TJdays n _ Jnnounced anothe?
huge cont_iact. Read all the n _ and set ycur buy
fur_ mm f_rst cn_ng Tuesday nD rn_ng!
Pretty huge improvement. (I think you can see why I gave up on gocr
before.) By comparison, with my latest massaging of the input fed to ocrad
I get:
X?nU?nha? TechnologU L_d
glbol! _
p_rce __.58 LP _3.6_
__e: H__s H_ghs or __.78 Tod_V
_re _ Expec_ed T_rr_
In _he las_ 3 _ehs _heV ha_e landed o_er t_.2
n?ll?on ?n con_roc_s, TodoVs n_ onnounced ono_her
huge con_rac_, RPad all _he n_ and se_ Uour buU
ror mM r?rs_ _h?ng TuesdaU nDrn?ng!
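
One rough way to put a number on comparisons like the ones above is to
count how many whitespace-separated tokens came out as plain words or
numbers. This is only a sketch: the function name clean_token_ratio and
the choice of which trailing punctuation to ignore are assumptions made
for illustration, not anything gocr, ocrad, or SpamBayes provides.

```python
def clean_token_ratio(ocr_text):
    """Return the fraction of tokens that are purely letters or digits.

    A crude proxy for OCR quality: tokens containing placeholder or
    stray characters (for example "_" or "?") count as not clean.
    """
    tokens = ocr_text.split()
    if not tokens:
        return 0.0

    def is_clean(token):
        # Ignore ordinary sentence punctuation at the ends of a token.
        core = token.strip(".,!:;")
        return core.isalpha() or core.isdigit()

    clean = sum(1 for token in tokens if is_clean(token))
    return clean / len(tokens)
```

Feeding each engine's output for the same image through this gives a
single figure per run. It says nothing about whether the recognized
words are the right ones, only how much of the output looks like words.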
Without any massaging, ocrad doesn't find any text; you have to give it
the --invert flag. It seems like it should automatically try inverting the
image if its first attempt to extract text fails completely.
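
Until ocrad does that itself, the retry can be done from the calling side.
A minimal sketch, assuming ocrad is on the PATH, that the image has
already been converted to a format ocrad reads (PNM), and that "no
letters or digits at all" is a reasonable test for a failed first pass.
The helper names here are made up for illustration.

```python
import subprocess


def looks_empty(text):
    """True if the OCR output contains no letters or digits at all."""
    return not any(ch.isalnum() for ch in text)


def run_ocrad(image_path, invert=False):
    """Run ocrad on one image and return whatever it prints to stdout."""
    cmd = ["ocrad"]
    if invert:
        cmd.append("--invert")
    cmd.append(image_path)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def ocr_with_invert_fallback(image_path, run=run_ocrad):
    """Try a normal pass first; retry with --invert if nothing was found."""
    text = run(image_path, invert=False)
    if looks_empty(text):
        text = run(image_path, invert=True)
    return text
```

The run parameter exists only so the retry logic can be exercised without
ocrad installed. The cost is a second OCR pass on images that really do
contain no text.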
At any rate, gocr looks much better than it did. I'm going to install it
and give your patch a try for a couple days. It looks fine based on a
simple skim of the changes. Go ahead and check it in so more people can
play with it.
Skip