[spambayes-dev] [Spambayes] date for new release to handle image spam?

Mark Hammond mhammond at skippinet.com.au
Mon Feb 5 04:24:38 CET 2007


In the message below (which I sent to spambayes instead of -dev), I
mentioned I got much better results with gocr than ocrad.  I've uploaded my
patch at
http://sourceforge.net/tracker/index.php?func=detail&aid=1652111&group_id=61702&atid=498105
and I've assigned it to skip for a quick scan.  There are
some bits of the outlook patch mixed in there too, but that shouldn't
distract from the rest of the patch.  I'd obviously welcome all testing of
this and am happy to check it in.
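
For anyone wanting to experiment before looking at the patch, here is a
minimal sketch of what "support either ocrad or gocr" could look like.  It is
not taken from the patch: the names OCR_COMMANDS, build_ocr_command and
extract_text are made up for illustration, and it assumes the image has
already been converted to PNM and that both programs write the recognised
text to stdout.

```python
import subprocess

# Hypothetical table of command-line prefixes, one per OCR engine.
# Assumption: "ocrad FILE" and "gocr -i FILE" both print text to stdout.
OCR_COMMANDS = {
    "ocrad": ["ocrad"],
    "gocr": ["gocr", "-i"],
}


def build_ocr_command(engine, image_path):
    """Return the argument list used to run the chosen engine on one image."""
    try:
        prefix = OCR_COMMANDS[engine]
    except KeyError:
        raise ValueError("unknown OCR engine: %r" % (engine,))
    return prefix + [image_path]


def extract_text(engine, image_path, run=subprocess.run):
    """Run the engine and return whatever text it printed.

    The run argument exists only so the function can be exercised
    without either program installed.
    """
    cmd = build_ocr_command(engine, image_path)
    result = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    # OCR output is not guaranteed to be valid UTF-8, so decode leniently.
    return result.stdout.decode("latin-1", "replace")
```

With this shape, switching engines is a single configuration value, e.g.
extract_text("gocr", "bogus-5-3.pnm"), which makes it easy to run the same
corpus through both and compare.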

Cheers,

Mark

> -----Original Message-----
> From: spambayes-bounces at python.org
> [mailto:spambayes-bounces at python.org]On Behalf Of Mark Hammond
> Sent: Monday, 5 February 2007 11:34 AM
> To: skip at pobox.com
> Cc: spambayes at python.org
> Subject: Re: [Spambayes] date for new release to handle image spam?
>
>
> > If you run ocrad over some spam text images you can see what it generates.
> > If it finds nothing, nothing comes out the back end.  If it sees something,
> > it's almost certain to be some garbage text peculiar to it, unlikely to turn
> > up in normal text.  For example, here's a pretty clean image:
> >
> >     http://www.webfast.com/~skip/bogus-5-3.png
> >
> > Here's what ocrad produces by default:
> >
> >     COULD THl_ BE THE NEXT IBM_
> >     ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl
> >     WAIIW LllL p_ Ll_E A WAW_ _IARll__ WO_DA_ _EPIEWBER lll
> >
> >     IomO_n_ __m_ L |_IL IOWP_IER_ |_I (o_h__ OII LllL p_)
> >     __o__ __mbol LllL
> >     F_ld__ Ilo__ O Tl (_o s_/_ On F_ld__ Alon_|)
> >     _ d__ |__o__ __
> >     I____n_ R__lnO ___onO B__
> >     \
> >     ln _h_ Io____ ot _ W___. LllL W____ ______| ___nnlnO Wo___'
> >
> >     L ln___n__lon_| Anno_n___
> >
> >     On_lo__h(IW) _P_o_P__ TP_hnoloO_ b_
> >     B_llP_ p_oo_ Da_a _P___|__ Ba_k_O_ and _P__o_P_
> >     |__ ____ __n____lon p__Aqco_TM_/P__AID CO_TM_
> >     _|__a Po__ablP wloh _OPPd _olld __a_P D_|_P TP_hnoloO_
> >     _h_ W___oOoll_. _hP Wo_ld _ _|___ _g laO_oO ComOrfP_
> >     _Pa___lnO W_ldla _ Q_a_ll TP_hnoloO_
> >     \
> >     L ln___n__lon_| _IOn_ _4 _W E__oO__n Dl___lb__lon AO___m_n_
> >
> >     Th_ b_Pmo__ __PO b_wa_d _a__|_al _Pn___P |_ amonO o_hP_ p__|__|_P
> >     dl___lb__lon aO_PPmPn__ ____Pn_|_ _ndP_ nPOo_la_lon ª_ _P_P_al addl_lonal
> >     hlOh O_ofi_ _POlon_ and _PO_P_Pn__ a kP_ ___a_POl_ Oa__nP__hlO _ha_ _P___P_
> >     l ln_P_na_lonal ComO__P__ wl_h ___|_ Olobal ma_kP_ _Pa_h and O_a_an_PPd
> >     O_P _alP_ and lo_k_ _hP _omOan_ ln hlOhl_ dP_|_ablP p__|__|_P dl___lb__lon
> >     ma_kP__
> >
> >     READ MORE ONLINE NOWl
> >
> >     OPPORl__||_ DOE_ _ol __OI_ o_ IWE DOOR E_ER_ DA_|
> >     _o _A_E A Wl__IE IOODD LllL lo _O_R RADAR _ow A_D
> >     WAIIW II _OARl
>
> FWIW, I am getting *much* better results with gocr than ocrad.  gocr running
> over that same image results in:
>
> --- 8< ---
> _        _ _   _
> COULD THIS BE THE NEXT IBM?
> ALL SIGNS SHOW THAT LITL IS ABOUT TO EXPLODE!
>
> Company Name:
> Stock Symbol:
> Friday Close:    O.71 (Up 6O_a On Friday Alone!)
> S-dayTarget:   $3
> Current Rating:  Strong Buy
> \
>
> In the Course of a Week, LITL Makes Several Stunning Moves!
>
> L International Announces:
>
> - OneTouch(TM) Recovery Technology hr
> Bullet-Proof Data Security Backups and Restores          ,
> - Its Next-Generation PuRA_GO(TM)/PuRAID-GO(TM)
> UItra-Portable High-Speed Solid State Drive Technology
> . - the metropolis, the worldt First l9'' Laptop compWer
> Featuring Nvidiat Quad-SLI Technology   _
>
> \
> L International Signs $4SM European Distribution Agreement
>
> - T_s hremost step hrward tactical venture is, among other exclusive
> distribution agreements, currently under negotiation gr several additional
> high-pro_t regions and represents a key strategic partnership that secures
> L International Computers with truly global market reach and guaranteed
> pre-sales, and locks the company in highly desirable exclusive distribution
> marke.ts.
>
> --- >8 ---
>
> Indeed, I have never seen an image that ocrad does better on than gocr.
> FWIW, I'm currently halfway through modifying spambayes to support either
> ocrad or gocr, in the hope that using gocr will actually cause a noticeable
> reduction in image spam.  Unfortunately, using gocr I see no reduction at
> all (which isn't to say there is not a small reduction; it just doesn't
> "seem" to me like it has reduced).
>
> Mark
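
The difference between the two transcripts quoted above can be put into rough
numbers.  The sketch below is only an illustration, not anything spambayes
does: clean_word_ratio is a made-up helper that counts the fraction of
whitespace-separated words made up purely of letters once ordinary
punctuation is trimmed from both ends.  The two samples are the first two
lines of each engine's output as quoted above.

```python
# First two lines of each transcript, copied from the quoted message.
OCRAD_SAMPLE = (
    "COULD THl_ BE THE NEXT IBM_\n"
    "ALL _|___ _wow IWAl LllL |_ ABO_| lo EXPLODEl\n"
)
GOCR_SAMPLE = (
    "COULD THIS BE THE NEXT IBM?\n"
    "ALL SIGNS SHOW THAT LITL IS ABOUT TO EXPLODE!\n"
)

# Punctuation trimmed from both ends of a word before testing it.
# "_" and "|" are deliberately left out: they are typical OCR debris.
TRIM_CHARS = ".,!?:;()'\""


def clean_word_ratio(text):
    """Return the fraction of words that are purely alphabetic."""
    words = text.split()
    if not words:
        return 0.0
    clean = sum(1 for word in words if word.strip(TRIM_CHARS).isalpha())
    return clean / len(words)


if __name__ == "__main__":
    print("ocrad:", clean_word_ratio(OCRAD_SAMPLE))
    print("gocr: ", clean_word_ratio(GOCR_SAMPLE))
```

This is a crude measure: a misread such as "IWAl" still counts as clean
because it contains only letters, so the ocrad figure is, if anything,
flattering.  It says nothing about whether the classifier benefits, which is
the open question in the message above.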