[Spambayes] Windows compatibility - OCR [was: Unwanted stock solicitations]

Vibe Grevsen grevsen at gmail.com
Sat Nov 4 19:27:40 CET 2006


Hi friends,
   
>> >> 1. Put ocrad 0.16 in the path

> I have no experience with mingw but I compiled ocrad
> using it and I'm using the result (without cygwin dll) with no problem,

Ok, but note that the sources posted in spambayes-something were 0.15!
New version 0.16 can be downloaded here: http://ftp.gnu.org/gnu/ocrad/ocrad-0.16.tar.bz2
According to the changelog the character recognition was improved.

If you built a 0.16 exe without cygwin1.dll I would like to test it.
Can you post it somewhere together with a short description of how it was built?
"Pretty please with sugar on top".



>> > Have you tried other ocr programs?
>> 
>> No, not yet.
>> 
>> Tony Meyer suggested Tesseract:

> I built tesseract with no problem.
...
> I tested few spam images and the results were poor.



>> I mailed with NoSpam Today! Support (spamassassin based) before I chose SB.
>> They were doing research on FuzzyOcr and ImageInfo. Maybe we could ask
>> again about their results. I believe FuzzyOcr is gocr-based?
>
> Yes, they are using gocr. But as I said in my previous mail it has its
> own  problems.

Ok then it has at least been tried ...



>> Since the ocr is working with ocrad and - as you see below - I get very
>> good results I will be moving on to the next area now.

> You are lucky. My results are so so. Probably I get a reduction of a
> 60/70% of spam with images (which in itself could be considered not bad)
> but way too much spam is not stopped.

I expect results to vary and it is too early in my testing to tell, but today SB caught
17 of 18 spams. I changed the spam cutoff to 0.7; however, that didn't even seem necessary.
My database contains 845 spams and 1411 hams. Zero false positives!



>> I think it is far more beneficial to do more research into the actual processing
>> as you commented elsewhere than to start the whole testing/tweaking all over
>> again with a new ocr engine. Of course that is just my opinion...
>
> Yes and no. We need a decent ocr engine to start with, then we may focus
> on better image manipulation.

Yes...

>> At the moment spambayes has trouble with images for the following
>> reasons:
>
> - PIL sometimes fails to handle the image. I'm still investigating the
> issue but the images seem reasonably correct (IE, Firefox and many
> viewers, on linux and windows, are able to display them). It's quite
> rare and not a big issue

Not an ocr problem, but a preprocessing problem...
It's great that you are looking into this because I for one don't know python
well enough to debug such issues.

> - ocr results are poor. The worst case is when you get a sequence of
> chars (char space char space ...) or a long word. Both are ignored by
> spambayes

Tokenizer problem, configurable. Not related to the ocr engine.
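
As a rough illustration of handling the "char space char space" case outside the
ocr engine, the sketch below joins runs of space-separated single characters into
one word before the text reaches the tokenizer. `collapse_spaced_chars` is a
hypothetical helper, not part of SpamBayes, and the minimum run length of three
characters is an assumption chosen to leave ordinary short words alone.

```python
import re

# Hypothetical helper, not part of SpamBayes.
# Matches three or more single non-space characters separated by single
# spaces, bounded by whitespace or the ends of the string.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){2,}\S(?!\S)")


def collapse_spaced_chars(text):
    """Join runs such as 'V I A G R A' into 'VIAGRA'."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


print(collapse_spaced_chars("buy V I A G R A now"))
```

Shorter runs such as "a b" are left untouched, so normal prose containing
one-letter words should mostly pass through unchanged.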
 
> - There are images which contain more than words and in this case we may
> get no tokens.

I have seen many animations with moving artefacts. Usually not a problem,
but it may be in the future. Again some filtering - which is preprocessing - might
be a brilliant idea.

> In a few cases, if the colors used inside the image are changed, you get a
> different result.

We should work on filtering and histogram analysis to determine the correct
threshold level for the ocr. If we find a better way than what ocrad already
does then we can pass it via the -T parameter.
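
One common histogram-based way to pick a threshold is Otsu's method. It is used
here only as an illustration; nothing in this thread says it beats what ocrad
already does. The sketch works on a plain 256-bin grayscale histogram (a list of
pixel counts) and converts the chosen gray level to a percentage, on the
assumption that the -T parameter accepts a value in the 0-100% range.
`otsu_threshold` and `as_percent` are hypothetical names.

```python
def otsu_threshold(histogram):
    """Return the gray level (0-255) that maximizes between-class variance.

    `histogram` is a list of 256 pixel counts for a grayscale image.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0      # number of pixels at or below the candidate level
    sum_bg = 0         # sum of gray values at or below the candidate level
    best_level = 0
    best_variance = -1.0
    for level, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += level * count
        mean_bg = sum_bg / float(weight_bg)
        mean_fg = (sum_all - sum_bg) / float(weight_fg)
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_level = level
    return best_level


def as_percent(level):
    """Convert a 0-255 gray level to a 0-100 percentage (assumed -T scale)."""
    return int(round(100.0 * level / 255.0))


# Toy histogram: dark text pixels around 50, light background around 200.
hist = [0] * 256
hist[50] = 100
hist[200] = 100
level = otsu_threshold(hist)
print(level, as_percent(level))
```

With PIL, a histogram of this shape could be obtained from a grayscale image via
`image.convert("L").histogram()`.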

Advanced filtering can even detect repetitive patterns or noise in the background
and remove that.

Sure, a professional ocr engine like OmniPage Pro already does huge amounts
of preprocessing, such as automatic rotation correction, but that
does not yet seem necessary for our purpose...

> I have no knowledge of image processing but I tried a few simple
> operations (like scaling, sharpening, converting to gray, ...) and got no
> results. They were all quick tests and the results are in no way
> conclusive.

I did a course in image analysis. I don't know python / PIL, but I could
probably do some tests in Matlab when my numbers start to deteriorate.

If you have a way to batch-extract images from emails or from a dbx-file,
or if you send me a zip of 100+ problematic spam images, then I would
be happy to run some tests, e.g. on the best scale factor and scaling algorithm.
I can batch-convert them so only worry about extraction.
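
For the extraction step, a minimal sketch using only Python's standard `email`
module could look like the following. It assumes each message is already
available as raw RFC 2822 bytes (for example an exported .eml file); reading a
dbx-file directly is not covered. `extract_images` and its file naming scheme
are hypothetical.

```python
import email
import os


def extract_images(raw_message, out_dir, prefix="img"):
    """Save every image/* part of one raw email (bytes) into out_dir.

    Returns the list of file paths that were written.
    """
    msg = email.message_from_bytes(raw_message)
    written = []
    for index, part in enumerate(msg.walk()):
        if part.get_content_maintype() != "image":
            continue
        payload = part.get_payload(decode=True)  # undo base64 etc.
        if not payload:
            continue
        # Use the MIME subtype (gif, jpeg, png, ...) as the extension,
        # keeping only alphanumeric characters to stay filesystem-safe.
        subtype = part.get_content_subtype()
        ext = "".join(c for c in subtype if c.isalnum()) or "bin"
        path = os.path.join(out_dir, "%s_%03d.%s" % (prefix, index, ext))
        with open(path, "wb") as handle:
            handle.write(payload)
        written.append(path)
    return written
```

Calling this in a loop over a directory of saved messages, with a different
`prefix` per message, would give a flat folder of images ready for
batch conversion.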



> From my understanding, in Options.py you set the default values,
> bayescustomize.ini contains the values chosen by the user, and in
> ImageStripper.py the programmer may embed his own values, ignoring the user's
> choice (joking)

Something like that I think :) Did you try to change this in ImageStripper.py and
did it make any change to the output?


 
>> With 2 you should get image tokens of this quality:
>> 
>> watch
>> out
>> here
>> comes
>> the
>> big
>> one!
...
>> That is about a 90% recognition or so.

> Yes, sometimes the results are good and sometimes they are much worse. In a few
> cases a scaling factor of 3 is better. Just now I'm doing a retraining
> with ocrad_scale set to 3. We will see in the next days if the results
> are better or worse

Yes, my initial suggestion was scaling by 4, but Skip argued to use 2. He did tests
with different scales already. Intuitively a larger scale should be better.
I found however that it slowed down the analysis.

Now I don't know what ocrad does, but resampling might be better than resizing.
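
One possible reading of that distinction is plain pixel repetition
(nearest neighbour) versus interpolating between neighbouring pixels. The sketch
below shows both on a grayscale image held as a list of rows; it is pure Python
for illustration only, and the function names are hypothetical. It says nothing
about what ocrad does internally.

```python
def upscale_nearest(pixels, factor):
    """Enlarge a grayscale grid by repeating each pixel `factor` times."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in pixels
        for _ in range(factor)
    ]


def upscale_bilinear(pixels, factor):
    """Enlarge a grayscale grid, interpolating between neighbouring pixels."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for oy in range(height * factor):
        # Map the output pixel centre back into source coordinates,
        # clamped to the valid range.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), height - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        fy = sy - y0
        row = []
        for ox in range(width * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), width - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            fx = sx - x0
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(int(round(top * (1 - fy) + bottom * fy)))
        out.append(row)
    return out


edge = [[0, 255]]  # one black pixel next to one white pixel
print(upscale_nearest(edge, 2)[0])
print(upscale_bilinear(edge, 2)[0])
```

Pixel repetition keeps the hard edge, while interpolation introduces
intermediate gray levels, which in turn interacts with whatever threshold the
ocr applies. In PIL the equivalent choice is the filter argument to
`Image.resize` (for example `Image.NEAREST` or `Image.BILINEAR`).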



Happy coding :)

Vibe

PS: How come my posts always show up as new threads?
Using OE. Don't want to subscribe.