Windows compatibility - OCR [was: Unwanted stock solicitations]
Hi friends,
1. Put ocrad 0.16 in the path
I have no experience with mingw, but I compiled ocrad with it and I'm using the result (without the cygwin dll) with no problems.
Ok, but note that the sources posted in spambayes-something were 0.15! The new version 0.16 can be downloaded here: http://ftp.gnu.org/gnu/ocrad/ocrad-0.16.tar.bz2 According to the changelog the character recognition was improved. If you built a 0.16 exe without cygwin1.dll I would like to test it. Can you post it somewhere together with a short description of how it was built? "Pretty please with sugar on top".
Have you tried other ocr programs?
No, not yet.
Tony Meyer suggested Tesseract:
I built tesseract with no problem. ... I tested a few spam images and the results were poor.
I exchanged mail with NoSpam Today! Support (SpamAssassin based) before I chose SB. They were doing research on FuzzyOcr and ImageInfo. Maybe we could ask again about their results. I believe FuzzyOcr is gocr-based?
Yes, they are using gocr. But as I said in my previous mail it has its own problems.
Ok then it has at least been tried ...
Since the ocr is working with ocrad and - as you see below - I get very good results I will be moving on to the next area now.
You are lucky. My results are so-so. I get a reduction of perhaps 60-70% of spam with images (which in itself could be considered not bad), but way too much spam is still not stopped.
I expect results to vary and it is too early in my testing to tell, but today SB caught 17 of 18 spams. I changed the spam cutoff to 0.7, though that didn't even seem necessary. My database contains 845 spams and 1411 hams. Zero false positives!
I think it is far more beneficial to do more research into the actual processing as you commented elsewhere than to start the whole testing/tweaking all over again with a new ocr engine. Of course that is just my opinion...
Yes and no. We need a decent ocr engine to start with, then we may focus on better image manipulation.
Yes...
At the moment spambayes has trouble with images for the following reasons:
- PIL sometimes fails to handle the image. I'm still investigating the issue, but the images seem reasonably correct (IE, Firefox and many viewers, on linux and windows, are able to display them). It's quite rare and not a big issue.
Not an ocr problem, but a preprocessing problem... It's great that you are looking into this because I for one don't know python well enough to debug such issues.
- ocr results are poor. The worst cases are when you get a sequence of chars (char space char space ...) or one long word. Both are ignored by spambayes.
Tokenizer problem, configurable. Not related to the ocr engine.
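To illustrate the "char space char space" case mentioned above, here is a minimal sketch of how OCR output could be cleaned up before length-based filtering. This is not the SpamBayes tokenizer: the function names and the min_len/max_len limits are hypothetical, chosen only to show why a spaced-out word disappears unless the letters are joined first.

```python
import re

# A run of three or more single characters separated by single spaces,
# e.g. "v i a g r a". Hypothetical pattern, not taken from SpamBayes.
_SPACED_RUN = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced_letters(text):
    """Join runs of space-separated single characters into one word."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def ocr_tokens(text, min_len=3, max_len=12):
    """Split OCR text into words, keeping only those within the length limits.

    min_len and max_len are illustrative assumptions, not SpamBayes defaults.
    """
    words = collapse_spaced_letters(text).lower().split()
    return [w for w in words if min_len <= len(w) <= max_len]
```

Without the collapsing step, every single letter in a spaced-out word falls below the minimum length and is dropped, so the image yields no useful token for that word.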
- There are images which contain more than words and in this case we may get no tokens.
I have seen many animations with moving artefacts. Usually not a problem, but it may be in the future. Again some filtering - which is preprocessing - might be a brilliant idea.
In a few cases, if the colors used inside the image are changed, you get a different result.
We should work on filtering and histogram analysis to determine the correct threshold level for the ocr. If we find a better way than what ocrad already does then we can pass it via the -T parameter. Advanced filtering can even detect repetitive patterns or noise in the background and remove that. Sure, a professional ocr engine like OmniPage Pro already does huge amounts of preprocessing, such as automatic rotation correction, but that does not yet seem necessary for our purpose...
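As one possible reading of "histogram analysis to determine the threshold", here is a sketch of Otsu's method in plain Python. Otsu's method is my choice for illustration; nobody in the thread named a specific algorithm, and I don't know how ocrad picks its own threshold. The function names are hypothetical, and how the result must be scaled for ocrad's -T parameter is something to check against the ocrad documentation.

```python
def gray_histogram(pixels):
    """Count occurrences of each 8-bit gray level (0..255)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    return histogram


def otsu_threshold(histogram):
    """Return the gray level that maximizes between-class variance.

    Pixels with level <= result form one class, the rest the other.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_low = 0      # number of pixels at or below the candidate level
    sum_low = 0         # sum of gray values of those pixels
    best_level = 0
    best_variance = -1.0
    for level, count in enumerate(histogram):
        weight_low += count
        if weight_low == 0:
            continue    # no pixels this dark yet
        weight_high = total - weight_low
        if weight_high == 0:
            break       # every pixel is already in the low class
        sum_low += level * count
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_level = level
    return best_level


# Example: dark text pixels around level 50 on a light background around 200.
pixels = [50] * 100 + [200] * 100
level = otsu_threshold(gray_histogram(pixels))
fraction = level / 255  # normalized value; the scale -T expects is an assumption to verify
```

The pixel values would have to come from the decoded image (via PIL, for instance); that step is left out here so the sketch stays self-contained.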
I have no knowledge of image processing, but I tried a few simple operations (like scaling, sharpening, converting to gray, ...) and got no improvement. They were all quick tests and the results are in no way conclusive.
I did a course in image analysis. I don't know python / PIL, but I could probably do some tests in Matlab when my numbers start to deteriorate. If you have a way to batch-extract images from emails or from a dbx-file, or if you send me a zip of 100+ problematic spam images, then I would be happy to run some tests, e.g. on the best scale factor and scaling algorithm. I can batch-convert them, so only worry about extraction.
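For the batch-extraction part, a minimal sketch using only Python's standard email package might look like the following. It handles one raw RFC 2822 message at a time; reading Outlook Express dbx files is not covered, and the function name and fallback file naming are hypothetical.

```python
import email
from email import policy


def extract_images(raw_message):
    """Return (filename, bytes) pairs for every image part in a raw message."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    images = []
    for index, part in enumerate(msg.walk()):
        if part.get_content_maintype() != "image":
            continue
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:
            continue
        # Fall back to a generated name when the part carries no filename.
        name = part.get_filename() or "image-%d.%s" % (
            index, part.get_content_subtype())
        images.append((name, payload))
    return images
```

Looping this over a directory of saved messages and writing each payload to disk would give the kind of image collection asked for above.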
From my understanding, Options.py sets the default values, bayescustomize.ini contains the values chosen by the user, and in ImageStripper.py the programmer may embed his own values, ignoring the user's choice (joking).
Something like that I think :) Did you try to change this in ImageStripper.py and did it make any change to the output?
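The defaults-then-user-override layering described above can be sketched with the standard configparser module. SpamBayes has its own options machinery, so this only illustrates the idea; the section name "Tokenizer" and the default of 2 are assumptions, and only the option name ocrad_scale comes from this thread.

```python
import configparser

# Stand-in for the defaults that would live in Options.py.
# Section name and value are assumptions for illustration only.
DEFAULTS = {"Tokenizer": {"ocrad_scale": "2"}}


def load_options(user_ini_text):
    """Apply built-in defaults first, then let the user's ini text override them."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read_string(user_ini_text)
    return parser


# Stand-in for the contents of bayescustomize.ini.
options = load_options("[Tokenizer]\nocrad_scale: 3\n")
scale = options.getint("Tokenizer", "ocrad_scale")
```

A value hard-coded in the calling code instead of being read from the parser would bypass both layers, which is the situation joked about above.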
With 2 you should get image tokens of this quality:
watch out here comes the big one! ... That is about a 90% recognition or so.
Yes, sometimes the results are good and sometimes they are much worse. In a few cases a scaling factor of 3 is better. Just now I'm doing a retraining with ocrad_scale set to 3. We will see in the next few days whether the results are better or worse.
Yes, my initial suggestion was scaling by 4, but Skip argued to use 2. He did tests with different scales already. Intuitively a larger scale should be better. I found however that it slowed down the analysis. Now I don't know what ocrad does, but resampling might be better than resizing. Happy coding :) Vibe PS: How come my posts always show up as new threads? Using OE. Don't want to subscribe.
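On the resizing-versus-resampling question above, here is a small pure-Python sketch contrasting the two on a grayscale grid (a list of rows of 0..255 values). Pixel replication stands in for "resizing" and bilinear interpolation for "resampling"; that mapping is my assumption, and I don't know which method ocrad's own scaling uses. Both function names are hypothetical.

```python
def scale_nearest(pixels, factor):
    """Upscale by repeating each pixel factor times in both directions."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in pixels
        for _ in range(factor)
    ]


def scale_bilinear(pixels, factor):
    """Upscale by interpolating linearly between neighbouring pixels."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for oy in range(height * factor):
        sy = min(oy / factor, height - 1)   # source row position, clamped at the edge
        y0 = int(sy)
        y1 = min(y0 + 1, height - 1)
        fy = sy - y0
        row = []
        for ox in range(width * factor):
            sx = min(ox / factor, width - 1)  # source column position, clamped
            x0 = int(sx)
            x1 = min(x0 + 1, width - 1)
            fx = sx - x0
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out
```

Replication keeps edges hard but blocky, while interpolation introduces intermediate gray levels, which then interact with whatever binarization threshold the ocr applies. Which of the two helps recognition would have to be measured.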
I invite those of you working on the OCR stuff on Windows to subscribe to the spambayes-dev mailing list if you are not already subscribed: http://mail.python.org/mailman/listinfo/spambayes-dev Also, you should read the README-DEVEL.txt file in the top level directory of the CVS repository, especially if you want to test the various settings and have some hope of making apples-to-apples comparisons. Skip
participants (2)
- skip@pobox.com
- Vibe Grevsen