[spambayes-dev] Choosing which image to OCR

skip at pobox.com
Wed Sep 6 03:34:01 CEST 2006


I took a few minutes to examine a couple (as in exactly two) multi-frame GIF
images from stock spams I received in the past couple days.  I'd like a
cheap test to decide which frame is the best candidate for OCR without
OCRing every frame.  The computational costs are high enough already.

I have two images, bogus-0.gif and bogus-1.gif (both attached to this
message).  For each one I ran the following loop:

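    >>> from PIL import Image, ImageSequence
    >>> img = Image.open("bogus-0.gif")
    >>> for frame in ImageSequence.Iterator(img):
    ...   hist = frame.histogram()
    ...   bg = max(hist)                         # pixels in the most common palette entry
    ...   ncolors = len([x for x in hist if x])  # palette entries actually in use
    ...   print(bg, ncolors)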

For bogus-0.gif I got:

    220259 33
    217760 52
    213225 96
    182636 256
    222500 1

For bogus-1.gif I got:

    326518 5
    322180 9
    322817 7
    280174 11
    314741 10

In both cases the fourth frame has the smallest first number.  It seems
that the frame with the fewest white pixels (or, more precisely, the fewest
pixels in the most frequently used palette position) is a decent indicator
of the frame with the most useful pixels.  I also tried this more expensive
test at the shell:

    % for f in bogus-1-?.png ; do
        echo "*** $f ***"
        pngtopnm $f | ocrad | wc -c
      done
    *** bogus-1-0.png ***
           8
    *** bogus-1-1.png ***
          31
    *** bogus-1-2.png ***
          18
    *** bogus-1-3.png ***
        1219
    *** bogus-1-4.png ***
         340

The fourth frame does indeed have the most text.
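
The cheap test could be wrapped up roughly like this.  It's only a sketch:
the function names are made up for illustration, and it works on plain
histogram lists (what frame.histogram() returns for each frame) so it can be
tried without any images at hand.  It assumes the background really is the
single most common palette entry, which two sample images hardly prove.

```python
def dominant_count(histogram):
    """Pixel count of the most frequently used palette entry (the likely background)."""
    return max(histogram)


def colors_used(histogram):
    """Number of palette entries that appear at least once in the frame."""
    return sum(1 for count in histogram if count)


def best_frame_index(histograms):
    """Index of the frame whose dominant palette entry covers the fewest pixels.

    `histograms` is a list with one histogram (a list of ints) per frame.
    Ties go to the earliest frame.
    """
    return min(range(len(histograms)),
               key=lambda i: dominant_count(histograms[i]))
```

With PIL the input would be something like
[f.histogram() for f in ImageSequence.Iterator(img)], and only the frame at
the returned index would be handed to the OCR program.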

I couldn't run the same test on bogus-0.gif because this save loop in
Python didn't work properly:

    >>> from PIL import Image, ImageSequence
    >>> img = Image.open("bogus-0.gif")
    >>> for (i, frame) in enumerate(ImageSequence.Iterator(img)):
    ...   frame.save(open("bogus-0-%d.png" % i, "wb"))

The first frame saved has the proper palette.  The other saved frames are
just black-and-white.  (Can someone with more PIL experience explain why
this is so and how to get around it?)

I can imagine a spammer putting together a palette where there are 248
not-quite-white palette entries making up an essentially white background
and a few entries devoted to displaying text.  I'm sure PIL has something we
could use to quantize the palette down to 16 colors or so, then use that
palette to compute histograms, so I'm not all that worried about that
scheme.
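
To make the idea concrete, here is a crude stand-in for real quantization:
merge histogram bins whose palette colors agree in the top few bits of each
channel.  This is not what PIL does (a proper quantizer such as Pillow's
Image.quantize would be the better tool), and coarse_histogram and its bits
parameter are names invented for this sketch.  It assumes the palette is a
flat [r, g, b, r, g, b, ...] list, as frame.getpalette() returns, and that it
has an entry for every index that actually occurs in the histogram.

```python
def coarse_histogram(histogram, palette, bits=2):
    """Merge histogram bins whose palette colors share the top `bits` bits per channel.

    Returns a dict mapping a reduced (r, g, b) key to the total pixel count,
    so many not-quite-white entries collapse into one background bucket.
    """
    shift = 8 - bits
    merged = {}
    for index, count in enumerate(histogram):
        if not count:
            continue  # unused palette entry, nothing to add
        r, g, b = palette[3 * index:3 * index + 3]
        key = (r >> shift, g >> shift, b >> shift)
        merged[key] = merged.get(key, 0) + count
    return merged
```

The background estimate then becomes max(coarse_histogram(...).values())
instead of max(frame.histogram()).  Colors that straddle a bucket boundary
still get split, which is one reason a real quantizer would do better.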

The spammers do seem to be adapting very quickly (take a look at the frames
in bogus-1.gif to see what I mean).  I find it hard to believe it's in
response to what we're doing here.  I'm sure some other much bigger groups
must be doing OCR analysis of image-based spam these days.

BTW, don't worry too much if your mail program won't display the two images
properly.  XEmacs didn't like them at all, but Mozilla displayed them just
fine.

Skip


-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogus-0.gif
Type: image/gif
Size: 46862 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060905/ae5bb8be/attachment-0002.gif 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogus-1.gif
Type: image/gif
Size: 37727 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060905/ae5bb8be/attachment-0003.gif 

