[Spambayes-checkins] spambayes/spambayes ImageStripper.py,1.4,1.5

Sun Sep 10 00:18:31 CEST 2006

Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv30280

Modified Files:
	ImageStripper.py 
Log Message:
Add crude support for multi-frame GIFs to PIL_decode_parts().  I made a few
assumptions:

    1. NetPBM support will eventually be ripped out.  Everyone should be
       able to install PIL.  Consequently, no attempt to update the NetPBM
       code was made.

    2. The frame with the fewest background pixels is probably the one
       containing the text.  A GIF frame can cover just part of the
       overall image, so this assumption will likely be violated in the
       future.
       For the time being it appears most spammers have a hard time setting
       frame duration properly (are they trying to induce epileptic seizures
       or sell stocks?), let alone carving up frames into pieces.  We'll
       cross that bridge when we come to it.

    3. If an image's info dict doesn't have a "duration" key it's assumed to
       be a single-frame image.
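
Assumptions 2 and 3 can be sketched without PIL as follows.  This is only
an illustration, not the code in ImageStripper.py: the names
background_count and pick_text_frame are hypothetical, frames are modelled
as flat lists of pixel values, and info is a plain dict standing in for an
image's info dict.

```python
from collections import Counter


def background_count(pixels):
    """Count the pixels sharing the most common value.

    The most common value is assumed to be the background, which
    mirrors taking max() over a frame's histogram.
    """
    return max(Counter(pixels).values())


def pick_text_frame(frames, info):
    """Return the frame assumed to carry the text.

    frames: hypothetical non-empty list of flat pixel-value lists.
    info:   dict standing in for an image's info dict.
    """
    # Assumption 3: no "duration" key means a single-frame image,
    # so no histograms need to be built.
    if "duration" not in info:
        return frames[0]
    # Assumption 2: the frame with the fewest background pixels wins.
    # min() keeps the first frame on ties, like a strict "<" test.
    return min(frames, key=background_count)


frames = [
    [0, 0, 0, 0, 1, 2],  # 4 background pixels
    [0, 0, 1, 1, 2, 2],  # 2 background pixels
    [0, 0, 0, 1, 1, 2],  # 3 background pixels
]
chosen = pick_text_frame(frames, {"duration": 100})
```

Here chosen is the second frame, since it has the smallest background
count.  With an empty info dict the first frame is returned untouched.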


Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** ImageStripper.py	14 Aug 2006 02:58:11 -0000	1.4
--- ImageStripper.py	9 Sep 2006 22:18:28 -0000	1.5
***************
*** 22,26 ****
  
  try:
!     from PIL import Image
  except ImportError:
      Image = None
--- 22,26 ----
  
  try:
!     from PIL import Image, ImageSequence
  except ImportError:
      Image = None
***************
*** 189,192 ****
--- 189,219 ----
              continue
          else:
+             # Spammers are now using GIF image sequences.  From examining a
+             # minuscule set of multi-frame GIFs it appears the frame with
+             # the fewest number of background pixels is the one with the
+             # text content.
+ 
+             if "duration" in image.info:
+                 # Big assumption?  I don't know.  If the image's info dict
+                 # has a duration key assume it's a multi-frame image.  This
+                 # should save some needless construction of pixel
+                 # histograms for single-frame images.
+                 bgpix = 1e17           # ridiculously large number of pixels
+                 try:
+                     for frame in ImageSequence.Iterator(image):
+                         # Assume the most common pixel value is the
+                         # background; max() gives its pixel count.
+                         bg = max(frame.histogram())
+                         if bg < bgpix:
+                             image = frame
+                             bgpix = bg
+                 # I've empirically determined:
+                 #   * ValueError => GIF image isn't multi-frame.
+                 #   * IOError => Decoding error
+                 except IOError:
+                     tokens.add("invalid-image:%s" % part.get_content_type())
+                     continue
+                 except ValueError:
+                     pass
              image = image.convert("RGB")