[Tutor] Fwd: Extract image from RTF file

Marc Tompkins marc.tompkins at gmail.com
Sat Feb 14 20:36:40 CET 2009


Forgot to Reply All.

---------- Forwarded message ----------
From: Marc Tompkins <marc.tompkins at gmail.com>
Date: Sat, Feb 14, 2009 at 11:35 AM
Subject: Re: [Tutor] Extract image from RTF file
To: Bryan Fodness <bryan.fodness at gmail.com>


On Sat, Feb 14, 2009 at 8:40 AM, Bryan Fodness <bryan.fodness at gmail.com> wrote:

> I have a large number of RTF files where the only thing in them is an
> image.  I would like to extract them and save them as a png.
> Eventually, I would like to also grab some text that is on the image.
> I think PIL has something for this.
>
> Does anyone have any suggestion on how to start this?
>

I'm no kind of expert, but I do have a pointer or two...  RTF files are text
with lots and lots of funky-looking formatting, but generally not "binary"
in the sense of requiring special handling (although, now that I just read
about how pictures are stored in them, it seems there might be some
exceptions...)  There's a Python library for dealing with RTF files (
http://www.nava.de/2005/04/06/pyrtf/) but I haven't tried it; if you're
comfortable opening text files and handling their contents, it might be
simpler to roll your own for this task.

You'll want to look at the Microsoft RTF specification; version 1.6 is
available here:
    http://msdn.microsoft.com/en-us/library/aa140277(office.10).aspx

In particular, you'll be interested in the section on Pictures, which I'll
excerpt here:
Pictures

An RTF file can include pictures created with other applications. These
pictures can be in hexadecimal (the default) or binary format. Pictures are
destinations, and begin with the \pict control word. The \pict keyword
is preceded by \*\shppict destination control keyword as described in the
following example. A picture destination has the following syntax:

<pict>          '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> &
                <metafileinfo>?) <data> '}'
<picttype>      \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile |
                \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
<bitmapinfo>    \wbmbitspixel & \wbmplanes & \wbmwidthbytes
<pictsize>      (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? &
                \picscaley? & \picscaled? & \piccropt? & \piccropb? &
                \piccropr? & \piccropl?
<metafileinfo>  \picbmp & \picbpp
<data>          (\bin #BDATA) | #SDATA
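
To make the <picttype> line concrete, here is a minimal sketch (not from the
spec, and untested on real-world files) that reports which picture-type
control word a picture group uses. The names PICT_TYPES and picture_type are
my own, made up for illustration; the control words themselves come from the
excerpt above.

```python
import re

# Control words listed under <picttype> in the excerpt above.
PICT_TYPES = ("emfblip", "pngblip", "jpegblip", "macpict",
              "pmmetafile", "wmetafile", "dibitmap", "wbitmap")

# A control word may carry a numeric parameter (e.g. \wmetafile8), so allow
# optional digits and require that no further letter follows the word.
_PICTTYPE_RE = re.compile(r"\\(%s)(-?\d+)?(?![a-z])" % "|".join(PICT_TYPES))


def picture_type(pict_group):
    """Return the <picttype> control word in a picture group, or None.

    Hypothetical helper: pict_group is the text of one '{' \\pict ... '}' group.
    """
    match = _PICTTYPE_RE.search(pict_group)
    return match.group(1) if match else None
```

If this returns "pngblip" or "jpegblip", the decoded data should already be a
PNG or JPEG file; the metafile and bitmap types would need converting before
they could be saved as a png.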


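Basically, it looks like you can search for "{\pict", then find the "}" that
closes that group.  (A picture group may contain nested {...} groups of its
own, so the first "}" you meet is not necessarily the right one; count braces
as you go.)  Everything in between will be your picture, plus metadata that
tells you how to decode it.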
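
As a rough sketch of that search in Python 3 (my own guess at an approach,
not something taken from the spec or from pyrtf): find each "{\pict", count
braces to find where the group ends, throw away the control words, and decode
what is left as hexadecimal. The function names are made up for illustration,
and this assumes the default hexadecimal format; pictures stored with \bin
would need different handling.

```python
import re

# An RTF control word, its optional numeric parameter, and the single
# space that may separate it from the text that follows.
_CONTROL_WORD = re.compile(r"\\[a-zA-Z]+(?:-?\d+)? ?")


def iter_pict_bodies(rtf_text):
    """Yield the top-level contents of each {\\pict ...} group.

    Braces are matched by counting depth, so nested groups are skipped
    rather than mistaken for the end of the picture.
    """
    start = rtf_text.find("{\\pict")
    while start != -1:
        depth, i, kept = 0, start, []
        while i < len(rtf_text):
            ch = rtf_text[i]
            if ch == "\\" and rtf_text[i + 1:i + 2] in ("{", "}", "\\"):
                i += 2  # escaped brace or backslash, not group structure
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            elif depth == 1:
                kept.append(ch)  # keep only the picture group's own text
            i += 1
        if depth != 0:
            return  # unbalanced braces: give up on the rest of the file
        yield "".join(kept)
        start = rtf_text.find("{\\pict", i)


def pict_body_to_bytes(body):
    """Decode a hexadecimal picture body into raw bytes."""
    hex_text = _CONTROL_WORD.sub("", body)   # drop \pngblip, \picw100, ...
    hex_text = "".join(hex_text.split())     # drop spaces and line breaks
    return bytes.fromhex(hex_text)


def extract_pictures(rtf_text):
    """Return the decoded bytes of every hexadecimal picture in rtf_text."""
    return [pict_body_to_bytes(body) for body in iter_pict_bodies(rtf_text)]
```

For a file whose picture is a \pngblip, each item returned by
extract_pictures(open(path).read()) could be written straight to a .png file
opened in "wb" mode.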

Now that you've caught your rabbit... I'm out of advice; I've never used PIL
(though I used to listen to them all the time.)

-- 
www.fsrtechnologies.com


