Extract an image from an RTF file

Curt Hash curt.hash at gmail.com
Sat Feb 14 14:15:49 EST 2009


On Sat, Feb 14, 2009 at 11:01 AM, Terry Reedy <tjreedy at udel.edu> wrote:
>
> Bryan.Fodness at gmail.com wrote:
>>
>> I have a large number of RTF files where the only thing in them is an
>> image.  I would like to extract the images and save them as PNG files.
>> Eventually, I would like to also grab some text that is on the image.
>> I think PIL has something for this.
>>
>> Does anyone have any suggestion on how to start this?
>
> The Wikipedia article on Rich Text Format has several links, which lead to
> http://pyrtf.sourceforge.net/
> http://code.google.com/p/pyrtf-ng/
> The former says RTF generation, including images.
> The latter says RTF generation and parsing, but only claims to be a rewrite of the former.

I've written an RTF parser in Python before, but for the purpose of
filtering and discarding content rather than extracting it.

Take a look at the specification here:
http://www.microsoft.com/downloads/details.aspx?familyid=dd422b8d-ff06-4207-b476-6b5396a18a2b&displaylang=en

You will find that each image is stored in a \pict group: one or more
RTF control words followed by a long string of hex data. For this
special purpose, you will not need to write a parser for the entire
specification. Just search the file for the correct sequence of control
words, extract the hex data that follows, decode it, and save it to a
file.

It helps if you open the RTF document in a text editor and locate the
specific control group that contains the image, as the format and
order of the control words vary depending on the application that
created it. If all of your documents were created with the same
application, it will be much easier.
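
As a rough sketch of the search-and-extract approach described above
(not a full RTF parser), something along these lines might be a starting
point. It assumes the image data is hex-encoded inside a \pict group
(binary data introduced by \bin is not handled), that the group contains
no escaped braces, and that any nested groups (such as \blipuid) can be
skipped. The function name extract_pictures and the extension table are
made up for illustration.

```python
import re

# Start of a picture group: "{\pict" not followed by further letters.
PICT_START = re.compile(r"\{\\pict(?![A-Za-z])")
# A control word: backslash, letters, optional signed number, optional space.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)-?\d* ?")
# Control words naming the image format, mapped to a file extension.
BLIP_EXT = {"pngblip": "png", "jpegblip": "jpg", "emfblip": "emf"}


def extract_pictures(rtf_text):
    """Yield (extension, image_bytes) for each hex-encoded \\pict group."""
    pos = 0
    while True:
        match = PICT_START.search(rtf_text, pos)
        if match is None:
            return
        # Walk to the matching closing brace, keeping only the text that
        # sits directly inside the \pict group (nested groups are skipped).
        depth = 0
        top_level = []
        i = match.start()
        while i < len(rtf_text):
            ch = rtf_text[i]
            if ch == "{":
                depth += 1
                top_level.append(" ")  # keep tokens around a group apart
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            elif depth == 1:
                top_level.append(ch)
            i += 1
        pos = i + 1
        body = "".join(top_level)

        # Pick the extension from the format control word, if recognised.
        words = CONTROL_WORD.findall(body)
        ext = next((BLIP_EXT[w] for w in words if w in BLIP_EXT), "bin")

        # Drop the control words, then everything that is not a hex digit.
        hex_digits = re.sub(r"[^0-9A-Fa-f]", "", CONTROL_WORD.sub("", body))
        # bytes.fromhex raises ValueError if the digit count is odd.
        yield ext, bytes.fromhex(hex_digits)
```

To use it, read one of the RTF files as text (for example with
open(path, encoding="latin-1")), loop over extract_pictures(text), and
write each chunk of bytes to a file opened in "wb" mode. If the group
uses \pngblip, the bytes should already be a PNG file; other formats
would need a separate conversion step.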


