.doc to html and pdf conversion with python

Paul McNett p at ulmcnett.com
Sat Oct 14 21:42:18 CEST 2006

Alexander Klingenstein wrote:
> I need to take a bunch of .doc files (Word 2000), which contain a little text with some tables/layout and mostly pictures, convert them to PDF, and also extract the text and images separately. If I have a PDF, I can create the HTML with pdftohtml called from Python with popen. However, I need an automated way to convert the .doc to PDF first.
> Is there a way to do what I want either with a python lib, 3rd party app, or maybe remote controlling Word (a la VBA) by "printing" to PDF with a distiller?
> I already tried wvWare from gnuwin32; however, it has problems with big image files embedded in a .doc file (looks like an mmap error).

I would try scripting OpenOffice from Python, using the Python-UNO bridge.
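
A rough sketch of what that could look like, not a tested recipe. It assumes
an OpenOffice instance is already running and was started with its accept
option so that it listens on a local socket (port 2002 here is an arbitrary
choice), and that the script runs under a Python that can import the uno
module. The function names pdf_path_for and convert_doc_to_pdf are made up
for this example.

```python
from pathlib import Path


def pdf_path_for(doc_path):
    """Return the .pdf path that sits next to the given .doc file."""
    return Path(doc_path).with_suffix(".pdf")


def convert_doc_to_pdf(doc_path, host="localhost", port=2002):
    """Ask a running OpenOffice instance to export doc_path as PDF."""
    # Imported here so the module still loads where PyUNO is not installed.
    import uno
    from com.sun.star.beans import PropertyValue

    def prop(name, value):
        p = PropertyValue()
        p.Name = name
        p.Value = value
        return p

    # Connect to the office process listening on host:port (an assumption).
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext"
        % (host, port))
    desktop = ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)

    src = Path(doc_path).resolve()
    dst = pdf_path_for(src)

    # Load the document without showing a window, then export it
    # through the Writer PDF export filter.
    doc = desktop.loadComponentFromURL(
        src.as_uri(), "_blank", 0, (prop("Hidden", True),))
    try:
        doc.storeToURL(
            dst.as_uri(), (prop("FilterName", "writer_pdf_Export"),))
    finally:
        doc.close(True)
    return dst
```

Looping convert_doc_to_pdf over a directory of .doc files would give one PDF
next to each source file.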


Once you have the PDF, use pdftohtml to get access to the image 
elements you need.
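
For illustration, one way to drive pdftohtml from Python, assuming the
pdftohtml binary is on the PATH and that the images it extracts land in the
output directory as .png or .jpg files (this can vary by version). The helper
names pdftohtml_command and extract_html_and_images are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path, out_dir):
    """Build the pdftohtml command line for one PDF."""
    pdf = Path(pdf_path)
    # pdftohtml takes the output name without extension as second argument.
    out_base = Path(out_dir) / pdf.stem
    # -noframes asks for a single HTML file instead of a frameset.
    return ["pdftohtml", "-noframes", str(pdf), str(out_base)]


def extract_html_and_images(pdf_path, out_dir):
    """Run pdftohtml and return the image files found in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdftohtml exits non-zero.
    subprocess.run(pdftohtml_command(pdf_path, out), check=True)
    image_suffixes = {".png", ".jpg", ".jpeg"}
    return sorted(
        p for p in out.iterdir() if p.suffix.lower() in image_suffixes)
```

Using a separate output directory per PDF keeps the extracted images of
different documents from mixing.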

pkm ~ http://paulmcnett.com
