.doc to html and pdf conversion with python

Eric_Dexter at msn.com Eric_Dexter at msn.com
Sun Oct 15 06:22:29 CEST 2006


Google won't do a good job with .doc files, but they may do PDF to HTML
and back. It's one file at a time; I just mentioned it to make fun of
them. Here is my resume, converted from a monster.com .doc file:

http://docs.google.com/View?docid=dftrj73t_3cfwjdv


Luap777 at gmail.com wrote:
> Alexander Klingenstein wrote:
> > I need to take a bunch of .doc files (Word 2000), which have a little
> > text including some tables/layout and mostly pictures, convert them
> > to PDF, and extract the text and images separately too. If I have a
> > PDF, I can create the HTML with pdftohtml called from Python with
> > popen. However, I need an automated way to convert the .doc to PDF
> > first.
>
> Is there some reason you really want to convert to PDF first? You can
> get much better HTML right from the Word doc. You'll lose a lot of info
> going from PDF to HTML.
>
> Something like this can open doc in Word, save as HTML, then close doc.
>
> import os
> import win32com.client
>
> # EnsureDispatch generates the type-library wrappers, which is what
> # makes win32com.client.constants available.
> wdApp = win32com.client.gencache.EnsureDispatch("Word.Application")
> wdApp.Visible = 1
>
> def SaveDocAsHTML(docPath, htmlPath):
>     # Word resolves relative paths against its own working directory,
>     # so pass absolute paths.
>     doc = wdApp.Documents.Open(os.path.abspath(docPath))
>     # See mk:@MSITStore:C:\Program%20Files\Microsoft%20Office\OFFICE11\1033\VBAWD10.CHM::/html/womthSaveAs1.htm
>     # in Word VBA help doc for more info.
>
>     # Saves all text and formatting with HTML tags so that the
>     # resulting document can be viewed in a Web browser.
>     doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatHTML)
>     # Saves text with HTML tags with minimal cascading style sheet
>     # formatting. The resulting document can be viewed in a Web browser.
>     #doc.SaveAs(os.path.abspath(htmlPath), win32com.client.constants.wdFormatFilteredHTML)
>     doc.Close()
>
> And if you aren't satisfied with the ugly HTML you're likely to get,
> you can also try running µTidylib (http://utidylib.berlios.de/) on the
> output after this step.
> 
> Thank you,
> Paul
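
Since the original question was about "a bunch of .doc files", here is a
rough sketch of how Paul's function might be looped over a whole folder.
It is only an illustration under some assumptions: plan_outputs() and
convert_all() are made-up names, not part of pywin32 or Word, and the
conversion step needs Windows with Word and pywin32 installed. The
path-planning helper is plain Python and can be tried anywhere.

```python
import os


def plan_outputs(doc_dir, out_dir, ext=".html"):
    """Pair each .doc file in doc_dir with an absolute output path in out_dir.

    Hypothetical helper for illustration; not part of pywin32 or Word.
    """
    pairs = []
    for name in sorted(os.listdir(doc_dir)):
        base, suffix = os.path.splitext(name)
        # Skip non-Word files and the "~$" lock files Word leaves behind.
        if suffix.lower() != ".doc" or name.startswith("~$"):
            continue
        pairs.append((os.path.abspath(os.path.join(doc_dir, name)),
                      os.path.abspath(os.path.join(out_dir, base + ext))))
    return pairs


def convert_all(doc_dir, out_dir):
    """Save every .doc in doc_dir as HTML in out_dir (Windows + Word only)."""
    # Imported here so plan_outputs() stays usable without pywin32.
    import win32com.client

    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    word = win32com.client.gencache.EnsureDispatch("Word.Application")
    word.Visible = 0
    try:
        for doc_path, html_path in plan_outputs(doc_dir, out_dir):
            doc = word.Documents.Open(doc_path)
            try:
                doc.SaveAs(html_path, win32com.client.constants.wdFormatHTML)
            finally:
                doc.Close()
    finally:
        # Quit Word even if one of the documents failed to convert.
        word.Quit()
```

Calling convert_all(r"C:\docs", r"C:\docs\html") would be the intended
use (both paths are placeholders). When Word saves as HTML it normally
puts the pictures in a companion folder next to each .html file, which
may already cover the "images separately" part of the question.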



