parsing MS word docs -- tutorial request

Terry Reedy tjreedy at udel.edu
Wed Oct 29 15:04:09 EDT 2008


Kay Schluehr wrote:
> On 28 Oct., 15:25, bp.tralfamad... at gmail.com wrote:
>> All,
>>
>> I am trying to write a script that will parse and extract data from a
>> MS Word document.  Can / would anyone refer me to a tutorial on how to
>> do that?  (perhaps from tables).  I am aware of, and have downloaded
>> the pywin32 extensions, but am unsure of how to proceed -- I'm not
>> familiar with the COM API for word, so help for that would also be
>> welcome.
>>
>> Any help would be appreciated.  Thanks for your attention and
>> patience.
>>
>> ::bp::
> 
> One can convert MS-Word documents into some class of XML documents
> called MHTML. If I remember correctly those documents had an .mht
> extension. The result is a huge amount of (nevertheless structured)
> markup gibberish together with text. If one spends time and attention,
> one can find patterns in the markup (we have XML and it's human
> readable).

A related solution is to use OpenOffice to convert the document to
OpenDocument Format (ODF), a zip archive containing several XML files,
and then use ODFPY to parse the XML and access the contents as linked
objects.
http://opendocumentfellowship.com/development/projects/odfpy
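
To give an idea of what the converted file looks like from Python, here
is a minimal sketch that pulls table text out of an .odt file. It does
not use ODFPY; it only uses the standard library (zipfile and
xml.etree.ElementTree), relying on the fact that the document body lives
in content.xml inside the zip archive. The function name extract_tables
is my own invention, and the sketch assumes simple tables: no nested
tables, and no repeated or merged cells.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used in an ODF content.xml file.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def extract_tables(odt_path):
    """Return each table as a list of rows, each row a list of cell strings.

    Hypothetical helper, not part of ODFPY. Assumes simple tables:
    nested tables and repeated/merged cells are not handled.
    """
    # An .odt file is a zip archive; the document body is in content.xml.
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        # iter() also finds rows wrapped in a header-rows element.
        for row in table.iter("{%s}table-row" % NS["table"]):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                # A cell holds one or more paragraphs; itertext() also
                # collects text inside spans and other inline markup.
                paragraphs = cell.findall("text:p", NS)
                cells.append(
                    "\n".join("".join(p.itertext()) for p in paragraphs)
                )
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling extract_tables("report.odt") (the file name is only a
placeholder) returns a list with one entry per table. For anything
beyond plain cell text, ODFPY's object model is likely the better tool.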



