parsing MS word docs -- tutorial request
Terry Reedy
tjreedy at udel.edu
Wed Oct 29 15:04:09 EDT 2008
Kay Schluehr wrote:
> On 28 Oct., 15:25, bp.tralfamad... at gmail.com wrote:
>> All,
>>
>> I am trying to write a script that will parse and extract data from a
>> MS Word document. Can / would anyone refer me to a tutorial on how to
>> do that? (perhaps from tables). I am aware of, and have downloaded
>> the pywin32 extensions, but am unsure of how to proceed -- I'm not
>> familiar with the COM API for Word, so help with that would also be
>> welcome.
>>
>> Any help would be appreciated. Thanks for your attention and
>> patience.
>>
>> ::bp::
>
> One can convert MS Word documents into some class of XML documents
> called MHTML. If I remember correctly, those documents have an .mht
> extension. The result is a huge amount of (nevertheless structured)
> markup gibberish together with the text. If one spends time and
> attention, one can find patterns in the markup (we have XML, and it
> is human-readable).
A related solution is to use OpenOffice to convert the document to
OpenDocument Format (ODF), a zip archive containing several XML files,
and then use odfpy to parse the XML and access the contents as linked
objects.
http://opendocumentfellowship.com/development/projects/odfpy
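
To make the "zipped XML" point concrete, here is a minimal sketch that
pulls table cells out of an .odt file using only the standard library
(zipfile and xml.etree.ElementTree) rather than odfpy. It assumes the
document has already been saved as .odt from OpenOffice. The function
name read_tables and the file name report.odt are made up for
illustration. It only handles simple tables: rows wrapped in header-row
groups and repeated or merged cells are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument specification.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def read_tables(odt_file):
    """Return every table in an .odt file as a list of rows of cell strings.

    odt_file may be a path or a binary file-like object.
    (read_tables is a hypothetical helper, not part of odfpy.)
    """
    # An .odt file is a zip archive; the document body lives in content.xml.
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    tables = []
    for table in root.iter("{%s}table" % NS["table"]):
        rows = []
        # Only direct child rows are read; header-row groups are skipped.
        for row in table.findall("table:table-row", NS):
            cells = []
            for cell in row.findall("table:table-cell", NS):
                paragraphs = cell.findall("text:p", NS)
                # Join the text of each paragraph, including nested spans.
                cells.append(
                    "\n".join("".join(p.itertext()) for p in paragraphs)
                )
            rows.append(cells)
        tables.append(rows)
    return tables
```

Usage would look something like this:

    for table in read_tables("report.odt"):
        for row in table:
            print(row)

For anything beyond plain tables (styles, spans, nested content), odfpy
is likely the more comfortable route, since it exposes the same
elements as Python objects.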