parsing MS word docs -- tutorial request
kay.schluehr at gmx.net
Wed Oct 29 18:52:12 CET 2008
On 28 Oct., 15:25, bp.tralfamad... at gmail.com wrote:
> I am trying to write a script that will parse and extract data from a
> MS Word document. Can / would anyone refer me to a tutorial on how to
> do that? (perhaps from tables). I am aware of, and have downloaded
> the pywin32 extensions, but am unsure of how to proceed -- I'm not
> familiar with the COM API for word, so help for that would also be
> Any help would be appreciated. Thanks for your attention and
One can save MS Word documents as MHTML, a single-file web archive
that wraps Word's HTML/XML markup. If I remember correctly, those
documents have an .mht extension. The result is a huge amount of
(nevertheless structured) markup gibberish together with text. If one
spends time and attention, one can find patterns in the markup ( we
have XML and it's human
A few years ago I used this conversion to implement roughly the following:
1. I manually highlighted one or more sections in a Word doc using a
background colour marker.
2. I searched for the colour-marked section and determined its
structure. The structure information was fed into a state machine.
3. With this state machine I searched for all sections that were
4. I applied an href link to the text that was surrounded by the
structure and removed the colour marker.
5. In another document I searched for the same text and set an anchor.
This way I could link two documents (these were public specifications
that were originally not connected to each other).
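
The steps above can be sketched in Python with only the standard
library. This is a minimal illustration, not the original code: the
function names, the target file name and the highlight markup are
assumptions. It assumes Word writes highlighted text as a <span> whose
style contains "background:yellow", that the highlighted text contains
no nested markup, and it uses a regular expression in place of the
state machine mentioned in step 2.

```python
import email
import re
from email import policy

# Assumed shape of Word's highlight markup:
#   <span style='background:yellow;mso-highlight:yellow'>text</span>
HIGHLIGHT = re.compile(
    r"<span\s+style=(['\"])[^'\"]*background:\s*yellow[^'\"]*\1\s*>(.*?)</span>",
    re.IGNORECASE | re.DOTALL,
)


def html_from_mhtml(raw):
    """Return the first text/html part of an MHTML (.mht) archive.

    An .mht file is a MIME message, so the email package can parse it.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return part.get_content()
    raise ValueError("no text/html part found")


def slug(text):
    """Turn marked text into a name usable as a link fragment."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def link_highlighted(html, target):
    """Replace each highlighted span with a link into `target`.

    Returns the new markup and the list of texts that were linked.
    """
    found = []

    def repl(match):
        text = match.group(2)
        found.append(text)
        return '<a href="%s#%s">%s</a>' % (target, slug(text), text)

    return HIGHLIGHT.sub(repl, html), found


def add_anchors(html, texts):
    """Set a named anchor at the first occurrence of each text.

    Plain string replacement: it does not check whether the match
    sits inside a tag or attribute.
    """
    for text in texts:
        anchor = '<a name="%s">%s</a>' % (slug(text), text)
        html = html.replace(text, anchor, 1)
    return html
```

Typical use would be to read both .mht files as bytes, call
html_from_mhtml() on each, run link_highlighted() on the first with the
second document's file name as target, and pass the returned list of
texts to add_anchors() for the second. Real Word output nests spans
heavily, so a proper HTML parser would be more robust than the regular
expression shown here.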