[Tutor] Parsing Word Docs

Thu Mar 8 15:40:22 CET 2007

Hello all,

I have a directory containing a load of Word documents, say 100 or so,
and the directory is updated every hour.

I want a CGI script that effectively does a grep on the Word docs and
returns each doc that matches the search term.

I've had a look at doing this by reading each binary file and
reimplementing strings(1) to capture the useful text.  I've also read
that one can treat a Word doc as a COM object.  Am I right in thinking
that I can't do this from Python under Unix?
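
For reference, here is a minimal sketch of the strings(1)-style approach
I have in mind.  It assumes the docs are old binary .doc files whose text
shows up either as plain 8-bit runs or as UTF-16LE runs, and it only
catches printable ASCII characters, so accented text would be missed.
The names extract_strings and matching_docs are made up for
illustration, not taken from any library.

```python
import re
from pathlib import Path

MIN_RUN = 4  # minimum run length, the same default strings(1) uses

# Runs of printable ASCII stored one byte per character.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_RUN)
# Runs of printable ASCII stored as UTF-16LE (each character followed by NUL).
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_RUN)


def extract_strings(data):
    """Return the printable text runs found in raw bytes, like strings(1)."""
    found = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return found


def matching_docs(directory, term):
    """Return sorted names of *.doc files whose extracted text contains term."""
    term = term.lower()
    hits = []
    for path in sorted(Path(directory).glob("*.doc")):
        # Case-insensitive substring match over everything extracted.
        text = "\n".join(extract_strings(path.read_bytes())).lower()
        if term in text:
            hits.append(path.name)
    return hits
```

The CGI script would then only need to call something like
matching_docs("/path/to/docs", search_term) and print the result.  With
around 100 files, rescanning on every request is probably cheap enough
that no index is needed, but that is a guess rather than a measurement.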

What other ways are there?  Or is binary parsing the way to go?

S.