[Baypiggies] Python and MS Word docs; Argghhhhhhhhh......

Glen Jarvis glen at glenjarvis.com
Wed Feb 3 05:30:47 CET 2010


No, that isn't pirate talk.. unless you want it to be...

That is an interesting new problem that was put on my plate... at 8
pm...  to be solved by morning....  argh matey...

Unfortunately, many bioinformatics teams have a disconnect between
computer scientists and biologists...  For example, I, as a computer
scientist, sometimes hear Charlie Brown's teacher (mwah wah wah mwah)
and don't even understand what I'm supposed to do (and thus I should
take more biology courses)... And the biologists sometimes don't
understand the benefits and limitations of technology or what they are
asking for...

One of the things that has been requested of us is to take an MS Word
file that has been used to enter plain text. The file should be
uploaded via a web page (done). We should then strip out all of the MS
Word formatting so that we process only the text (And why not just
upload a plain text file, again? hmm? Plain text is what is *really*
wanted).

In my introduction to Python a few years ago, I remember reading that
there are Python modules to read MS Word files. Can these libraries be
run on Linux/Unix, or is a .NET framework needed (we're a Linux-only
shop)?

Most importantly, can this be done? Please say we can do something
like this on any platform because Python rocks:

from dot_net import MSWordDoc

word_file = open('my_example.doc', 'rb')  # binary mode: a .doc is not text
word_doc = MSWordDoc(word_file)
word_file.close()

text_only = word_doc.convert_to_text(encoding='ascii')

Obviously I made up that syntax. If anyone ever finds this on the web
while looking for the same answer, *don't* use the above code.. It's
fake...
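
For contrast with the fake snippet, here is a rough sketch of one route
that might work on a Linux-only box without .NET: shelling out to an
external converter. It assumes the antiword command-line tool is
installed and that running it with a file path prints the text to
stdout. The function name doc_to_text and the UTF-8 decoding are my own
assumptions, not part of any library.

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Return the plain text of a legacy .doc file via an external tool.

    Assumption: `tool` is a converter on PATH that takes the file path
    as its only argument and writes the extracted text to stdout.
    """
    if shutil.which(tool) is None:
        raise RuntimeError("%s is not installed or not on PATH" % tool)
    result = subprocess.run(
        [tool, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the converter fails
    )
    # Assumption: the output is UTF-8 compatible; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be along the lines of text_only = doc_to_text('my_example.doc').
Keeping the converter behind one small function means it can be swapped
for a different tool later without touching the upload code.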

Cheers,


Glen
P.S. Bonus if I can get an equivalent of the Unix "file" utility:
> file sillywalk.doc
sillywalk.doc: Microsoft Office Document
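
A minimal sketch of what such a check might look like in Python,
assuming it is enough to look at the first few bytes of the upload.
Legacy Office files (.doc, but also .xls and .ppt) are OLE2 compound
files that start with a fixed 8-byte signature, so this can only say
"Microsoft Office document", not "Word" specifically. The function name
sniff_office_file and the returned strings are made up for illustration.

```python
# Signature at the start of every OLE2 compound file (legacy MS Office).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Local file header signature of a ZIP archive (.docx is a ZIP container).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_file(path):
    """Guess the container type of a file from its leading bytes."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE2_MAGIC):
        return "OLE2 compound document (legacy Microsoft Office)"
    if header.startswith(ZIP_MAGIC):
        return "ZIP container (possibly .docx)"
    return "unknown"
```

Checking the signature before handing the file to a converter would let
the upload page reject a renamed .txt or .pdf early.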

