[Baypiggies] Python and MS Word docs; Argghhhhhhhhh......

Jeff Enderwick jeff.enderwick at gmail.com
Wed Feb 3 05:49:44 CET 2010

That sucks! In all seriousness, if you only have to support the later 'docx'
format, those files are actually zip archives packed with XML. The body text
lives in one of the XML parts inside (word/document.xml), and you can fish it
out from there.
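
A minimal sketch of the docx route, using only the standard library. The
function name docx_to_text is made up for illustration, and this only collects
the w:t text runs of each paragraph; tabs, line breaks, headers, footnotes and
text boxes would need extra handling.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the text of a .docx file, one line per paragraph.

    `source` may be a path or a binary file-like object (for example an
    uploaded file). Hypothetical helper, not part of any library.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t runs.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Calling docx_to_text("my_example.docx") should hand back a plain str that you
can encode however you need.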

If you have to support the older formats, there is the Apache POI project, a
Java library that reads them (I have not played with it).
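
To tell which of the two cases an upload falls into (and as a rough stand-in
for the Unix "file" utility from the P.S.), one option is to look at the
leading bytes. The sketch below assumes the older .doc files are OLE2 compound
documents, which start with a fixed 8-byte signature, and that a docx is a zip
containing word/document.xml. The function name and the labels it returns are
made up for illustration.

```python
import zipfile

# Signature at the start of OLE2 compound files (legacy .doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess the container format of a file from its contents.

    Returns "ole2", "docx", "zip" or "unknown". Hypothetical helper; the
    "ole2" answer only says the file is a compound document, not that it
    is specifically a Word file.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)

    if header == OLE2_MAGIC:
        return "ole2"
    if header.startswith(b"PK\x03\x04") and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip"
    return "unknown"
```

This checks content rather than the file extension, so a renamed plain text
file comes back as "unknown" instead of being passed to a Word parser.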

When I was up against this, I had the docs saved "as a web page" and then
consumed them with Beautiful Soup. I needed structure, style tags, etc, and
'Soup did well by me.

On Tue, Feb 2, 2010 at 8:30 PM, Glen Jarvis <glen at glenjarvis.com> wrote:

> No, that isn't pirate talk.. unless you want it to be...
> That is an interesting new problem that was put on my plate... at 8
> pm...  to be solved by morning....  argh matey...
> Unfortunately, many bioinformatics teams have a disconnect between
> computer science and biologists...  For example, I, as a computer
> scientist, sometimes hear Charlie Brown's teacher (mwah wah wah mwah)
> and don't understand even what I'm supposed to do (and thus I should
> take more biology courses)... And, the biologists sometimes don't
> understand the benefit and limitations of technology and what they are
> asking for...
> One of the things that has been requested of us is to take an MS Word
> file that has been used to enter plain text. The file should be
> uploaded via webpage (done). The file should strip out all of the MS
> Word formatting so that we process only the text (And why not just
> upload a plain text file again? hmm? This is what is *really* wanted).
> In my introduction to Python a few years ago, I remember reading that
> there are python modules to read MS Word. Can these libraries be run
> on Linux/Unix, or is a .NET framework needed (we're a Linux only
> shop)?
> Most importantly, can this be done? Please say we can do something
> like this on any platform because Python rocks:
> from dot_net import MSWordDoc
> word_file = open('my_example.doc', 'r')
> word_doc = MSWordDoc(word_file)
> word_file.close()
> text_only = word_doc.convert_to_text(encoding='ascii')
> Obviously I made up that syntax. If anyone ever finds this on the web
> looking for the same answer, *don't* use the above code.. It's
> fake...
> Cheers,
> Glen
> P.S. Bonus if I can get an equivalent of the Unix "file" utility:
> > file sillywalk.doc
> sillywalk.doc: Microsoft Office Document
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies