[Tutor] extracting text from word files (.doc, .docx) and pdf

Emile van Sebille emile at fenx.com
Wed Jan 26 00:59:38 CET 2011


On 1/25/2011 1:52 PM Juan Jose Del Toro said...
> Dear List;
>
> I am looking for a way to extract parts of a text from word (.doc,.docx)

I recently did a project extracting data from Word documents using
antiword (http://www.winfield.demon.nl/), which I called like this:

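import commands  # Python 2 only; removed in Python 3

def setContent(self):
    # Run antiword on the document and keep its non-empty output lines.
    # Each line is stripped of surrounding whitespace and of a stray
    # character sequence. "doc" is assumed to hold the path to the
    # .doc file and to be defined elsewhere.
    self.content = [
        ii.strip().replace("Ëš", "")
        for ii in
        commands.getoutput('/usr/local/bin/antiword "%s"' % doc).split("\n")
        if ii
    ]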
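
The commands module no longer exists in Python 3. A minimal sketch of the same idea using subprocess follows. The names clean_lines and extract_doc_lines are hypothetical, and it assumes antiword is installed and writes the document text to standard output. Unlike the snippet above, it also drops lines that contain only whitespace.

```python
import subprocess


def clean_lines(text):
    """Split text into lines, strip whitespace, and drop blank lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def extract_doc_lines(path, antiword="antiword"):
    """Run antiword on a .doc file and return its cleaned text lines.

    Assumes the antiword executable is installed; pass a full path
    (e.g. /usr/local/bin/antiword) if it is not on PATH.
    """
    # Passing the arguments as a list avoids shell quoting problems
    # with unusual file names.
    result = subprocess.run(
        [antiword, path],
        capture_output=True,  # requires Python 3.7 or later
        text=True,
        check=True,  # raise CalledProcessError if antiword fails
    )
    return clean_lines(result.stdout)
```

Something like extract_doc_lines("report.doc") (a made-up file name) would then return the document's text as a list of lines. As far as I know, antiword reads only the older binary .doc format, so .docx and PDF files would need a different tool.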


Emile


