[Tutor] Convert doc to txt on Ubuntu

Rich Lovely roadierich at googlemail.com
Thu Sep 17 00:16:02 CEST 2009


2009/9/16 Carnell, James E <jecarnell at saintfrancis.com>:
>
> I need to access the text in hundreds of Microsoft .doc files on an
> Ubuntu OS. I looked at win32, but only saw support for Windows. I am going
> through all of these files to create a fairly simple delimited text file for
> a spreadsheet.
>
> A) Batch convert to text files so I can access them
> B) import some module that allows me to decode this format
> C) OpenOffice allows batch conversion to .odt, but I still don't know how to
> access those files
> D) Buy a 24 pack, some Twinkies, and go watch David Hasselhoff reruns
>
> Opening .txt documents works fine.
>
> Currently get:
>
> inFile = open("myTestFile.doc", "r")
> testRead = inFile.read()
>
> Traceback (most recent call last):
>   File "<pyshell#11>", line 1, in <module>
>     test = inFile.read()
>   File "/usr/lib/python3.0/io.py", line 1728, in read
>     decoder.decode(self.buffer.read(), final=True))
>   File "/usr/lib/python3.0/io.py", line 1299, in decode
>     output = self.decoder.decode(input, final=final)
>   File "/usr/lib/python3.0/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid
> data
>
> Any help greatly appreciated. Thanks bunches.

FYI, OpenOffice .odt files are zip archives of XML files.  It should
be straightforward to extract the text from them, assuming OO does a
sensible job of converting from the bloated .doc format.
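
A minimal sketch of that idea in Python, assuming the .doc files have
already been batch-converted to OpenDocument text files.  The function
name odt_paragraphs is made up for illustration; content.xml and the
text namespace URI come from the OpenDocument format itself.  Treat it
as a starting point rather than a complete converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OpenDocument for text elements such as <text:p>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(path):
    """Return a list with the text of each heading and paragraph.

    Hypothetical helper: it only reads content.xml and ignores styles,
    tables layout, and everything else in the archive.
    """
    # The document is a zip archive; the body lives in content.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    # ElementTree spells namespaced tags as "{namespace}localname".
    wanted = {"{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS}

    # itertext() also collects text from nested elements such as spans.
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]
```

Something like "\n".join(odt_paragraphs("myTestFile.odt")) would then
give the plain text of one converted file (the filename is only an
example).  Known gaps in this sketch: runs of spaces and tabs stored as
<text:s> and <text:tab> elements are not expanded, and a paragraph
nested inside another one (in a text frame, say) will show up twice.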

-- 
Rich "Roadie Rich" Lovely

There are 10 types of people in the world: those who know binary,
those who do not, and those who are off by one.

