Converting .doc to .txt in Linux

Carl Banks pavlovevidence at gmail.com
Fri Sep 5 06:31:07 CEST 2008


On Sep 4, 4:18 pm, Tommy Nordgren <tommy.nordg... at comhem.se> wrote:
> On Sep 4, 2008, at 9:54 PM, patrick.wa... at gmail.com wrote:
>
>
>
> > Hi Everyone,
>
> > I had previously asked a similar question,
> > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> > but at that point I was using Windows and now I am using Linux.
> > Basically, I have some .doc files that I need to convert into txt
> > files encoded in utf-8.  However, win32com.client doesn't work in
> > Linux.
>
> > It's been giving me quite a headache all day.  Any ideas would be
> > greatly appreciated.
>
> > Best,
> > Patrick
>
> > #Windows Code:
> > import glob,os,codecs,shutil,win32com.client
> > from win32com.client import Dispatch
>
> > input = '/home/pwaldo2/work/workbench/current_documents/*.doc'
> > input_dir = '/home/pwaldo2/work/workbench/current_documents/'
> > outpath = '/home/pwaldo2/work/workbench/current_documents/TXT/'
>
> > # Start Word once, save each document as text (format 7), then quit.
> > WordApp = Dispatch("Word.Application")
> > WordApp.Visible = 1
> > for doc in glob.glob(input):
> >    WordApp.Documents.Open(doc)
> >    WordApp.ActiveDocument.SaveAs(doc,7)
> >    WordApp.ActiveDocument.Close()
> > WordApp.Quit()
>
> > # Copy each converted file into the TXT directory under a .txt name.
> > for doc in glob.glob(input):
> >    txt_name = os.path.splitext(os.path.basename(doc))[0] + '.txt'
> >    txt_doc_path = os.path.join(outpath,txt_name)
> >    shutil.copy(doc,txt_doc_path)
> > --
> >http://mail.python.org/mailman/listinfo/python-list
>
> You can do it manually with OpenOffice <http://www.openoffice.org/>,
> a free office suite.

On Debian there is a package called "unoconv" (written in Python) that
can do the conversions from the command line.  It requires a running
instance of OpenOffice.  However, OpenOffice's doc-to-txt conversion
isn't that good: last time I used it, it wasn't as good as Word's
formatted text converter.
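
For what it's worth, a batch run could be scripted along these lines.
This is only a sketch: it assumes unoconv is on the PATH and that its
"-f txt" option selects plain-text output.  build_command and
convert_all are names made up for this example, not part of unoconv.
Check the encoding of the resulting files too, since nothing here
guarantees UTF-8.

```python
import glob
import os
import subprocess


def build_command(doc_path):
    """Return the argument list for converting one .doc file to text."""
    # Assumption: "unoconv" is on PATH and "-f txt" selects plain text.
    return ["unoconv", "-f", "txt", doc_path]


def convert_all(pattern, runner=subprocess.call):
    """Run the converter on every file matching pattern.

    Returns the input paths for which the runner reported failure
    (a non-zero exit status).
    """
    failed = []
    for doc_path in sorted(glob.glob(pattern)):
        if runner(build_command(doc_path)) != 0:
            failed.append(doc_path)
    return failed


if __name__ == "__main__":
    # Directory taken from the paths in the original question.
    pattern = "/home/pwaldo2/work/workbench/current_documents/*.doc"
    for path in convert_all(pattern):
        print("conversion failed:", path)
```

The runner argument is there only so the loop can be tried out without
OpenOffice installed, by passing a function that records the commands
instead of running them.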


Carl Banks


