Problem Converting Word to UTF8 Text File

patrick.waldo at gmail.com
Sun Oct 21 14:32:57 EDT 2007


Indeed, the shutil.copyfile(doc,txt_doc) was causing the problem for
the reason you stated.  So, I changed it to this:

for doc in glob.glob(input):
    txt_split = os.path.splitext(doc)
    txt_doc = txt_split[0] + '.txt'
    txt_doc_dir = os.path.join(input_dir,txt_doc)
    doc_dir = os.path.join(input_dir,doc)
    shutil.copy(doc_dir,txt_doc_dir)


However, I still cannot read the Unicode from the Word file.  If I take
out the first for-statement, I get a bunch of garbled text, which
isn't helpful.  I would save them all manually, but I want to figure
out how to do it in Python, since I'm just beginning.

My intuition says the problem is with

FileFormat=win32com.client.constants.wdFormatText

because it converts fine to a text file, just not a UTF-8 text file.
How can I modify this, or is there another way to code this type of
file conversion from *.doc to *.txt with Unicode characters?

Thanks
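
One possible way around this, sketched under assumptions and not tested
against Word itself: wdFormatText appears to write the text in the system
code page, which would explain why characters outside that code page do not
survive.  If the document is saved in Word's Unicode text format instead
(wdFormatUnicodeText, which as far as I know produces UTF-16), the result
can be re-encoded with plain Python.  The helper name reencode_to_utf8 and
the file names are hypothetical, and only the re-encoding step is shown,
not the Word automation.

```python
import io


def reencode_to_utf8(src_path, dst_path, src_encoding="utf-16"):
    """Read text stored in src_encoding and write it back out as UTF-8.

    Assumption: src_path is a text file that Word saved as "Unicode text"
    (UTF-16 with a byte order mark). Pass another src_encoding otherwise.
    """
    # newline="" keeps the original line endings instead of translating them
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return text
```

It could be called once per converted file, for example
reencode_to_utf8('report_unicode.txt', 'report.txt'), where both names are
placeholders.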

On Oct 21, 7:02 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Sun, 21 Oct 2007 13:35:43 -0300, <patrick.wa... at gmail.com> wrote:
>
> > Hi all,
>
> > I'm trying to copy a bunch of Microsoft Word documents that have
> > Unicode characters into UTF-8 text files.  Everything works fine at
> > the beginning.  The word documents get converted and new utf-8 text
> > files with the same name get created.  And then I try to copy the data
> > and I keep on getting "TypeError: coercing to Unicode: need string or
> > buffer, instance found".  I'm probably copying the word document
> > wrong.  What can I do?
>
> Always remember to provide the full traceback.
> Where do you get the error? In the last line: shutil.copyfile?
> If the file already contains the text in utf-8, and you just want to make
> a copy, use shutil.copy as before.
> (or, why not tell Word to save the file using the .txt extension in the
> first place?)
>
> > for doc in glob.glob(input):
> >     txt_split = os.path.splitext(doc)
> >     txt_doc = txt_split[0] + '.txt'
> >     txt_doc = codecs.open(txt_doc,'w','utf-8')
> >     shutil.copyfile(doc,txt_doc)
>
> copyfile expects path names as arguments, not a
> codecs-wrapped-file-like-object
>
> --
> Gabriel Genellina
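
The distinction made above can be sketched as follows: shutil.copyfile
takes two path names and copies the bytes unchanged, while
shutil.copyfileobj is the variant that accepts already-open file objects.
This is only an illustration for ordinary text files.  The function names
and the src_encoding parameter are hypothetical, and it does not make a
binary .doc file readable as text.

```python
import io
import shutil


def copy_by_path(src_path, dst_path):
    # copyfile wants two path strings; the bytes are copied as they are
    shutil.copyfile(src_path, dst_path)


def copy_transcoding(src_path, dst_path, src_encoding):
    # copyfileobj works on open file objects; the text-mode wrappers
    # decode from src_encoding on read and encode to UTF-8 on write
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            shutil.copyfileobj(src, dst)
```

Passing a codecs-wrapped file object to copyfile, as in the quoted loop,
mixes the two styles, which would be consistent with the TypeError
reported.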
