encode/decode misunderstanding

Tim Arnold tim.arnold at sas.com
Thu Jul 26 21:16:50 CEST 2007


Hi, I'm beginning to understand the encode/decode string methods, but I'd 
like confirmation that I'm still thinking in the right direction:

I have a file of latin1 encoded text. Let's say I put one line of that file 
into a string variable 'tocline', as follows:
tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'

import codecs
tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
tocline = tocline.decode('latin1','replace')
tocFile.write(tocline)
tocFile.close()

What I think is that tocFile is wrapped to ensure that anything written to 
it is encoded as utf8.
I decode the latin1 string into a python unicode object, and that 
gets written out as utf8.
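
To make that mental model concrete, here is a minimal sketch of the same 
round trip done in memory, without a file. It assumes a Python version that 
accepts the b'' bytes literal (2.6 or later, including 3.x), which the 
snippet above does not use; the names raw, text and utf8_bytes are 
illustrative only.

```python
# The sample line as latin1-encoded bytes.
# Assumption: b'' literals are available (Python 2.6+ or 3.x).
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# decode: bytes in a known encoding -> unicode text
text = raw.decode('latin1', 'replace')

# encode: unicode text -> bytes in the target encoding
utf8_bytes = text.encode('utf8')

# The two accented characters take one byte each in latin1
# and two bytes each in utf8, so the utf8 form is 2 bytes longer.
extra_bytes = len(utf8_bytes) - len(raw)
```

If this picture is right, writing text to a codecs-wrapped file should 
produce the same bytes as utf8_bytes.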

Questions:
What exactly is tocline when it's read in with that \xe9 and \xf3 in the 
string? A latin1 encoded string?
Is my method the right way to write such a line out to a file with utf8 
encoding?

If I read in the latin1 file using
codecs.open(filename,encoding='latin1') and write out the utf8 file by 
opening with
codecs.open(othername,encoding='utf8'), would I no longer have a problem --  
I could just read in latin1 and write out utf8 with no more worries about 
encoding?
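
A minimal sketch of that second approach, so the idea is concrete. The 
helper name transcode, the file names and the temporary directory are all 
hypothetical and exist only for this demonstration; it assumes a Python 
version where codecs.open objects work in a with statement and b'' 
literals are accepted.

```python
import codecs
import os
import tempfile


def transcode(src_path, dst_path, src_enc='latin1', dst_enc='utf8'):
    """Copy src_path to dst_path, decoding as src_enc and encoding as dst_enc."""
    with codecs.open(src_path, 'r', encoding=src_enc) as src:
        with codecs.open(dst_path, 'w', encoding=dst_enc) as dst:
            for line in src:      # each line arrives as unicode text
                dst.write(line)   # and is encoded as dst_enc on write


# Demonstration on a throwaway latin1 file (hypothetical file names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'toc_latin1.htm')
dst_path = os.path.join(workdir, 'toc_utf8.htm')

with open(src_path, 'wb') as f:
    f.write(b'Ficha Datos de p\xe9rdida AND acci\xf3n\n')

transcode(src_path, dst_path)

with open(dst_path, 'rb') as f:
    result = f.read()
```

With both files wrapped this way, the program only ever handles unicode 
text; the decoding and encoding happen at the file boundaries.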

thanks,
--Tim




