Trouble with unicode

M.-A. Lemburg mal at lemburg.com
Tue May 15 07:35:17 EDT 2001


Charlie Clark wrote:
> 
> I'm having trouble converting the contents of e-mails stored as unicode
> files into plain text. I'm not sure if I've understood how to deal with
> unicode :-(
> 
> As usual the problem is with non-ascii characters.
> 
> For example I have the following characters in the mail:
> "ä, Ä, ö, Ö, ü, Ü, ß"
> 
> when I read the mail in Python as a string I get:
> "\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"
> 
> I've followed the example from
> http://www.python.org/2.0/new-python.html
> but don't seem to be getting very far and ascii_decode() gives me the
> following error:
> "UnicodeError: ASCII decoding error: ordinal not in range(128)"

First you should check which encoding your Unicode file actually uses
(people often say "Unicode" when they mean UTF-16 or just UTF-16-LE).
Then you should read the file using codecs.open():

import codecs

# replace encoding with 'utf-16' or 'utf-16-le' or 'utf-16-be'
f = codecs.open(filename, 'rb', encoding)
contents = f.read()
f.close()
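
If you are unsure which encoding the file uses, looking for a byte
order mark (BOM) at the start of the raw bytes is one way to make a
first guess. The following is only a sketch: the helper name
guess_utf_encoding and the fallback default are my own assumptions,
it ignores UTF-32, and a file without a BOM can still be in any
encoding.

```python
import codecs

def guess_utf_encoding(raw, default="latin-1"):
    """Guess an encoding from a leading byte order mark, if present.

    `raw` is the file content (or just its first few bytes) as bytes.
    `default` is an assumed fallback, not something that was detected.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' decodes UTF-8 and drops the BOM.
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The plain 'utf-16' codec reads the BOM and picks the byte order.
        return "utf-16"
    # No BOM found: nothing was detected, so return the caller's default.
    return default
```

You would read the first few bytes of the file in binary mode, pass
them to the helper, and use the result as the encoding when opening
the file.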

Now you can convert the Unicode object contents into a plain
string using some other encoding, e.g. Latin-1, and then
write it back to a text file:

plaintext = contents.encode('latin-1')
# 'wb' writes the encoded bytes unchanged (no newline translation)
open(outfilename, 'wb').write(plaintext)
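
Put together, the two steps might look like the sketch below. It
assumes a Python version where the built-in open() accepts an
encoding argument (Python 3), which does the same job as
codecs.open() here. The function name convert_file, its default
encodings and the choice of errors='replace' are my own illustrative
assumptions; adjust them to your data.

```python
def convert_file(src, dst, src_encoding="utf-16", dst_encoding="latin-1"):
    """Decode `src` from src_encoding and write it to `dst` in dst_encoding."""
    # newline="" keeps the line endings exactly as they are in the source.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        contents = f.read()  # a Unicode string

    # Characters that the target encoding cannot represent become "?"
    # instead of raising UnicodeEncodeError.
    plaintext = contents.encode(dst_encoding, "replace")

    # Binary mode: the encoded bytes are written without any translation.
    with open(dst, "wb") as out:
        out.write(plaintext)
    return plaintext
```

With 'replace', a character such as the Euro sign, which Latin-1 does
not contain, ends up as a question mark in the output. Use the default
'strict' error handling instead if you would rather get an exception.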

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/



