Trouble with unicode
M.-A. Lemburg
mal at lemburg.com
Tue May 15 07:35:17 EDT 2001
Charlie Clark wrote:
>
> I'm having trouble converting the contents of e-mails stored as unicode
> files into plain text. I'm not sure if I've understood how to deal with
> unicode :-(
>
> As usual the problem is with non-ascii characters.
>
> For example I have the following characters in the mail:
> "ä, Ä, ö, Ö, ü, Ü, ß"
>
> when I read the mail in Python as a string I get:
> "\xe4, \xc4, \xf6, \xd6, \xfc, \xdc, \xdf"
>
> I've followed the example from
> http://www.python.org/2.0/new-python.html
> but don't seem to be getting very far and ascii_decode() gives me the
> following error:
> "UnicodeError: ASCII decoding error: ordinal not in range(128)"

First you should check which encoding your Unicode file actually uses
(the term "Unicode" is often used loosely to mean UTF-16 or just
UTF-16-LE). Then you should read the file using codecs.open():

    import codecs

    # replace encoding with 'utf-16' or 'utf-16-le' or 'utf-16-be'
    f = codecs.open(filename, 'rb', encoding)
    contents = f.read()    # returns a Unicode object, already decoded
    f.close()
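
If it is not clear which of these encodings applies, the first bytes of
the file can give a hint. The following is only a rough sketch and not
something the codecs module provides: guess_utf_encoding() is a made-up
helper name, and it assumes the file starts with a byte order mark (BOM),
which many UTF-16 files do, but not all. It only knows about the UTF-8
and UTF-16 marks:

```python
import codecs

def guess_utf_encoding(raw):
    """Guess a codec name from a leading byte order mark (hypothetical helper).

    raw is the beginning of the file, read in binary mode.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'      # this codec strips the UTF-8 BOM while decoding
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'         # this codec reads the BOM and picks the byte order
    return None                 # no BOM found: the data alone does not tell us

# Sample data built in memory instead of read from a real file.
sample = codecs.BOM_UTF16_LE + u'\xe4, \xf6'.encode('utf-16-le')
guessed = guess_utf_encoding(sample)
decoded = sample.decode(guessed)
```

If the helper returns None, the file has no BOM and the encoding has to
be determined some other way.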

Now you can encode the Unicode object contents as a plain 8-bit
string using some other encoding, e.g. Latin-1, and then
write it back to a text file:

    plaintext = contents.encode('latin-1')    # 8-bit string in Latin-1
    out = open(outfilename, 'wb')             # binary mode: bytes are written unchanged
    out.write(plaintext)
    out.close()
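
One thing to watch out for: Latin-1 only covers the first 256 Unicode
code points, so the .encode() call raises a UnicodeError if the text
contains anything outside that range. A small illustration with a made-up
sample string (the euro sign is my own example, not something taken from
your mail), using the optional errors argument of .encode():

```python
# Made-up sample text: the umlaut and sharp s fit into Latin-1,
# the euro sign (U+20AC) does not.
text = u'gr\xfc\xdfe \u20ac'

try:
    text.encode('latin-1')          # default error handling is 'strict'
    strict_ok = True
except UnicodeError:
    strict_ok = False               # strict mode refuses the euro sign

# 'replace' substitutes a question mark for each unencodable character,
# 'ignore' silently drops it.
replaced = text.encode('latin-1', 'replace')
ignored = text.encode('latin-1', 'ignore')
```

Which of the two is acceptable depends on whether losing a few
characters matters more than getting an exception.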
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/