UnicodeDecodeError help please?
Robert Kern
robert.kern at gmail.com
Fri Apr 7 12:50:43 EDT 2006
Robin Haswell wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of and I can't escape Unicode(En|De)codeError
> no matter what I try.
Have you read any of the documentation about Python's Unicode support? E.g.,
http://effbot.org/zone/unicode-objects.htm
> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> import sys
> >>> sys.getdefaultencoding()
>
> 'utf-8'
How did this happen? It's supposed to be 'ascii' and not user-settable.
> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
>
> ©
>
> >>> str = u"Apple"
> >>> print str
>
> Apple
>
> >>> str + char
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
>
> >>> a = str+char
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
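The values in htmlentitydefs.entitydefs are byte strings encoded in latin-1 (or
are numeric entities which you still have to parse). Adding one to a unicode
object makes Python decode it implicitly with the default encoding, and a lone
0xa9 byte is not valid UTF-8. So decode explicitly using the latin-1 codec
before concatenating.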
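
A minimal sketch of that, under some assumptions: it is written in Python 3
syntax so it can be run today, the bytes literal b"\xa9" stands in for what
htmlentitydefs.entitydefs["copy"] returns in Python 2, and the helper
entity_value_to_unicode is a hypothetical name, not something the module
provides. On Python 2 you would use unichr() where the sketch uses chr().

```python
import re

# Hypothetical helper, not part of htmlentitydefs: turn one entitydefs value
# (a byte string in Python 2) into a unicode string.
_NUMERIC_ENTITY = re.compile(br"&#(\d+);$")

def entity_value_to_unicode(raw):
    match = _NUMERIC_ENTITY.match(raw)
    if match:
        # Numeric character reference such as "&#8364;": parse the code point.
        return chr(int(match.group(1)))  # use unichr() on Python 2
    # Otherwise the value is a single latin-1 encoded byte.
    return raw.decode("latin-1")

# Stand-in for htmlentitydefs.entitydefs["copy"] as seen in the session above.
raw_copy = b"\xa9"

# 0xa9 on its own is not valid UTF-8, which is what the traceback reports.
try:
    raw_copy.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)

# Decoding explicitly with latin-1 gives a unicode string that concatenates
# cleanly with another unicode string.
char = entity_value_to_unicode(raw_copy)
combined = u"Apple" + char
print(repr(combined))
```

The point is only that the decode step is explicit and names the codec, so the
result no longer depends on whatever sys.getdefaultencoding() happens to be.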
--
Robert Kern
robert.kern at gmail.com
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco