UnicodeDecodeError help please?

Robert Kern robert.kern at gmail.com
Fri Apr 7 12:50:43 EDT 2006


Robin Haswell wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of and I can't escape Unicode(En|De)codeError
> no matter what I try.

Have you read any of the documentation about Python's Unicode support? E.g.,

  http://effbot.org/zone/unicode-objects.htm

> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
> 
> Python 2.4.1 (#2, May  6 2005, 11:22:24) 
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 
> >>> import sys
> >>> sys.getdefaultencoding()
> 
> 'utf-8'

How did this happen? The default encoding is supposed to be 'ascii' and not
user-settable.

> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> 
> ©
> 
> >>> str = u"Apple"
> >>> print str
> 
> Apple
> 
> >>> str + char
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
> 
> >>> a = str+char
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte

The values in htmlentitydefs.entitydefs are encoded in latin-1 (or are numeric
entities which you still have to parse). So decode using the latin-1 codec.
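
As a rough illustration, here is a minimal sketch of that decoding step. It is
written for Python 3, where the byte string has to be spelled out by hand, so
b"\xa9" and b"&#913;" stand in for the kind of values that
htmlentitydefs.entitydefs holds in Python 2. The helper name
decode_entity_value is made up for this example. In Python 2.4 itself, the
first case should amount to char.decode("latin-1").

```python
import re

# Matches a numeric character reference such as b"&#913;" (assumed form).
_NUMERIC_ENTITY = re.compile(rb"&#(\d+);\Z")


def decode_entity_value(raw):
    """Turn a Python 2 style entitydefs value (bytes) into text.

    Hypothetical helper: handles a latin-1 encoded byte or a decimal
    numeric entity, the two cases mentioned above.
    """
    match = _NUMERIC_ENTITY.match(raw)
    if match:
        # Numeric entity: the digits are the code point.
        return chr(int(match.group(1)))
    # Otherwise treat the bytes as latin-1 encoded text.
    return raw.decode("latin-1")


# The lone byte 0xa9 is not valid UTF-8, which is what the traceback reports.
try:
    b"\xa9".decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

copy_sign = decode_entity_value(b"\xa9")    # copyright sign
alpha = decode_entity_value(b"&#913;")      # Greek capital alpha
combined = "Apple" + copy_sign              # text + text, no implicit decode
```

Once the value has been decoded explicitly, the concatenation no longer relies
on the default encoding at all.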

-- 
Robert Kern
robert.kern at gmail.com

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



