unicode problem?

Chris Rebert crebert at ucsd.edu
Sun Oct 10 05:39:34 CEST 2010

On Sat, Oct 9, 2010 at 4:59 PM, Brian Blais <bblais at bryant.edu> wrote:
> This may be stemming from my complete ignorance of unicode, but when I do this (Python 2.6):
> s='\xc2\xa9 2008 \r\n'
> and I want the ascii version of it, ignoring any non-ascii chars, I thought I could do:
> s.encode('ascii','ignore')
> but it gives the error:
> In [20]:s.encode('ascii','ignore')
> ----------------------------------------------------------------------------
> UnicodeDecodeError                        Traceback (most recent call last)
> /Users/bblais/python/doit100810a.py in <module>()
> ----> 1
>      2
>      3
>      4
>      5
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
> am I doing something stupid here?

In addition to Benjamin's explanation:

Unicode strings in Python 2 are of type `unicode` and written with a
leading "u"; e.g. u"A unicode string for ¥500". Byte strings are of
type `str` and lack the leading "u"; e.g. "A plain byte string". Note
that "Unicode string" does not refer to strings which have been encoded
using a Unicode encoding (e.g. UTF-8); such strings are still byte
strings, because encoding always produces bytes.
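
To make the distinction concrete, here is a minimal sketch using the
bytes from your example. It assumes those bytes are UTF-8 (0xC2 0xA9 is
the UTF-8 encoding of the copyright sign), and the names `data`, `text`
and `round_trip` are just illustrative. The b'' prefix is accepted by
Python 2.6 and means the same thing as a plain '' literal there.

```python
data = b'\xc2\xa9 2008 \r\n'       # byte string: 10 bytes, the first 2 encode one character
text = data.decode('utf-8')        # unicode string: 9 characters, starting with u'\xa9'

# Encoding the unicode string yields a byte string again, never a "Unicode string".
round_trip = text.encode('utf-8')
```

Decoding goes from bytes to unicode; encoding goes from unicode to
bytes. The two lengths differ because one character occupied two bytes.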

As to why you got the /exact/ error you did:
As a backward compatibility hack, in order to satisfy your nonsensical
encoding request, Python implicitly tried to decode the byte string
`s` using ASCII as a default (the choice of ASCII here has nothing to
do with the fact that you specified ASCII in your encoding request),
so that it could then try to encode the resulting unicode string. That
implicit decode step is what failed, which is why you got a
Unicode*De*codeError rather than a Unicode*En*codeError, despite the
fact that you called *en*code().
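
The two steps can be spelled out explicitly. This is only a sketch of
roughly what happens behind the scenes, not the interpreter's actual
code path, and the UTF-8 in the last line is my assumption about where
your bytes came from. Note that the 'ignore' handler you passed applies
to the encode step only, so it cannot rescue the decode.

```python
s = b'\xc2\xa9 2008 \r\n'

# Roughly what s.encode('ascii', 'ignore') does on a Python 2 byte string:
# decode with the default codec (ASCII) first, then encode the result.
try:
    s.decode('ascii').encode('ascii', 'ignore')
    message = None
except UnicodeDecodeError as exc:
    # The decode step fails before 'ignore' ever comes into play.
    message = str(exc)

# Naming the real source encoding (assumed to be UTF-8 here) makes the
# request meaningful: decode to unicode, then drop what ASCII can't hold.
ascii_only = s.decode('utf-8').encode('ascii', 'ignore')
```

With that, `ascii_only` holds the original bytes minus the copyright
sign, which is what you were after.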

Highly suggested further reading:
"The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)"

