unicode problem?

Benjamin Kaplan benjamin.kaplan at case.edu
Sun Oct 10 05:19:43 CEST 2010

On Sat, Oct 9, 2010 at 7:59 PM, Brian Blais <bblais at bryant.edu> wrote:
> This may be stemming from my complete ignorance of unicode, but when I do this (Python 2.6):
> s='\xc2\xa9 2008 \r\n'
> and I want the ascii version of it, ignoring any non-ascii chars, I thought I could do:
> s.encode('ascii','ignore')
> but it gives the error:
> In [20]:s.encode('ascii','ignore')
> ----------------------------------------------------------------------------
> UnicodeDecodeError                        Traceback (most recent call last)
> /Users/bblais/python/doit100810a.py in <module>()
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
> am I doing something stupid here?
> of course, as a workaround, I can do: ''.join([c for c in s if ord(c)<128])
> but I thought the encode call should work.
>                thanks,
>                        bb

Encode takes a Unicode string (made up of code points) and turns it
into a byte string (a sequence of bytes). In your case, you don't have
a Unicode string. You have a byte string. In order to encode that
sequence of bytes into a different encoding, you first have to figure
out what those bytes mean (decode it). Python has no way of knowing
that your bytes are UTF-8, so in Python 2 it first tries to decode
them implicitly with the default codec, ascii. That implicit decode is
the step that fails, which is why you get a UnicodeDecodeError from a
call to encode.
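
To see where the error comes from, here is a small sketch that does
the failing step by hand. It assumes the bytes from your example and
spells out the ascii decode that Python 2 performs behind the scenes;
the variable names are just for illustration.

```python
# The bytes from the original post, written as a bytes literal
# (the b'' prefix is accepted by Python 2.6+ and Python 3).
s = b'\xc2\xa9 2008 \r\n'

message = None
try:
    # Python 2 does this decode implicitly before it can encode.
    s.decode('ascii')
except UnicodeDecodeError as exc:
    # 0xc2 is outside the ASCII range, so the decode fails at position 0.
    message = str(exc)

print(message)
```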

You can either decode the byte string explicitly or (if it's actually
a literal in your code) just specify it as a Unicode string.
s = u'\u00a9 2008'
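
For the explicit-decode route, something along these lines should
work. This is only a sketch: it assumes the bytes really are UTF-8,
which is my guess from the \xc2\xa9 sequence, and the names text and
ascii_only are made up for the example.

```python
# The original byte string (b'' prefix works on Python 2.6+ and 3.x).
s = b'\xc2\xa9 2008 \r\n'

# Step 1: tell Python what the bytes mean. UTF-8 is an assumption here.
text = s.decode('utf-8')

# Step 2: encode to ASCII, silently dropping characters that don't fit.
ascii_only = text.encode('ascii', 'ignore')

print(repr(ascii_only))
```

The copyright sign is dropped by the 'ignore' handler, and everything
else, including the leading space and the trailing '\r\n', is kept.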

The encode vs. decode confusion was removed in Python 3: byte strings
don't have an encode method and unicode strings don't have a decode
method.
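
A quick way to check this on Python 3 is sketched below. The variable
names are only illustrative, and on Python 2 both checks would come
out True instead, since both methods exist there.

```python
raw = b'\xc2\xa9 2008'        # a byte string
text = raw.decode('utf-8')    # the corresponding text string

# On Python 3, bytes can only be decoded and text can only be encoded.
bytes_can_encode = hasattr(raw, 'encode')
text_can_decode = hasattr(text, 'decode')

print(bytes_can_encode, text_can_decode)
```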

> --
> Brian Blais
> bblais at bryant.edu
> http://web.bryant.edu/~bblais
> http://bblais.blogspot.com/
> --
> http://mail.python.org/mailman/listinfo/python-list
