Need debugging knowhow for my creeping Unicodephobia

Jonathan Gardner jgardner at jonathangardner.net
Wed Feb 10 20:22:44 CET 2010


On Feb 10, 11:09 am, kj <no.em... at please.post> wrote:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>

You'll have to understand some terminology first.

"codec" is a description of how to encode and decode unicode data to a
stream of bytes.

"decode" means you are taking a series of bytes and converting it to
unicode.

"encode" is the opposite---take a unicode string and convert it to a
stream of bytes.

"ascii" is a codec that can only represent code points 0-127, using
bytes 0-127. "utf-8", "utf-16", etc. are other codecs. There are a lot
of them. Only some of them (e.g., utf-8, utf-16) can encode all of
unicode. Most (e.g., ascii) can only handle a subset of unicode.
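
To make the terminology concrete, here is a small sketch. The sample
text is made up for illustration; it is written so that it runs under
Python 2.6 as well as Python 3.3 or later.

```python
# A unicode string: COPYRIGHT SIGN (U+00A9) followed by " 2010".
# The sample value is hypothetical.
text = u'\xa9 2010'

# "encode": unicode -> bytes, using the utf-8 codec.
data = text.encode('utf-8')

# "decode": bytes -> unicode, using the same codec.
back = data.decode('utf-8')

# The ascii codec only covers 0-127, so it cannot encode U+00A9.
try:
    text.encode('ascii')
    ascii_encode_failed = False
except UnicodeEncodeError:
    ascii_encode_failed = True

print(repr(data))            # the utf-8 bytes start with 0xc2 0xa9
print(back == text)          # the round trip gives back the same text
print(ascii_encode_failed)   # True: ascii cannot represent U+00A9
```

Note that U+00A9 comes out as the two bytes 0xc2 0xa9 in utf-8, so a
leading 0xc2 byte is often a hint that the data is utf-8.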

In this case, you've fed a stream of bytes with 0xc2 (194) as the
first byte to the decoder. Since the decoder thinks it's working with
ascii, it doesn't know what to do with any byte above 127. There are
a number of ways to fix this:

(1) Feed it unicode instead, so it doesn't try to decode it.

(2) Tell it what encoding you are using, because it's obviously not
ascii.
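
A rough sketch of both fixes follows. The input bytes are invented,
and I'm assuming they are utf-8; your real data may use a different
encoding, so check where it came from before picking a codec.

```python
# Hypothetical input bytes. 0xc2 0xa0 is utf-8 for NO-BREAK SPACE,
# which turns up a lot in HTML. Assumption: the data really is utf-8.
raw = b'\xc2\xa0price'

# What the error message describes: decoding these bytes as ascii.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Fix (2): say which encoding the bytes are in.
text = raw.decode('utf-8')

# Fix (1): once everything is unicode, nothing has to be decoded.
label = u'item'
line = u'%s %s' % (label, text)

print(ascii_error)   # same kind of message as in the traceback
print(repr(line))
```

The usual habit is to decode bytes once, at the point where they
enter the program, and work with unicode everywhere after that.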

>
> FWIW, I'm using Python 2.6.  The example above happens to come from
> a script that extracts data from HTML files, which are all in
> English, but they are a daily occurrence when I write code to
> process non-English text.  The script uses Beautiful Soup.  I won't
> post a lot of code because, as I said, what I'm after is not so
> much a way around this specific error as much as the tools and
> techniques to troubleshoot it and fix it on my own.  But to ground
> the problem a bit I'll say that the exception above happens during
> the execution of a statement of the form:
>
>   x = '%s %s' % (y, z)
>
> Also, I found that, with the exact same values y and z as above,
> all of the following statements work perfectly fine:
>
>   x = '%s' % y
>   x = '%s' % z
>   print y
>   print z
>   print y, z
>

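What are y and z? Are they unicode or byte strings? What are their
values? Looking at type(y), repr(y), type(z), and repr(z) will tell
you.

It sounds like someone, probably Beautiful Soup, is handing you
unicode. In Python 2, when a format operation mixes unicode and byte
strings, the byte strings get decoded implicitly with the ascii
codec. That would explain why each value formats fine on its own but
the two together blow up. A full stacktrace would be useful to see
who did what where.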
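
As a troubleshooting aid, something like the helper below shows the
type and the raw contents of each value right before the statement
that fails. describe() is a name I made up, not part of any library,
and the sample values are invented.

```python
import traceback


def describe(name, value):
    """Return a one-line summary: variable name, type name, and repr."""
    return '%s: %s %r' % (name, type(value).__name__, value)


# Hypothetical values standing in for the real y and z.
y = u'caf\xe9'        # unicode
z = b'\xc2\xa0'       # raw bytes, not yet decoded

print(describe('y', y))
print(describe('z', z))

# To capture the full stacktrace as text, wrap the failing statement.
try:
    z.decode('ascii')
    trace = ''
except UnicodeDecodeError:
    trace = traceback.format_exc()

print(trace)
```

repr() is the important part: print can hide the difference between
unicode and byte strings, while repr() shows the escapes.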


