Need debugging knowhow for my creeping Unicodephobia
jgardner at jonathangardner.net
Wed Feb 10 20:22:44 CET 2010
On Feb 10, 11:09 am, kj <no.em... at please.post> wrote:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You'll have to understand some terminology first.
"codec" is a description of how to encode unicode data to a stream of
bytes, and how to decode it back again.
"decode" means you are taking a series of bytes and converting it to
unicode.
"encode" is the opposite: take a unicode string and convert it to a
stream of bytes.
"ascii" is a codec that can only describe code points 0-127, using
bytes 0-127.
"utf-8", "utf-16", etc. are other codecs. There are a lot of them.
Only some of them (e.g., utf-8, utf-16) can encode all of unicode. Most
(e.g., ascii) can only do a subset of unicode.
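To make the terminology concrete, here is a small sketch of a round trip
through a codec. The sample text is made up for illustration, and the
literals are written so the snippet should run on 2.6 as well as on newer
Pythons.

```python
# A unicode string; \xe9 is e-acute (code point 233, outside ascii's 0-127).
text = u'caf\xe9'

# encode: unicode -> bytes. utf-8 needs two bytes for \xe9.
utf8_bytes = text.encode('utf-8')
assert utf8_bytes == b'caf\xc3\xa9'

# decode: bytes -> unicode, using the same codec that produced the bytes.
assert utf8_bytes.decode('utf-8') == text

# ascii covers only a subset of unicode, so it cannot encode this string.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

print(ascii_error)
```

The point is that bytes only mean something once you know which codec
produced them.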
In this case, you've fed the decoder a stream of bytes that includes
0xc2 (194). Since the decoder thinks it's working with ascii, it doesn't
know what to do with any byte above 127. There are a number of ways
to fix this:
(1) Feed it unicode instead, so it doesn't try to decode it.
(2) Tell it what encoding you are using, because it's obviously not
ascii.
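A rough sketch of both fixes follows. The byte string and the assumption
that it is utf-8 are mine, not taken from your script; 0xc2 is a common
lead byte in utf-8, but you'd need to confirm the real encoding of your
HTML files.

```python
# Hypothetical input: raw bytes as they might come out of a file.
# Assumed to be utf-8 (0xc2 0xa0 is a no-break space in utf-8).
raw = b'\xc2\xa0price'

# What goes wrong: decoding those bytes as ascii.
try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Fix (2): name the real encoding when decoding.
text = raw.decode('utf-8')

# Fix (1): from here on, only combine unicode with unicode, so nothing
# has to be decoded behind your back.
label = u'%s %s' % (text, u'EUR')

print(message)
```

The usual pattern is to decode once, as early as possible, and to encode
again only when writing output.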
> FWIW, I'm using Python 2.6. The example above happens to come from
> a script that extracts data from HTML files, which are all in
> English, but they are a daily occurrence when I write code to
> process non-English text. The script uses Beautiful Soup. I won't
> post a lot of code because, as I said, what I'm after is not so
> much a way around this specific error as much as the tools and
> techniques to troubleshoot it and fix it on my own. But to ground
> the problem a bit I'll say that the exception above happens during
> the execution of a statement of the form:
> x = '%s %s' % (y, z)
> Also, I found that, with the exact same values y and z as above,
> all of the following statements work perfectly fine:
> x = '%s' % y
> x = '%s' % z
> print y
> print z
> print y, z
What are y and z? Are they unicode or strings? What are their values?
It sounds like someone, probably Beautiful Soup, is trying to turn
your strings into unicode. A full stack trace would be useful to see
who did what where.
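As a starting point for troubleshooting, something like the sketch below
shows the type and exact contents of each value and captures the stack
trace. The helper name describe and the sample values of y and z are made
up for illustration; your real values will differ. The explicit
decode('ascii') stands in for the implicit decode that Python 2 performs
when a byte string and a unicode string meet in one expression.

```python
import traceback


def describe(name, value):
    """Return a one-line summary of a value's type and exact contents."""
    # %r shows escapes such as \xc2 instead of trying to render them.
    return '%s: %s %r' % (name, type(value).__name__, value)


# Hypothetical values, chosen to mimic a byte string meeting unicode.
y = b'\xc2\xa0'
z = u'text'

print(describe('y', y))
print(describe('z', z))

# Capture the full stack trace instead of just the last line.
try:
    y.decode('ascii')
except UnicodeDecodeError:
    trace = traceback.format_exc()
    print(trace)
```

If one value turns out to be a byte string and the other unicode, that
would explain why each one formats fine alone but the two together fail.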