Need debugging knowhow for my creeping Unicodephobia
python at mrabarnett.plus.com
Wed Feb 10 21:05:55 CET 2010
> Some people have mathphobia. I'm developing a wicked case of
> Unicodephobia.
> I have read a *ton* of stuff on Unicode. It doesn't even seem all
> that hard. Or so I think. Then I start writing code, and WHAM:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
> (There, see? My Unicodephobia just went up a notch.)
> Here's the thing: I don't even know how to *begin* debugging errors
> like this. This is where I could use some help.
> In the past I've gone for the method of choice of the clueless:
> "programming by trial-and-error", try random crap until something
> "works." And if that "strategy" fails, I come begging for help to
> c.l.p. And thanks for the very effective pointers for getting rid
> of the errors.
> But afterwards I remain as clueless as ever... It's the old "give
> a man a fish" vs. "teach a man to fish" story.
> I need a systematic approach to troubleshooting and debugging these
> Unicode errors. I don't know what. Some tools maybe. Some useful
> modules or builtin commands. A diagnostic flowchart? I don't
> think that any more RTFM on Unicode is going to help (I've done it
> in spades), but if there's a particularly good write-up on Unicode
> debugging, please let me know.
> Any suggestions would be much appreciated.
> FWIW, I'm using Python 2.6. The example above happens to come from
> a script that extracts data from HTML files, which are all in
> English, but such errors are a daily occurrence when I write code to
> process non-English text. The script uses Beautiful Soup. I won't
> post a lot of code because, as I said, what I'm after is not so
> much a way around this specific error as much as the tools and
> techniques to troubleshoot it and fix it on my own. But to ground
> the problem a bit I'll say that the exception above happens during
> the execution of a statement of the form:
> x = '%s %s' % (y, z)
> Also, I found that, with the exact same values y and z as above,
> all of the following statements work perfectly fine:
> x = '%s' % y
> x = '%s' % z
> print y
> print z
> print y, z
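One way to start debugging a statement like the one quoted above is to look at the type and repr() of each operand. The sketch below is only a guess at the situation: it assumes y is already a unicode object and z is still a byte string holding UTF-8 data (both values are made up for illustration). In Python 2, mixing the two in one format operation makes Python decode the byte string with the ASCII codec behind the scenes, and doing that decode explicitly reproduces the same message.

```python
# Hypothetical values: y is decoded text, z is still raw UTF-8 bytes.
y = u'caf\xe9'
z = b'\xc2\xa0price'


def describe(value):
    """Return the type name and repr of a value, to spot mixed types."""
    return '%s %r' % (type(value).__name__, value)


print(describe(y))
print(describe(z))

# Python 2 decodes z with the ASCII codec when it is combined with a
# unicode object; decoding explicitly triggers the same failure.
try:
    z.decode('ascii')
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding with the data's real encoding (assumed to be UTF-8 here)
# before formatting avoids the implicit ASCII decode.
x = u'%s %s' % (y, z.decode('utf-8'))
```

If the two type names printed differ, the implicit conversion is the likely culprit, and the fix is to decode the byte string explicitly at the point where it entered the program.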
Decode all text input; encode all text output; do all text processing
in Unicode, which also means making all text literals Unicode (prefixed
with 'u').

Note: I'm talking about when you're working with _text_, as distinct
from when you're working with _binary data_, i.e. bytes.
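
As a rough illustration of that rule, here is a minimal sketch. The file name, the shout() helper and the choice of UTF-8 are all assumptions made for the example; in real code the encoding has to come from whatever produced the data.

```python
import os
import tempfile


def shout(text):
    """Processing step: takes Unicode text and returns Unicode text."""
    return text.upper() + u'!'


# Hypothetical input file containing the UTF-8 bytes for "naive" with
# a diaeresis on the i.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write(b'na\xc3\xafve')

# Input boundary: read bytes, then decode them to Unicode once.
with open(path, 'rb') as f:
    raw = f.read()
text = raw.decode('utf-8')

# Middle: everything here is Unicode, including the literal in shout().
result = shout(text)

# Output boundary: encode once, just before writing.
encoded = result.encode('utf-8')
with open(path, 'wb') as f:
    f.write(encoded)
```

Because the only decode and encode calls sit at the edges, a UnicodeDecodeError or UnicodeEncodeError points straight at the boundary where the wrong encoding was assumed.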