[Python-Dev] Relaxing Unicode error handling
Phillip J. Eby
pje at telecommunity.com
Sat Jan 3 10:51:51 EST 2004
At 12:44 PM 1/3/04 +0100, Martin v. Loewis wrote:
>People keep complaining that they get Unicode errors
>when funny characters start showing up in their inputs.
>
>In some cases, these people would apparently be happy
>if Python would just "keep going", IOW, they want to
>get moji-bake (garbled characters), instead of
>exceptions that abort their applications.
>
>I'd like to add a static property unicode.errorhandling,
>which defaults to "strict". Applications could set this
>to "replace" or "ignore", silencing the errors, and
>risking loss of data.
>
>What do you think?
When I've gotten UnicodeErrors, they pointed out an error in my programming
that needed to be fixed - i.e., that I forgot what kind of strings I was
dealing with, and needed to be explicit about it, or at least use the
replace/ignore option *at the point of decoding*. (Errors should not pass
silently, unless explicitly silenced.)
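
To illustrate handling the error at the point of decoding, here is a minimal sketch. It assumes present-day Python 3, where the undecoded data is a bytes object; the sample bytes and variable names are made up for illustration.

```python
# Bytes that are valid ISO-8859-1 ("caf" + e-acute) but not valid UTF-8.
raw = b"caf\xe9"

# Default "strict" handling: the mismatch surfaces as an exception.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Explicitly silenced, at the one place where decoding happens.
replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is dropped

# Being explicit about the actual encoding loses nothing.
correct = raw.decode("iso-8859-1")
```

The strict failure is the "educational" case: it shows that the data was not in the encoding the code assumed.
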
A global setting makes it possible to create code that relies on the
setting being one way or the other, and those pieces of code will not then
work together. (Only one obvious way to do it.)
Admittedly, my experience with using Unicode is very limited, dealing
primarily with the ISO-8859-x and Japanese language codecs, with decoding
fairly centralized. It's possible that there are use cases I'm unfamiliar
with that would scatter decode()'s all over the place, and that would make
adding the "ignore" parameter to each use unbearably tedious. OTOH, I
don't think that adding more stateful globals to Python is a good idea, and
what's the harm of having somebody write:
    def garble(s, codec):
        return s.decode(codec, 'ignore')
Or, if it's desired that this be available as part of Python, perhaps
adding 'decode_replace' and 'decode_ignore' staticmethods to the Unicode class?
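
A rough sketch of what such helpers could look like, written as plain functions rather than staticmethods and assuming Python 3 bytes input. The names decode_replace and decode_ignore come from the suggestion above and decode_with is a made-up helper; none of them exist in Python itself.

```python
from functools import partial


def decode_with(errors, data, codec):
    """Decode bytes using a fixed, explicitly chosen error handler."""
    return data.decode(codec, errors)


# Hypothetical convenience spellings; each one bakes in a single handler,
# so the choice stays visible at the call site instead of in a global.
decode_replace = partial(decode_with, "replace")
decode_ignore = partial(decode_with, "ignore")
```

For example, decode_ignore(b"caf\xe9", "ascii") drops the undecodable byte, while decode_replace substitutes U+FFFD for it.
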
Or, am I missing the point entirely, and there's some other circumstance
where one gets UnicodeErrors besides .decode()? If the use case is mixing
strings and unicode objects (i.e. adding, joining, searching, etc.), then
I'd have to say a big fat -1, as opposed to merely a -0 for having other
ways to spell .decode(codec,"ignore"). If I in my youth had seen such a
flag as you describe, I'd have used it, and then missed out on lots of very
educational error messages.