[Python-Dev] Relaxing Unicode error handling

Phillip J. Eby pje at telecommunity.com
Sat Jan 3 10:51:51 EST 2004


At 12:44 PM 1/3/04 +0100, Martin v. Loewis wrote:
>People keep complaining that they get Unicode errors
>when funny characters start showing up in their inputs.
>
>In some cases, these people would apparently be happy
>if Python would just "keep going", IOW, they want to
>get mojibake (garbled characters), instead of
>exceptions that abort their applications.
>
>I'd like to add a static property unicode.errorhandling,
>which defaults to "strict". Applications could set this
>to "replace" or "ignore", silencing the errors, and
>risking loss of data.
>
>What do you think?

When I've gotten UnicodeErrors, they have pointed out an error in my 
programming that needed to be fixed - i.e., that I forgot what kind of 
strings I was dealing with, and needed to be explicit about it, or at least 
use the replace/ignore option *at the point of decoding*.  (Errors should 
never pass silently, unless explicitly silenced.)
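
To make "at the point of decoding" concrete, here is a rough sketch.  The 
sample bytes are made up for illustration, and the b'' and u'' literals are 
used only so the snippet also runs on later Python versions:

```python
raw = b'caf\xe9'  # ISO-8859-1 bytes for "cafe" with an accented e (made-up sample)

# Strict decoding is the default: a byte the codec can't handle raises.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Silencing the error explicitly, right where the decoding happens.
replaced = raw.decode('ascii', 'replace')  # bad byte becomes U+FFFD
ignored = raw.decode('ascii', 'ignore')    # bad byte is dropped

# Naming the codec the data is really in loses nothing.
correct = raw.decode('iso-8859-1')
```

The lossy choice is visible on the very line that makes it, which is what a 
global flag would hide.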

A global setting makes it possible to create code that relies on the 
setting being one way or the other, and those pieces of code will not then 
work together.  (Only one obvious way to do it.)

Admittedly, my experience with using Unicode is very limited, dealing 
primarily with the ISO-8859-x and Japanese language codecs, with decoding 
fairly centralized.  It's possible that there are use cases I'm unfamiliar 
with that would scatter decode()'s all over the place, and that would make 
adding the "ignore" parameter to each use unbearably tedious.  OTOH, I 
don't think that adding more stateful globals to Python is a good idea, and 
what's the harm of having somebody write:

def garble(s, codec):
    # Decode, silently dropping any bytes the codec can't handle.
    return s.decode(codec, 'ignore')

Or, if it's desired that this be available as part of Python, perhaps 
adding 'decode_replace' and 'decode_ignore' staticmethods to the Unicode class?
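
As a sketch only, those two helpers could look something like the following 
if written as plain functions.  The names come from the suggestion above; 
the signatures are my assumption, and nothing like this exists in Python 
today:

```python
def decode_replace(data, codec):
    """Decode data, substituting U+FFFD for anything undecodable."""
    return data.decode(codec, 'replace')


def decode_ignore(data, codec):
    """Decode data, silently dropping anything undecodable."""
    return data.decode(codec, 'ignore')


# Made-up sample input: an ISO-8859-1 byte that ASCII can't decode.
sample = b'na\xefve'
with_marker = decode_replace(sample, 'ascii')
without_bad_byte = decode_ignore(sample, 'ascii')
```

Either way, the caller still has to pick the lossy behavior by name at each 
call site, rather than inheriting it from a process-wide setting.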

Or, am I missing the point entirely, and there's some other circumstance 
where one gets UnicodeErrors besides .decode()?  If the use case is mixing 
strings and unicode objects (i.e. adding, joining, searching, etc.), then 
I'd have to say a big fat -1, as opposed to merely a -0 for having other 
ways to spell .decode(codec,"ignore").  If I in my youth had seen such a 
flag as you describe, I'd have used it, and then missed out on lots of very 
educational error messages.
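
For what it's worth, one other place I know a UnicodeError can come from is 
.encode(), and mixing str with unicode objects decodes implicitly with the 
default codec, which fails the same way as an explicit decode.  A small 
sketch of the encode case, with a made-up sample value:

```python
text = u'caf\xe9'  # made-up sample containing one non-ASCII character

# Strict encoding raises when the target codec has no such character.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The same explicit opt-out is available per call when encoding.
lossy = text.encode('ascii', 'replace')  # unencodable character becomes '?'
```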



