[Python-ideas] Python 3000 TIOBE -3%

Wed Feb 15 03:43:41 CET 2012

Steven D'Aprano writes:
 > MRAB wrote:

 > >> encoding="ascii-ish"  # gets the sloppiness right

+0.8  I'd prefer the more precise "ascii-compatible".  Shift JIS is
"ASCII-ish", but should not be decoded with this codec.

 > > encoding="mojibake" # :-)
 > 
 > You have a smiley, but I think that's the best name I've seen yet. It's 
 > explicit in what you get -- mojibake.

Explicit, but incorrect.  Mojibake ("bake" means "change") is what you
get when you use one encoding to encode characters, and another to
decode them.  Here, not only are we talking about using the same codec
at both ends, but in fact it's inside out (we are decoding then
encoding).  This is GIGO, not mojibake.
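
To make the distinction concrete, here is a minimal sketch in Python 3.
The sample strings and variable names are only illustrative; the first
half shows mojibake in the sense above (encode with one codec, decode
with another), the second shows the decode-then-encode round trip under
discussion, where the undecodable byte goes in and comes back out
unchanged.

```python
# Mojibake: encode with one codec, decode with a different one.
text = "caf\u00e9"                       # 'café'
mojibake = text.encode("utf-8").decode("latin-1")
print(repr(mojibake))                    # the e-acute became two wrong characters

# The case under discussion: the same codec and the same error handler
# at both ends, decoding first and encoding afterwards.
raw = b"caf\xe9"                         # Latin-1 bytes, not valid ASCII
decoded = raw.decode("ascii", errors="surrogateescape")
round_tripped = decoded.encode("ascii", errors="surrogateescape")

print(repr(decoded))                     # the bad byte is carried as a lone surrogate
print(round_tripped == raw)              # garbage in, the same garbage out
```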

 > why not just teach them the very slightly more complex recipe
 > 
 >      open(filename, encoding='ascii', errors='surrogateescape')
 > 
 > which captures the user's intent ("I want ASCII, with some way of
 > escaping errors so I don't have to deal with them") much more
 > accurately.

Why not?  Because 'surrogateescape' does not express the user's
intent.  That user *will* have to deal with errors as soon as she
invokes modules that validate their input or that include some portion
of the text being treated in output of any kind, unless those modules
use an error-suppressing handler themselves.  Surrogates are errors in
Unicode, and that's the way it should be.  That's precisely why Martin
felt it necessary to use this technique in PEP 383: to ensure that
errors *will* occur unless you are very careful in handling strings
produced with the surrogateescape handler active.
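
A small sketch of what "errors *will* occur" looks like in practice,
assuming Python 3 and a made-up byte string.  The string decodes
without complaint, but any strict encoder downstream rejects the lone
surrogate; only code that uses the same handler gets the original
bytes back.

```python
raw = b"na\xefve"                        # one non-ASCII byte in otherwise ASCII data

# Decoding succeeds: the bad byte is smuggled through as a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")

# A strict encoder (the default everywhere else) refuses the surrogate.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Only code that opts in to the same handler recovers the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(strict_encode_failed)
print(restored == raw)
```

So the handler defers the error rather than removing it, which is the
point of the paragraph above.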

It's arguable that most applications *should* want errors in these
cases; I've made that argument myself.  But it's quite clearly not the
user's intent.