[Python-ideas] Python 3000 TIOBE -3%

Wed Feb 15 02:27:48 CET 2012

Nick Coghlan writes:

 > If you're only round-tripping (i.e. writing back out as
 > "ascii+surrogateescape")

This is the only case that makes sense in this thread.  We're talking
about people coming from Python 2 who want an encoding-agnostic way to
script ASCII-oriented operations for an ASCII-compatible environment,
and not to learn about encodings at all.

While my opinions on this are (probably obviously) informed by the
WSGI discussion, this is not about making life come up roses for the
WSGI folks.  They work in a sewer; life stinks for them, and all they
can do about it is to hold their noses.  This thread is about people
who are not trying to handle sewage in a sanitary fashion, but rather
just to cook a meal and ignore the occasional hairs that inevitably
fall in.

 > However, it's trivial to get an error when you go to encode the data
 > stream without one of the silencing error handlers set.

Sure, but getting errors is for people who want to learn how to do it
right, not for people who just need to get a job done.  Cf. the
fevered opposition to giving "import cElementTree" a DeprecationWarning.

 > In particular, sys.stdout has error handling set to strict, which I
 > believe is likely to throw UnicodeEncodeError if you try to feed a
 > string containing surrogate escaped bytes to an encoding that can't
 > handle them.

No, it should *always* throw a UnicodeEncodeError, because there are
*no* encodings that can handle them -- they're not characters, so they
can't be encoded.

 > (Of course, if sys.stdout.encoding is "UTF-8", then you're right,
 > those characters will just be displayed as gibberish,

No, they will raise UnicodeEncodeError; that's why surrogateescape was
invented, to work around the problem of what to do with bytes that the
programmer knows are meaningful to somebody, but do not represent
characters as far as Python can know:

wideload:~ 10:06$ python3.2
Python 3.2 (r32:88445, Mar 20 2011, 01:56:57) 
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xff\xff'.decode('utf-8', errors='surrogateescape')
>>> s.encode('utf-8',errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
>>> 
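
For contrast, here is a minimal sketch of the round-trip case, assuming
the same two undecodable bytes as in the session above.  The variable
names are only illustrative.  Encoding with the same handler restores
the original bytes; the strict handler raises.

```python
raw = b'\xff\xff'  # bytes that are not valid UTF-8

# Decode: each undecodable byte becomes a lone surrogate (U+DCFF here).
s = raw.decode('utf-8', errors='surrogateescape')
assert s == '\udcff\udcff'

# Round trip: the same handler turns the surrogates back into the bytes.
round_tripped = s.encode('utf-8', errors='surrogateescape')
assert round_tripped == raw

# Without the handler, encoding fails, as in the session above.
try:
    s.encode('utf-8', errors='strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

So the data survives only as long as every encode step along the way
uses the surrogateescape handler.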

The reason I advocate 'latin-1' (preferably under an appropriate
alias) is that you simply can't be sure that those surrogates won't be
passed to some module that decides to emit information about them
somewhere (e.g., in a warning or a log message) -- without the
protection of a "silencing error handler".  Bang-bang! Python's silver
hammer comes down upon your head!
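
A rough sketch of the 'latin-1' approach, under the assumption that the
input is bytes in some unknown ASCII-compatible encoding (the sample
bytes and names here are made up for illustration).  Every byte decodes
to a real character, so a strict encoder elsewhere has nothing to choke
on, although the non-ASCII part may come out as the wrong characters.

```python
raw = b'caf\xe9 \xff\xff'  # hypothetical input; actual encoding unknown

# latin-1 maps each byte 0x00-0xFF to the code point with the same
# number, so decoding never fails and never produces surrogates.
text = raw.decode('latin-1')

# ASCII-oriented operations work as usual on the ASCII part.
assert text.startswith('caf')

# Round trip: encoding as latin-1 restores the original bytes exactly.
assert text.encode('latin-1') == raw

# A module that emits the text through a strict UTF-8 encoder does not
# raise; the high bytes are just re-encoded as U+00E9 and U+00FF.
leaked = text.encode('utf-8', errors='strict')
```

Note that the round trip only holds if the text is not transformed into
characters outside U+0000-U+00FF along the way.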