Nick Coghlan writes:
> If you're only round-tripping (i.e. writing back out as "ascii+surrogateescape")
This is the only case that makes sense in this thread. We're talking about people coming from Python 2 who want an encoding-agnostic way to script ASCII-oriented operations for an ASCII-compatible environment, and not to learn about encodings at all. While my opinions on this are (probably obviously) informed by the WSGI discussion, this is not about making life come up roses for the WSGI folks. They work in a sewer; life stinks for them, and all they can do about it is to hold their noses. This thread is about people who are not trying to handle sewage in a sanitary fashion, but rather just to cook a meal and ignore the occasional hairs that inevitably fall in.
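For concreteness, here is a minimal sketch of that round-tripping case. The helper name normalize_header and the sample header line are hypothetical, made up for illustration; only the "ascii" codec and the "surrogateescape" error handler come from the discussion above.

```python
def normalize_header(line: bytes) -> bytes:
    """Title-case the field name of a 'name: value' line (hypothetical helper).

    Only the ASCII part is touched; non-ASCII bytes ride along as lone
    surrogates and are turned back into the same bytes on the way out.
    """
    # Non-ASCII bytes become surrogates U+DC80..U+DCFF instead of raising.
    text = line.decode("ascii", errors="surrogateescape")
    name, sep, value = text.partition(":")
    result = name.strip().title() + sep + value
    # The same error handler on output restores the original bytes.
    return result.encode("ascii", errors="surrogateescape")


raw = b"x-author: Andr\xe9\n"  # \xe9 is in some unknown ASCII-compatible encoding
fixed = normalize_header(raw)

# Every possible byte value survives an untouched round trip.
all_bytes = bytes(range(256))
unchanged = all_bytes.decode("ascii", errors="surrogateescape").encode(
    "ascii", errors="surrogateescape"
)
```

The script never needs to know what \xe9 means, which is the whole point for this audience.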
> However, it's trivial to get an error when you go to encode the data stream without one of the silencing error handlers set.
Sure, but getting errors is for people who want to learn how to do it right, not for people who just need to get a job done. Cf. the fevered opposition to giving "import cElementTree" a DeprecationWarning.
> In particular, sys.stdout has error handling set to strict, which I believe is likely to throw UnicodeEncodeError if you try to feed a string containing surrogate escaped bytes to an encoding that can't handle them.
No, it should *always* throw a UnicodeEncodeError, because there are *no* encodings that can handle them -- they're not characters, so they can't be encoded.
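A quick way to check this is to try a handful of codecs with the default strict handler. This is only a sketch: the codec list is an arbitrary sample chosen for illustration, not a proof about every codec, and the variable names are made up.

```python
# One escaped byte: b'\xff' decodes to the lone surrogate '\udcff'.
escaped = b"\xff".decode("ascii", errors="surrogateescape")

# Arbitrary sample of codecs (an assumption, not an exhaustive list).
sample_codecs = ("ascii", "latin-1", "utf-8", "cp1252")

failures = {}
for codec in sample_codecs:
    try:
        escaped.encode(codec)  # errors="strict" is the default
        failures[codec] = False
    except UnicodeEncodeError:
        failures[codec] = True

# Only the matching error handler turns the surrogate back into a byte.
restored = escaped.encode("utf-8", errors="surrogateescape")
```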
> (Of course, if sys.stdout.encoding is "UTF-8", then you're right, those characters will just be displayed as gibberish,
No, they will raise UnicodeEncodeError; that's why surrogateescape was invented, to work around the problem of what to do with bytes that the programmer knows are meaningful to somebody, but do not represent characters as far as Python can know:

wideload:~ 10:06$ python3.2
Python 3.2 (r32:88445, Mar 20 2011, 01:56:57)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xff\xff'.decode('utf-8', errors='surrogateescape')
>>> s.encode('utf-8',errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
The reason I advocate 'latin-1' (preferably under an appropriate alias) is that you simply can't be sure that those surrogates won't be passed to some module that decides to emit information about them somewhere (eg, a warning or logging) -- without the protection of a "silencing error handler". Bang-bang! Python's silver hammer comes down upon your head!
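To make the contrast concrete, here is a rough sketch. The function emit_strictly is a hypothetical stand-in for "some module that decides to emit information", modelled as a strict UTF-8 text stream; the sample bytes are invented.

```python
import io

raw = b"caf\xe9 \xff"  # bytes in some unknown ASCII-compatible encoding

as_latin1 = raw.decode("latin-1")  # every byte maps to U+0000..U+00FF
as_escaped = raw.decode("ascii", errors="surrogateescape")  # lone surrogates


def emit_strictly(text: str) -> bytes:
    """Write text to a strict UTF-8 stream, as a logging call might (hypothetical)."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", errors="strict")
    stream.write(text)
    stream.flush()
    return stream.buffer.getvalue()


# The latin-1 string always encodes, though the output may be mojibake.
latin1_output = emit_strictly(as_latin1)

# The surrogate-escaped string hits the strict handler and raises.
try:
    emit_strictly(as_escaped)
    escaped_failed = False
except UnicodeEncodeError:
    escaped_failed = True

# The latin-1 route still round-trips the original bytes.
roundtrip_ok = as_latin1.encode("latin-1") == raw
```

The price of the latin-1 route is that a third party may see wrong characters instead of an exception; whether that trade is acceptable is exactly what is being argued here.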