[Python-ideas] Processing surrogates in

Sat May 16 16:15:10 CEST 2015

Paul Moore writes:

 > The stream in this case was sys.stdout, which you can't blame me
 > for, though :-)

Yeah, I think there's an issue or two on that.

 > I don't know enough about the issues to make a good case that
 > errors='strict' is the wrong error handling policy for sys.stdout,
 > though.

No, errors='strict' is always the right default policy, especially for
UTF-encoded output, but for other encodings as well.

 > And you can't change the policy on an existing stream,

Hm.  I would not want the job of rewriting the codec machinery to
guarantee that users would get what they deserve from changing
encodings on a stream -- I suspect that would be hard, or even
impossible for a stateful encoding (e.g., a 7-bit ISO-2022 encoding).
But I can't really see where the harm would be in allowing changes of
the error handler.  (Of course that goes in the categories of "for
consenting adults" and "you can keep any bullets that lodge in your
foot".)  I'll have to think hard about it.
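
For readers coming to this thread later: Python 3.7 added
io.TextIOWrapper.reconfigure(), which allows the error handler of an
existing text stream to be changed.  The sketch below is only an
illustration of that idea, not anything proposed in this message.  It
uses an in-memory io.BytesIO as a stand-in for sys.stdout.buffer, and
the function name and sample text are made up for the example.

```python
import io


def demo_change_error_handler():
    """Show strict failing, then the same stream accepting the text
    after its error handler is switched (requires Python 3.7+)."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
    stream = io.TextIOWrapper(raw, encoding="utf-8", errors="strict",
                              newline="\n")
    text = "name: \udcff\n"  # contains a lone surrogate

    try:
        stream.write(text)
        stream.flush()
        strict_failed = False
    except UnicodeEncodeError:
        # UTF-8 with errors='strict' refuses to encode a lone surrogate.
        strict_failed = True

    # Keep the encoding, change only the error handler.
    stream.reconfigure(errors="backslashreplace")
    stream.write(text)
    stream.flush()

    data = raw.getvalue()
    stream.detach()  # do not close the underlying buffer with the wrapper
    return strict_failed, data


strict_failed, data = demo_change_error_handler()
print(strict_failed, data)
```

On a real program the equivalent call would be
sys.stdout.reconfigure(errors="backslashreplace"), assuming sys.stdout
is still a TextIOWrapper and has not been replaced by something else.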

 > so the application is stuck with strict unless it wants to re-wrap
 > sys.stdout.buffer (which I'm always a little reluctant to do, as it
 > seems like it may cause other issues, although I don't know why I
 > think that :-)).

In your case, I don't see why it would cause a problem unless there's
other output potentially incompatible with the sys.stdout encoding
that *you* *do* want errors on.  I can imagine there exist cases where
you have something like log output where you *know* that the logger
produces 30 columns of ASCII and then up to 45 columns copied from its
input, and only the first 30 "really need" to be accurate and valid in
the output encoding.  (I don't actually have such a case to hand,
though -- I've never seen a logger that randomly inserted Japanese in
timestamps or something like that.)
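
A rough sketch of both ideas discussed above, under stated
assumptions: rewrap() wraps a binary buffer in a new text layer with a
lenient error handler, and encode_log_line() applies 'strict' to the
first 30 columns of a line and a lenient handler to the rest.  Both
function names, the column count as a parameter, and the sample data
are hypothetical, and io.BytesIO stands in for sys.stdout.buffer.  The
per-line split assumes a stateless encoding such as ASCII or UTF-8.

```python
import io


def rewrap(binary_stream, encoding, errors="backslashreplace"):
    """Return a new text layer over an existing binary stream."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors=errors, newline="\n",
                            line_buffering=True)


def encode_log_line(line, encoding, strict_cols=30,
                    lenient_errors="backslashreplace"):
    """Encode the first strict_cols characters strictly and the
    remainder leniently (stateless encodings only)."""
    head, tail = line[:strict_cols], line[strict_cols:]
    return (head.encode(encoding, "strict")
            + tail.encode(encoding, lenient_errors))


# Re-wrapping: unencodable characters are escaped instead of raising.
buf = io.BytesIO()  # stand-in for sys.stdout.buffer (assumption)
out = rewrap(buf, "ascii")
out.write("caf\xe9 \udc80\n")
out.flush()
rewrapped = buf.getvalue()
out.detach()  # otherwise closing the wrapper also closes the buffer

# Mixed policy: a 30-column ASCII prefix, then text copied from input.
prefix = "2015-05-16 16:15:10 INFO".ljust(30)
encoded = encode_log_line(prefix + "caf\xe9", "ascii")
print(rewrapped, encoded)
```

One caveat with re-wrapping a real sys.stdout.buffer is that the new
wrapper closes the underlying buffer when it is closed or garbage
collected, unless detach() is called first, which may be one source of
the unease mentioned above.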