[Python-ideas] Processing surrogates in

Nick Coghlan ncoghlan at gmail.com
Sat May 16 16:44:52 CEST 2015


On 16 May 2015 at 20:19, Paul Moore <p.f.moore at gmail.com> wrote:
> On 16 May 2015 at 04:56, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>> That's not the use case envisioned for these functions, though.  You
>> want to change the textual content of the stream (by restricting the
>> repertoire), not change the representation of non-textual content.
>
> Thanks. I see the difference now. (Plus Nick's point about needing to
> know the encoding in my use case).
>
>>  > The encode/decode pair seemed ugly, although it was the only way I
>>  > could find.
>>
>> I find the fact that there's an output stream with an inappropriate
>> error handler far uglier!
>
> The stream in this case was sys.stdout, which you can't blame me for, though :-)
>
> The use case in question was specifically wanting to avoid encoding
> errors when printing arbitrary text. (On Windows, where
> sys.stdout.encoding is not UTF-8). This is a pretty common issue that
> I see raised a lot, and it is frustrating to have to deal with it in
> application code. I don't know enough about the issues to make a good
> case that errors='strict' is the wrong error handling policy for
> sys.stdout, though. And you can't change the policy on an existing
> stream, so the application is stuck with strict unless it wants to
> re-wrap sys.stdout.buffer (which I'm always a little reluctant to do,
> as it seems like it may cause other issues, although I don't know why
> I think that :-)).

It has the potential to cause problems if anything still has a
reference to the old stream (such as, say, sys.__stdout__, or an
eagerly bound reference in a default argument value). If you call
detach(), the old references will be entirely broken; if you don't,
you have two different text wrappers sharing the same underlying
buffered stream. Creating a completely new IO stream that only shares
the operating system level file descriptor has similar data
interleaving problems to the latter approach.
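
As a rough illustration of the two failure modes described above, here is
a minimal sketch that uses an io.BytesIO object as a stand-in for
sys.stdout.buffer, so the real standard streams are left alone. The
function names and the "ascii" encoding are assumptions made for the
example. The reordering in the second function relies on the wrapper's
own pending write buffer, as in CPython's io module.

```python
import io


def rewrap_with_detach():
    """Replace a wrapper via detach(); older references stop working."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer
    old = io.TextIOWrapper(raw, encoding="ascii", errors="strict",
                           newline="\n")
    stale = old  # plays the role of sys.__stdout__ or an early-bound default

    new = io.TextIOWrapper(old.detach(), encoding="ascii",
                           errors="backslashreplace", newline="\n")
    new.write("caf\xe9\n")  # would raise UnicodeEncodeError under 'strict'
    new.flush()

    try:
        stale.write("hello\n")
        stale_usable = True
    except ValueError:
        # A detached wrapper refuses any further I/O.
        stale_usable = False

    written = raw.getvalue()
    new.detach()  # keep raw open instead of closing it with the wrapper
    return written, stale_usable


def rewrap_without_detach():
    """Two wrappers over one buffer; output order follows flush order."""
    raw = io.BytesIO()
    old = io.TextIOWrapper(raw, encoding="ascii", errors="strict",
                           newline="\n")
    new = io.TextIOWrapper(raw, encoding="ascii", errors="backslashreplace",
                           newline="\n")

    old.write("first ")   # held in old's pending buffer until flushed
    new.write("second ")
    new.flush()
    old.flush()

    written = raw.getvalue()
    old.detach()
    new.detach()
    return written
```

In the second function the text written first reaches the shared buffer
last, which is the kind of interleaving problem that makes re-wrapping
the standard streams risky.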

There's an open issue to support changing the encoding and error
handling of an existing stream in place, which I'd suggested deferring
to 3.6 based on the fact we're switching the *nix streams to use
surrogateescape if the system claims the locale encoding is ASCII:
http://bugs.python.org/issue15216#msg242942
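
For readers on later versions: an in-place capability of this kind is
available as TextIOWrapper.reconfigure(), added in Python 3.7, which is
later than this message. The sketch below assumes Python 3.7 or newer
and again uses io.BytesIO in place of a real standard stream. The
function name is hypothetical.

```python
import io


def change_error_handler_in_place():
    """Switch the error handler without creating a second wrapper."""
    raw = io.BytesIO()  # stand-in for sys.stdout.buffer
    stream = io.TextIOWrapper(raw, encoding="ascii", errors="strict",
                              newline="\n")
    alias = stream  # an older reference, e.g. one bound as a default argument

    # Requires Python 3.7+; the wrapper object itself is unchanged.
    stream.reconfigure(errors="backslashreplace")

    # The older reference sees the new policy because it is the same object.
    alias.write("snowman: \u2603")
    alias.flush()

    written = raw.getvalue()
    stream.detach()  # keep raw open instead of closing it with the wrapper
    return written
```

Because no second wrapper is created, existing references keep working
and there is only one pending write buffer in play.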

However, if the lack of that capability is causing problems on Windows
as well, then it may be worth updating Nikolaus Rath's patch and
applying it for 3.5 and dealing with the consequences. The main reason
I've personally been wary of the change is because I expect there to
be various edge cases encountered with different codecs, so I suspect
that adding this feature will be setting the stage for an
"interesting" collection of future bug reports. On the other hand,
there are certain kinds of programs (like an iconv equivalent) that
could most readily be implemented by being able to change the encoding
of the standard streams based on application level configuration
settings, which means having a way to override the default settings
chosen by the interpreter.
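
To make the iconv-style use case concrete, here is a minimal sketch of a
recoding filter built by wrapping binary streams with explicitly chosen
encodings. The function name recode and its parameters are hypothetical.
A real tool would pass sys.stdin.buffer and sys.stdout.buffer and take
the encodings from its own configuration, which is exactly where the
re-wrapping caveats above apply.

```python
import io


def recode(src, dst, from_encoding, to_encoding, errors="strict"):
    """Copy text from binary stream src to binary stream dst, recoding it."""
    # newline="" disables newline translation in both directions.
    reader = io.TextIOWrapper(src, encoding=from_encoding, errors=errors,
                              newline="")
    writer = io.TextIOWrapper(dst, encoding=to_encoding, errors=errors,
                              newline="")
    try:
        # Read in chunks until read() returns the empty string at EOF.
        for chunk in iter(lambda: reader.read(8192), ""):
            writer.write(chunk)
        writer.flush()
    finally:
        # Hand the binary streams back without closing them.
        reader.detach()
        writer.detach()


if __name__ == "__main__":
    source = io.BytesIO("caf\xe9\r\n".encode("latin-1"))
    target = io.BytesIO()
    recode(source, target, "latin-1", "utf-8")
    print(target.getvalue())
```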

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

