[Python-ideas] changing sys.stdout encoding

Sat Jun 9 05:57:01 CEST 2012

On 06/07/2012 07:01 PM, Nick Coghlan wrote:
> On Fri, Jun 8, 2012 at 10:14 AM, Rurpy <rurpy-/E1597aS9LQAvxtiuMwx3w at public.gmane.org> wrote:
>> On 06/07/2012 03:45 PM, Nick Coghlan wrote:
>>> If user level code doesn't want those streams, it needs to
>>> replace them with something else.
>>
>> Yes, this is what the code I googled up does:
>>  import codecs
>>  import sys
>>  sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
>> But that code is not obvious to someone who has been able to do
>> all his encoded IO (with the exception of sys.stdout) using just
>> the encoding parameter of open().  Hence my question if
>> something like a set_encoding() method/function that would work
>> on sys.stdout is feasible.  I don't see an answer to that in your
>> statement above.

First, thanks for the detailed response.

> Right, I was only trying to explain why the standard streams are a
> special case - because they're also used by the interpreter, and it
> makes the startup process much simpler if the interpreter retains
> complete control over the way they're initialised (it's already
> complicated by the fact we need to get something half-usable in place
> as sys.stderr so that error reporting is possible while initialising
> them properly). It then becomes an application level operation to
> replace them if desired.

OK, I can see that as a use-case design principle.  I still
don't see any hard technical reason why the same streams could
not be kept, simply allowing their encodings to be reset if
they haven't been used yet.  In other words, does that
principle provide sufficient value to compensate for ruling
out several possible solutions based on modifying the current
stream rather than rewrapping it?
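
For comparison, here is a rough sketch of what the rewrapping route
looks like when wrapped in a helper.  The name set_encoding() and its
signature are hypothetical (nothing like it exists in the stdlib); it
only combines io.TextIOWrapper and detach(), and it returns a new
object rather than modifying the stream in place:

```python
import io


def set_encoding(stream, encoding, errors=None):
    """Hypothetical helper: return a new text stream over the same
    binary buffer as `stream`, using a different encoding.

    `stream` must be an io.TextIOWrapper (as sys.stdout normally is).
    """
    # Keep the old wrapper's settings unless the caller overrides them.
    errors = stream.errors if errors is None else errors
    line_buffering = stream.line_buffering
    # Push out anything already written in the old encoding.
    stream.flush()
    # detach() hands back the underlying binary buffer; the old
    # wrapper is unusable afterwards.
    raw = stream.detach()
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors,
                            line_buffering=line_buffering)


def demo():
    """Rewrap an in-memory stream and return the bytes written."""
    raw = io.BytesIO()
    old = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
    new = set_encoding(old, "shift_jis")
    new.write("日本語")
    new.flush()
    return raw.getvalue(), new.errors
```

Usage would be something like sys.stdout = set_encoding(sys.stdout,
opts.encoding).  One cost of rewrapping is visible here: any other
reference to the old wrapper (sys.__stdout__, for instance) is left
pointing at a detached object.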

> We can (and do) make the internal standard stream initialisation
> configurable, but it then becomes a UI design problem to get something
> that balances flexibility against complexity. PYTHONIOENCODING (in
> association with OS utilities that make it possible to set an
> environment variable for a specific process invocation, as well as
> support in the subprocess module for passing a tailored environment to
> subprocesses) is our current solution.
> 
> The interpreter design aims, first and foremost, to provide a simple
> and straightforward experience in POSIX environments that use UTF-8
> everywhere (since that's the most sane approach available for
> migrating from a previously ASCII-based computing world). Windows is a
> bit trickier (due to the internal use of UTF-16 APIs and the lack of
> POSIX-style support for temporarily setting an environment variable
> when invoking a process from the shell), but correctly supporting that
> environment is also a very high priority. The fallback behaviours when
> these situations do not apply are designed to work best on systems
> that are, at least somewhat *locally* consistent.

But networks, shared file systems, email, etc. have all
blurred the concept of localness.  Just because I am running
my program on a Unix machine does not mean I will never need
to write files with '\r\n' line endings.

Perhaps another way to view it is that Python is wrongly 
subsuming part of the problem space into the system space.  
The need to read or write disparate encodings is a function 
of the problem being addressed (which includes how problem
data is encoded just as much as whether it is formatted as 
CSV or as labeled name-value pairs); it's not really a 
function of my local system environment.
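
For explicitly opened files this is already easy to express: both the
encoding and the line terminator are chosen per file, independent of
the machine the program runs on.  A small sketch (the file name and
sample data are made up for illustration):

```python
import os
import tempfile


def write_text(path, lines, encoding, newline):
    """Write `lines` using an explicit encoding and line terminator."""
    # With newline="\r\n" the text layer translates every "\n" written
    # into "\r\n", whatever the local platform convention is.
    with open(path, "w", encoding=encoding, newline=newline) as f:
        for line in lines:
            f.write(line + "\n")


def demo():
    """Write one Shift-JIS line with CRLF endings; return the raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.csv")  # illustrative file name
        write_text(path, ["名前,値"], encoding="shift_jis", newline="\r\n")
        with open(path, "rb") as f:
            return f.read()
```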

> The real world is complex. Eventually, our answer has to be "handle it
> at the application level, there are too many variations for us to
> support it directly at the interpreter level". Currently, any standard
> stream encoding related problem that can't be handled with
> PYTHONIOENCODING is just such a situation. We know it sucks for
> multi-encoding environments, but those are a nightmare for a lot of
> reasons and are the main drivers behind the industry-wide effort to
> standardise on Unicode text handling, including universal encodings
> like UTF-8.

I think "nightmare" is a little too strong.  PITA maybe, 
particularly before one's gotten tools and environment 
worked out.  Eventually one can get used to seeing Windows
path separators displayed as yen signs in cmd.exe windows. :-) 
I think of it as just another annoyance imposed by the 
real world -- like making sure backups run exactly once 
a night even across DST changes.

> So now we're down to the question of how much complexity we're willing
> to tolerate in the interpreter specifically for the sake of
> environments where:
> 1. The automatic standard stream encoding calculation gives the wrong answer
> 2. The PYTHONIOENCODING override is insufficient
> 3. The application being executed isn't already handling the problem
> 4. A -m executable helper module (or directly executable helper
> script) can't be used to initialise the standard streams correctly
> before continuing on to execute the requested application via the
> runpy module

In all of the options you give above (except 3, and maybe 4;
I use -m only for pdb), there seems to be an implicit
assumption that there is a single encoding that needs to be
determined.

But that is wrong.  There are three streams and each
of those streams may need a different encoding.  Python
gets this in the case of explicitly opened files... no
one would dream of having a sys.encoding setting replace
the open(encoding=...) parameter.  What Python is missing
is that the same applies to stdin, stdout and stderr.

PYTHONIOENCODING is fine for what it is; it is just not
meant for my particular issue.
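
To illustrate the single-encoding point: a rough sketch that launches
a child interpreter with PYTHONIOENCODING set and asks it to report
the encodings of its three standard streams.  The helper name and the
child one-liner are made up for this example; only the environment
variable and the subprocess calls are real.

```python
import os
import subprocess
import sys

# One-liner run in the child: report the three stream encodings.
CHILD = ("import sys; "
         "print(sys.stdin.encoding, sys.stdout.encoding, sys.stderr.encoding)")


def child_stream_encodings(io_encoding):
    """Run a child Python with PYTHONIOENCODING=io_encoding and return
    the encoding names it reports for stdin, stdout and stderr."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdin=subprocess.DEVNULL,   # make sure the child has a stdin
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # The child's output is itself encoded with io_encoding.
    return result.stdout.decode(io_encoding).split()
```

Calling child_stream_encodings("shift_jis") should report the same
encoding three times: the variable offers no way to give stdin,
stdout and stderr different encodings.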

My proposal was simply to allow your option (3) to address 
this.  (Or more accurately, that it address this on a near 
equal footing to explicitly opened streams for reasons of 
both ease of use and Python API consistency.)

> And the answer is "not much". About the only likely way forward I can
> see for streamlining this situation would be to treat this as another
> use case for http://bugs.python.org/issue14803, which proposes the
> ability to run snippets of Python code prior to execution of __main__.

That (IIUC) would not be workable for my problem.

  ./myprog.py -e sjis,sjis [other options...]

is acceptable.  Something like:

  python -C 'sys.stdin=...; sys.stdout=...' myprog.py [other options...]

would not be.  And since you mentioned it above, nor would:

  python -m setstdin_sjis -m setstdout_sjis myprog.py [other options...]
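
For concreteness, a sketch of how the first form might be handled
inside the program today.  The reading of "-e sjis,sjis" as
"stdin encoding, stdout encoding" is an assumption, as are the
function names; the program still has to rewrap the streams itself,
which is the part a set_encoding()-style API would hide:

```python
import argparse
import io
import sys


def parse_encodings(argv):
    """Parse a hypothetical -e IN,OUT option.

    Returns (stdin_encoding, stdout_encoding, remaining_args).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--encoding", default="utf-8,utf-8",
                        metavar="IN,OUT")
    opts, rest = parser.parse_known_args(argv)
    enc_in, _, enc_out = opts.encoding.partition(",")
    # "-e sjis" alone is taken to mean the same encoding for both.
    return enc_in, (enc_out or enc_in), rest


def wrap_streams(raw_in, raw_out, enc_in, enc_out):
    """Wrap two binary streams as text streams with separate encodings."""
    text_in = io.TextIOWrapper(raw_in, encoding=enc_in)
    text_out = io.TextIOWrapper(raw_out, encoding=enc_out,
                                line_buffering=True)
    return text_in, text_out


def install_std_streams(argv):
    """Replace sys.stdin/sys.stdout according to -e; return other args."""
    enc_in, enc_out, rest = parse_encodings(argv)
    sys.stdin, sys.stdout = wrap_streams(sys.stdin.buffer,
                                         sys.stdout.buffer,
                                         enc_in, enc_out)
    return rest
```

A program would call install_std_streams(sys.argv[1:]) before its
first read or write.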

> I do agree that "create a new IO object that is like this old IO
> object but with these settings changed" could probably do with a
> better official API, but such an API needs to be designed with a
> respect for the issues associated with changing encodings "on the fly"
> and ask serious questions about whether or not we should be
> encouraging that practice by making it easier than it is already. I
> thought I had posted a tracker issue to that effect, but I can't find
> it now.

I think that being unable to easily change stream encodings
before first use is orders of magnitude more important than
being unable to change them on the fly.  I mentioned the
latter only because I thought it might fall out naturally
from fixing the first problem, and might occasionally be 
useful.  (I mentioned a couple cases I've encountered but 
even I, who am very much in favor of generality, have to
admit I think the uses are rare.)

I acknowledge though that even a before-first-use API (which
I think could be implemented before an on-the-fly one) would
have to take the possible later existence of the latter into 
account.
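
As a toy illustration of the before-first-use rule (not a proposal
for the real io classes; the class and method names are hypothetical),
a writer could track whether it has been written to and refuse
encoding changes after that point:

```python
import codecs
import io


class ResettableTextWriter:
    """Hypothetical text writer whose encoding may be changed only
    before the first write."""

    def __init__(self, buffer, encoding):
        self._buffer = buffer
        self._encoding = encoding
        self._used = False

    @property
    def encoding(self):
        return self._encoding

    def set_encoding(self, encoding):
        if self._used:
            raise ValueError("stream already used; encoding is fixed")
        codecs.lookup(encoding)  # raises LookupError for unknown names
        self._encoding = encoding

    def write(self, text):
        self._used = True
        self._buffer.write(text.encode(self._encoding))
        return len(text)

    def flush(self):
        self._buffer.flush()


def demo():
    """Change the encoding before first use, then try again after."""
    raw = io.BytesIO()
    out = ResettableTextWriter(raw, "utf-8")
    out.set_encoding("shift_jis")      # allowed: nothing written yet
    out.write("日本語")
    try:
        out.set_encoding("utf-8")      # refused: stream has been used
        refused = False
    except ValueError:
        refused = True
    return raw.getvalue(), refused
```

An on-the-fly API would replace the ValueError branch with whatever
flushing and encoder/decoder state handling turns out to be needed,
which is why the simpler rule would have to leave room for it.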