Re: [Python-ideas] changing sys.stdout encoding

On 06/07/2012 03:45 PM, Nick Coghlan wrote:
The interpreter uses the standard streams internally, and they're one of the first things created during interpreter startup. User provided code doesn't start running until well after they're initialised.
In other words, the stream objects referenced by sys.std* are opened before the user code runs?
But if there are no operations on those streams until my user code runs, they are still in the same state they were after they were initialized, yes?
So if one wanted to provide an "only before first use" set_encoding() function, why couldn't that function reexecute the codecs part of the initialization code a second time? Of course there would need to be some sort of flag that it could use to verify the stream was still in its initial state.
If user level code doesn't want those streams, it needs to replace them with something else.
Yes, this is what the code I googled up does:
    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
But that code is not obvious to someone who has been able to do all his encoded IO (with the exception of sys.stdout) using just the encoding parameter of open(). Hence my question if something like a set_encoding() method/function that would work on sys.stdout is feasible. I don't see an answer to that in your statement above.
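For readers following along, here is a minimal sketch of what the quoted recipe does. It runs against an in-memory stand-in for sys.stdout.buffer so it can be tried anywhere; the helper name rewrap and the sample text are assumptions made for illustration, not part of the thread.

```python
import codecs
import io


def rewrap(binary_stream, encoding):
    """Return a text writer that encodes into binary_stream using encoding."""
    # codecs.getwriter() returns a StreamWriter class for the codec;
    # calling it wraps the given binary stream.
    return codecs.getwriter(encoding)(binary_stream)


# Stand-in for sys.stdout.buffer (an assumption made for the demo).
raw = io.BytesIO()
out = rewrap(raw, "shift_jis")
out.write("日本語\n")
encoded = raw.getvalue()  # the bytes that would have reached the terminal
```

In a real program the equivalent line would be sys.stdout = rewrap(sys.stdout.buffer, opts.encoding), which is the code quoted above.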

On Thu, Jun 7, 2012 at 5:14 PM, Rurpy <rurpy@yahoo.com> wrote:
On 06/07/2012 03:45 PM, Nick Coghlan wrote:
The interpreter uses the standard streams internally, and they're one of the first things created during interpreter startup. User provided code doesn't start running until well after they're initialised.
In other words, the stream objects referenced by sys.std* are opened before the user code runs?
But if there are no operations on those streams until my user code runs, they are still in the same state they were after they were initialized, yes?
So if one wanted to provide an "only before first use" set_encoding() function, why couldn't that function reexecute the codecs part of the initialization code a second time? Of course there would need to be some sort of flag that it could use to verify the stream was still in its initial state.
If user level code doesn't want those streams, it needs to replace them with something else.
Yes, this is what the code I googled up does:
    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
What if codecs contained convenience methods for stdin and stdout? I.e. the above could be written more simply as
    import codecs
    codecs.encode_stdout(opts.encoding)
This is much more memorable than the current option, and would also make life easier when working with fileinput (whose openhook argument can be set to control encoding of input *file* streams, but when it falls back to stdin this preference is ignored).
But that code is not obvious to someone who has been able to do all his encoded IO (with the exception of sys.stdout) using just the encoding parameter of open(). Hence my question if something like a set_encoding() method/function that would work on sys.stdout is feasible. I don't see an answer to that in your statement above.

On 06/07/2012 06:59 PM, Nathan Schneider wrote:
On Thu, Jun 7, 2012 at 5:14 PM, Rurpy <rurpy@yahoo.com> wrote:
On 06/07/2012 03:45 PM, Nick Coghlan wrote: [...]
If user level code doesn't want those streams, it needs to replace them with something else.
Yes, this is what the code I googled up does:
    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
What if codecs contained convenience methods for stdin and stdout? I.e. the above could be written more simply as
    import codecs
    codecs.encode_stdout(opts.encoding)
This is much more memorable than the current option, and would also make life easier when working with fileinput (whose openhook argument can be set to control encoding of input *file* streams, but when it falls back to stdin this preference is ignored).
How ironic. In Python 2 I hated having to import codecs and use codecs.open() (the only thing I ever used from the codecs module) rather than just having an encoding parameter on open(). But it seems like that might be a reasonable thing to do. I'm sure there will be opinions. :-) It's not just sys.stdout though; the same issue exists with sys.stdin and sys.stderr, so one might want either three functions or one function that takes the stream as a parameter.
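The "one function that takes the stream as a parameter" idea could look roughly like the sketch below. The name reencode_stream is hypothetical (no such function exists in codecs or io); it is built on io.TextIOWrapper and its detach() method, which do exist. Note the assumption that the caller is happy for the old wrapper to become unusable after the call.

```python
import io


def reencode_stream(stream, encoding, errors="strict"):
    """Hypothetical helper: return a new text wrapper over stream's buffer.

    The old wrapper is detached, so it (and any other reference to it,
    such as sys.__stdout__) can no longer be used afterwards.
    """
    if stream.writable():
        stream.flush()  # push out anything written under the old encoding
    line_buffering = stream.line_buffering
    binary = stream.detach()  # take the buffer without closing it
    return io.TextIOWrapper(binary, encoding=encoding, errors=errors,
                            line_buffering=line_buffering)


# Demo on an in-memory stream standing in for sys.stdout.
demo = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
demo = reencode_stream(demo, "shift_jis")
demo.write("日本語")
demo.flush()
demo_bytes = demo.buffer.getvalue()
```

With this sketch the three standard streams would each be handled by one call, for example sys.stdout = reencode_stream(sys.stdout, opts.encoding).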

On Fri, Jun 8, 2012 at 10:14 AM, Rurpy <rurpy@yahoo.com> wrote:
On 06/07/2012 03:45 PM, Nick Coghlan wrote:
If user level code doesn't want those streams, it needs to replace them with something else.
Yes, this is what the code I googled up does:
    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
But that code is not obvious to someone who has been able to do all his encoded IO (with the exception of sys.stdout) using just the encoding parameter of open(). Hence my question if something like a set_encoding() method/function that would work on sys.stdout is feasible. I don't see an answer to that in your statement above.
Right, I was only trying to explain why the standard streams are a special case - because they're also used by the interpreter, and it makes the startup process much simpler if the interpreter retains complete control over the way they're initialised (it's already complicated by the fact we need to get something half-usable in place as sys.stderr so that error reporting is possible while initialising them properly). It then becomes an application level operation to replace them if desired.
We can (and do) make the internal standard stream initialisation configurable, but it then becomes a UI design problem to get something that balances flexibility against complexity. PYTHONIOENCODING (in association with OS utilities that make it possible to set an environment variable for a specific process invocation, as well as support in the subprocess module for passing a tailored environment to subprocesses) is our current solution.
The interpreter design aims, first and foremost, to provide a simple and straightforward experience in POSIX environments that use UTF-8 everywhere (since that's the most sane approach available for migrating from a previously ASCII-based computing world). Windows is a bit trickier (due to the internal use of UTF-16 APIs and the lack of POSIX-style support for temporarily setting an environment variable when invoking a process from the shell), but correctly supporting that environment is also a very high priority. The fallback behaviours when these situations do not apply are designed to work best on systems that are, at least somewhat *locally* consistent.
The real world is complex. Eventually, our answer has to be "handle it at the application level, there are too many variations for us to support it directly at the interpreter level". Currently, any standard stream encoding related problem that can't be handled with PYTHONIOENCODING is just such a situation. We know it sucks for multi-encoding environments, but those are a nightmare for a lot of reasons and are the main drivers behind the industry-wide effort to standardise on Unicode text handling, including universal encodings like UTF-8.
So now we're down to the question of how much complexity we're willing to tolerate in the interpreter specifically for the sake of environments where:
1. The automatic standard stream encoding calculation gives the wrong answer
2. The PYTHONIOENCODING override is insufficient
3. The application being executed isn't already handling the problem
4. A -m executable helper module (or directly executable helper script) can't be used to initialise the standard streams correctly before continuing on to execute the requested application via the runpy module
And the answer is "not much". About the only likely way forward I can see for streamlining this situation would be to treat this as another use case for http://bugs.python.org/issue14803, which proposes the ability to run snippets of Python code prior to execution of __main__.
I do agree that "create a new IO object that is like this old IO object but with these settings changed" could probably do with a better official API, but such an API needs to be designed with a respect for the issues associated with changing encodings "on the fly" and ask serious questions about whether or not we should be encouraging that practice by making it easier than it is already. I thought I had posted a tracker issue to that effect, but I can't find it now.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
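As an illustration of the PYTHONIOENCODING route described above, the sketch below starts a child interpreter with a tailored environment through the subprocess module and asks it which encoding its stdout ended up with. The function name child_stdout_encoding and the choice of shift_jis are assumptions made for the example.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set; return what it reports."""
    # Copy the current environment and override one variable for the child only.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Codec names are ASCII, so the raw bytes decode the same way
    # regardless of which encoding the child used.
    return result.stdout.decode("ascii")


reported = child_stdout_encoding("shift_jis")
```

From a POSIX shell the same per-invocation override is usually written as PYTHONIOENCODING=shift_jis python myprog.py. As noted in the message, it applies one setting to the process as a whole rather than one per stream.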

On 06/07/2012 07:01 PM, Nick Coghlan wrote:
On Fri, Jun 8, 2012 at 10:14 AM, Rurpy <rurpy@yahoo.com> wrote:
On 06/07/2012 03:45 PM, Nick Coghlan wrote:
If user level code doesn't want those streams, it needs to replace them with something else.
Yes, this is what the code I googled up does:
    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
But that code is not obvious to someone who has been able to do all his encoded IO (with the exception of sys.stdout) using just the encoding parameter of open(). Hence my question if something like a set_encoding() method/function that would work on sys.stdout is feasible. I don't see an answer to that in your statement above.
First, thanks for the detailed response.
Right, I was only trying to explain why the standard streams are a special case - because they're also used by the interpreter, and it makes the startup process much simpler if the interpreter retains complete control over the way they're initialised (it's already complicated by the fact we need to get something half-usable in place as sys.stderr so that error reporting is possible while initialising them properly). It then becomes an application level operation to replace them if desired.
OK, I can see that as a use-case design principle. I still don't see any hard technical reason why the same streams could not be kept, simply allowing their encodings to be reset if they haven't been used yet. In other words, does that principle provide sufficient value to compensate for ruling out several possible solutions based on modifying the current stream rather than rewrapping it?
We can (and do) make the internal standard stream initialisation configurable, but it then becomes a UI design problem to get something that balances flexibility against complexity. PYTHONIOENCODING (in association with OS utilities that make it possible to set an environment variable for a specific process invocation, as well as support in the subprocess module for passing a tailored environment to subprocesses) is our current solution.
The interpreter design aims, first and foremost, to provide a simple and straightforward experience in POSIX environments that use UTF-8 everywhere (since that's the most sane approach available for migrating from a previously ASCII-based computing world). Windows is a bit trickier (due to the internal use of UTF-16 APIs and the lack of POSIX-style support for temporarily setting an environment variable when invoking a process from the shell), but correctly supporting that environment is also a very high priority. The fallback behaviours when these situations do not apply are designed to work best on systems that are, at least somewhat *locally* consistent.
But networks, shared file systems, email, etc. have all blurred the concept of localness. Just because I am running my program on a Unix machine does not mean I may not need to write files with '\r\n' line endings. Perhaps another way to view it is that Python is wrongly subsuming part of the problem space into the system space. The need to read or write disparate encodings is a function of the problem being addressed (which includes how problem data is encoded just as much as whether it is formatted as CSV or as labeled name-value pairs); it's not really a function of my local system environment.
The real world is complex. Eventually, our answer has to be "handle it at the application level, there are too many variations for us to support it directly at the interpreter level". Currently, any standard stream encoding related problem that can't be handled with PYTHONIOENCODING is just such a situation. We know it sucks for multi-encoding environments, but those are a nightmare for a lot of reasons and are the main drivers behind the industry-wide effort to standardise on Unicode text handling, including universal encodings like UTF-8.
I think "nightmare" is a little too strong. PITA maybe, particularly before one has gotten tools and environment worked out. Eventually one can get used to seeing Windows path separators displayed as yen signs in cmd.exe windows. :-) I think of it as just another annoyance imposed by the real world -- like making sure backups run exactly once a night even across DST changes.
So now we're down to the question of how much complexity we're willing to tolerate in the interpreter specifically for the sake of environments where:
1. The automatic standard stream encoding calculation gives the wrong answer
2. The PYTHONIOENCODING override is insufficient
3. The application being executed isn't already handling the problem
4. A -m executable helper module (or directly executable helper script) can't be used to initialise the standard streams correctly before continuing on to execute the requested application via the runpy module
In all of the options you give above (except 3, and maybe 4; I use -m only for pdb), there seems to me to be an implicit assumption that there is a single encoding that needs to be determined. But that is wrong. There are three streams and each of those streams may need a different encoding. Python gets this in the case of explicitly opened files... no one would dream of having a sys.encoding setting replace the open(encoding=...) parameter. What Python is missing is that the same applies to stdin, stdout and stderr. PYTHONIOENCODING is fine for what it is; it is just not meant for my particular issue. My proposal was simply to allow your option (3) to address this. (Or more accurately, that it address this on a near equal footing to explicitly opened streams for reasons of both ease of use and Python API consistency.)
And the answer is "not much". About the only likely way forward I can see for streamlining this situation would be to treat this as another use case for http://bugs.python.org/issue14803, which proposes the ability to run snippets of Python code prior to execution of __main__.
That (IIUC) would not be workable for my problem.
    ./myprog.py -e sjis,sjis [other options...]
is acceptable. Something like:
    python -C 'sys.stdin=...; sys.stdout=...' myprog.py [other options...]
would not be. And since you mentioned it above, nor would:
    python -m setstdin_sjis -m setstdout_sjis myprog.py [other options...]
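A rough sketch of how a program like myprog.py might handle such an option at the application level. The reading of "-e sjis,sjis" as "stdin encoding, stdout encoding" is an assumption, as are the function names and the utf-8 default; the demo uses in-memory streams in place of sys.stdin.buffer and sys.stdout.buffer.

```python
import argparse
import io


def parse_encodings(argv):
    """Parse a hypothetical -e IN,OUT option into two encoding names."""
    parser = argparse.ArgumentParser(prog="myprog.py")
    parser.add_argument("-e", "--encoding", default="utf-8,utf-8",
                        help="stdin and stdout encodings, as IN,OUT")
    opts = parser.parse_args(argv)
    in_enc, _, out_enc = opts.encoding.partition(",")
    # A single name such as "-e sjis" is taken to apply to both streams.
    return in_enc, out_enc or in_enc


def wrap_std(binary_in, binary_out, in_enc, out_enc):
    """Wrap binary streams as text streams with separate encodings."""
    return (io.TextIOWrapper(binary_in, encoding=in_enc),
            io.TextIOWrapper(binary_out, encoding=out_enc))


# Demo: read Shift JIS input, write EUC-JP output.
in_enc, out_enc = parse_encodings(["-e", "sjis,euc_jp"])
src = io.BytesIO("日本語".encode(in_enc))
dst = io.BytesIO()
text_in, text_out = wrap_std(src, dst, in_enc, out_enc)
text_out.write(text_in.read())
text_out.flush()
```

In the real script the last step would be sys.stdin, sys.stdout = wrap_std(sys.stdin.buffer, sys.stdout.buffer, in_enc, out_enc), done before anything reads or writes.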
I do agree that "create a new IO object that is like this old IO object but with these settings changed" could probably do with a better official API, but such an API needs to be designed with a respect for the issues associated with changing encodings "on the fly" and ask serious questions about whether or not we should be encouraging that practice by making it easier than it is already. I thought I had posted a tracker issue to that effect, but I can't find it now.
I think that being unable to easily change stream encoding before first use is orders of magnitude more important than being unable to change them on-the-fly. I mentioned the latter only because I thought it might fall out naturally from fixing the first problem, and might occasionally be useful. (I mentioned a couple cases I've encountered but even I, who am very much in favor of generality, have to admit I think the uses are rare.) I acknowledge though that even a before-first-use api (which I think could be implemented before an on-the-fly one) would have to take the possible later existence of the latter into account.
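For later readers: Python 3.7 added a reconfigure() method to io.TextIOWrapper, which covers much of the "change the encoding of the existing stream" case discussed here. The sketch below uses an in-memory stream as a stand-in for sys.stdout; it is an illustration of that method, not a description of anything proposed in this thread.

```python
import io

# Stand-in for sys.stdout, initially set up as UTF-8.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

# Change the encoding in place, before anything has been written.
stream.reconfigure(encoding="shift_jis")
stream.write("日本語")
stream.flush()
written = stream.buffer.getvalue()
```

On a real interpreter the equivalent call is sys.stdout.reconfigure(encoding=opts.encoding). The documentation notes that the encoding cannot be changed once data has been read from a stream, which is close in spirit to the before-first-use restriction for sys.stdin.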

On Fri, Jun 8, 2012 at 11:57 PM, Rurpy <rurpy@yahoo.com> wrote:
On 06/07/2012 07:01 PM, Nick Coghlan wrote:
On Fri, Jun 8, 2012 at 10:14 AM, Rurpy <rurpy@yahoo.com> wrote:
sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer) But that code is not obvious to someone who has been able to do all his encoded IO (with the exception of sys.stdout) using just the encoding parameter of open().
Well, you could do it with sys.stdout too, if you did it as part of open. Unfortunately, by the time your code comes along, it is already open -- and may well have already been written to.
OK, I can see that as a use-case design principle. I still don't see any hard technical reason why the same streams could not be kept, simply allowing their encodings to be reset if they haven't been used yet.
Unfortunately, that leads to very fragile code, that will break unexpectedly because something totally unrelated decided to write a license message to stdout.
But networks, shared file systems, email, etc. have all blurred the concept of localness. Just because I am running my program on a Unix machine does not mean I may not need to write files with '\r\n' line endings.
So write a file, instead of stdout... stdin/stdout is more convenient for pipes, but most such programs do have -i and -o flags for cases like yours.
seems to be an implicit assumption that there is a single encoding that needs to be determined.
Which is reasonable; they aren't the only input/output, they are the *standard* input and output. If they have different encodings, they aren't really standard. (I have some sympathy for a more lenient encoding on stderr.)
That (IIUC) would not be workable for my problem.
./myprog.py -e sjis,sjis [other options...]
is acceptable. Something like:
python -C 'sys.stdin=...; sys.stdout=...' myprog.py [other options...]
would not be.
Tastes differ; I actually prefer the second, as more explicit.
I think that being unable to easily change stream encoding before first use is orders of magnitude more important than being unable to change them on-the-fly.
Yes, but since we're talking specifically about streams you don't start, that just makes for fragile code that breaks in the field. -jJ
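The fragility concern can be made concrete with a small sketch: if unrelated code writes to the stream before the encoding is switched, the output ends up with two encodings in it. The in-memory stream and the sample strings are assumptions made for the demo.

```python
import io

raw = io.BytesIO()  # stand-in for the terminal or pipe behind stdout

# Some unrelated code writes first, under the startup encoding.
first = io.TextIOWrapper(raw, encoding="utf-8")
first.write("© license notice ")
first.flush()

# The application then switches the stream to a different encoding.
second = io.TextIOWrapper(first.detach(), encoding="latin-1")
second.write("café")
second.flush()

mixed = raw.getvalue()

# The combined bytes are no longer valid in the first encoding.
try:
    mixed.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

A consumer that expects a single encoding either fails on such output or shows mojibake for part of it. An API that refuses to switch after first use avoids the mixed output, but then it fails whenever that unrelated early write happens, which is the breakage described above.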
participants (4): Jim Jewett, Nathan Schneider, Nick Coghlan, Rurpy