changing sys.stdout encoding

In my first foray into Python3 I've encountered this problem: I work in a multi-language environment. I've written a number of tools, mostly command-line, that generate output on stdout. Because these tools and their output are used by various people in varying environments, the tools all have an --encoding option to provide output that meets the needs and preferences of the output's ultimate consumers.

In converting them to Python3, I found the best (if not very pleasant) way to do this was to put something like this near the top of each tool [*1]:

    import codecs
    sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)

What I want to be able to put there instead is:

    sys.stdout.set_encoding(opts.encoding)

The former I found on the internet -- there is zero probability I could have figured that out from the Python docs. It is obscure to anyone who hasn't encountered it before or dealt much with the codecs module (someone who, like me, has generally only needed to deal with .encode() and .decode()). It is excessively complex for what is conceptually a simple and straightforward operation. It requires the import of the codecs module in programs that otherwise don't need it [*2], and the reading of the codecs docs (not a shining example of clarity themselves) to understand it. In short, it is butt ugly relative to what I generally get in Python.

Would it be feasible to provide something like .set_encoding() on textio streams? (Or make .encoding a writeable property? It seems to intentionally be non-writeable for some reason, but is that reason really unavoidable?) If doing this for textio in general is too hard, then what about encapsulating the codecs stuff above in a sys.set_encoding() function? Needing to change the encoding of a sys.std* stream is not an uncommon need, and a user should not have to go through the codecs dance above to do so IMO.

----
[*1] There are other ways to change stdout's encoding but they all have problems AFAICT. PYTHONIOENCODING can't easily be changed dynamically within the program. Reopening stdout as binary, or using the binary interface to text stdout, requires an explicit encode call at each write site. Overloading print() is obscure because it requires the reader to notice print was overloaded.
[*2] I don't mean the actual import of the codecs module, which occurs anyway; I mean the extra visual and cognitive noise introduced by the presence of the import statement in the source.
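
For concreteness, the idiom above can be wrapped once and reused. A minimal sketch, assuming nothing has yet been written to stdout (the helper name set_stdout_encoding is an illustration, not an existing API):

    import codecs
    import sys

    def set_stdout_encoding(encoding):
        # Rebind sys.stdout to a writer that encodes output with 'encoding'.
        # Call once, near program start, before anything is written.
        sys.stdout.flush()
        sys.stdout = codecs.getwriter(encoding)(sys.stdout.buffer)

    # e.g. right after option parsing in one of the tools described above:
    # set_stdout_encoding(opts.encoding)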

Rurpy writes:
It is excessively complex for what is conceptually a simple and straight-forward operation.
The operation is not conceptually straightforward. The problem is that you can't just change the encoding of an open stream; encodings are generally stateful. The straightforward way to deal with this issue is to close the stream and reinitialize it. Your proposed .set_encoding() method implies something completely different about what's going on. I wouldn't object to a method with the semantics of reinitialization, but it should have a name implying reinitialization. It should probably also raise an error if the stream is open and has been written to.
I suspect needing to *change* the encoding of an open stream is generally quite rare. Needing to *initialize* the std* streams with an appropriate codec is common. That's why it doesn't so much matter that PYTHONIOENCODING can't be changed within a program. I agree that use of PYTHONIOENCODING is pretty awkward.

2012/6/5 Stephen J. Turnbull <stephen@xemacs.org>
What do you think of the following method TextIOWrapper.reset_encoding? (the assert statements should certainly be replaced by some IOError)::

    def reset_encoding(self, encoding, errors='strict'):
        if self._decoder:
            # No decoded chars awaiting read
            assert self._decoded_chars_used == len(self._decoded_chars)
            # Nothing in the input buffer
            buf, flag = self._decoder.getstate()
            assert buf == b''
        if self._encoder:
            # Nothing in the output buffer
            buf = self._encoder.encode('', final=True)
            assert buf == b''
        # Reset the decoders
        self._decoder = None
        self._encoder = None
        # Now change the encoding
        self._encoding = encoding
        self._errors = errors

-- Amaury Forgeot d'Arc

Amaury Forgeot d'Arc writes:
I think that it's an attractive nuisance because it doesn't close the stream, and therefore permits changing the encoding without any warning partway through the stream. There are two reasonable (for a very generous definition of "reasonable"<wink/>) ways to handle multiple scripts in one stream: Unicode and ISO 2022. Simply changing encodings in the middle is a recipe for disaster in the absence of a higher-level protocol for signaling this change (that's the role ISO 2022 fulfils, but it is detested by almost everybody...). If you want to do that kind of thing, the "import codecs; sys.stdout = ..." idiom is available, but I don't see a need to make it convenient.

But the OP's request is pretty clearly not for a generic .set_encoding(), it's for a more convenient way to initialize the stream for users.

Aside to Victor: at least on Mac OS X, I find that Python 3.2 (current MacPorts, I can investigate further if you need it) doesn't respect the language environment as I would expect it to. "LC_ALL=ja_JP.UTF8 python32" will give me an out-of-range Unicode error if I try to input Japanese using "import sys; sys.stdin.readline()" -- I have to use "PYTHONIOENCODING=UTF8" to get useful behavior.

There may also be cases where multiple users with different language needs are working at the same workstation. For both of these cases a command-line option to initialize the encoding would be convenient.

On Wed, Jun 6, 2012 at 1:28 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
For both of these cases a command-line option to initialize the encoding would be convenient.
Before adding yet-another-command-line-option, the cases where the existing environment variable support can't be used from the command line, but a new option could be, should be clearly enumerated.

    $ python3
    Python 3.2.1 (default, Jul 11 2011, 18:54:42)
    [GCC 4.6.1 20110627 (Red Hat 4.6.1-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Jun 5, 2012 at 11:14 PM, Rurpy <rurpy@yahoo.com> wrote:
You just need to use the "set" command/built-in (http://ss64.com/nt/set.html ; or the PowerShell equivalent) to set the environment variable. It's 1 extra line. Blame Windows for not being POSIXy enough. Cheers, Chris

On 06/06/2012 12:32 AM, Chris Rebert wrote:
There's a lot more than that I blame Windows for. :-) There's another extra line to restore the environment to its original setting too. And when you forget to do that, remember to straighten out the output of the next python program you run. Also, does not PYTHONIOENCODING affect all three streams? That would rule it out of consideration in my use case. But even if not, I'm sorry, compared with running a single command with an encoding option, I think messing with environment variables is not really a workable solution. About the closest I could come to doing this in practice would be to wrap each python program up in a .bat script. This is really a case of the Python tail wagging the application dog.

Rurpy writes:
You have a workable 2-line solution, which you posted. It's ugly and hard to find, and it should be, to discourage people from thinking it's something they might *want* to do. But they shouldn't; people in multilingual environments should be using UTF-8 externally unless they have really really special needs (and even then they should probably be using UTF-8 embedded in markup that serves those needs).
This is really a case of the Python tail wagging the application dog.
If you need to do it often, just make a function out of it. It doesn't need to be a built-in.

On 06/06/2012 02:39 AM, Stephen J. Turnbull wrote:
Please don't misunderstand why I posted... as you say, my code now works fine and I understand how to handle this problem when I encounter it in the future. I took the time to post here because it took an inordinate amount of effort to find a solution to a legitimate need (your opinion to the contrary notwithstanding), and the resulting code, which should have been trivially simple and obvious, wasn't. It is a minor issue but the end result of experiences like this, although infrequent, is often "WTF, why is this simple and reasonable thing so hard to do?". And after a few times some programmers will start to wonder if maybe Python is not really an industrial-strength language -- one in which they can be effective all the time, even when the problem falls outside the 95% demographic. (And I am not talking about things totally out of python's scope like high performance computing or systems programming.)
I wanted to do it because it was the correct design choice. The suggestion that redesigning an entire existing technical and personnel infrastructure to use utf-8 is a better choice is, well, never mind. It is not the place of language designers to intentionally make it hard to solve legitimate problems. There *are* other encodings in the world, there will be for some time to come, and some programmers will sometimes have to deal with that. Non-utf-8 encodings are not so evil (except in the minds of some zealots) that working with them conveniently should be made difficult. (I am reminded of the Unix zealots of days past who refused to deal with Windows line endings.) The way I chose to deal with the encoding requirements I had was the correct way. It's unfortunate that Python makes it uglier than it should be. The discussion seems to be going off topic for this list. I understand there is no support here for providing a non-obscure, programmatic way of changing the encoding of the standard streams at program startup and that's fine, it was a suggestion. Thank you all for the feedback.

On 7 June 2012 03:34, Rurpy <rurpy@yahoo.com> wrote:
One suggestion, which would probably shed some light on whether this should be viewed as something "simple and reasonable", would be to do some research on how the same task would be achieved in other languages. I have no experience to contribute but my intuition says that this could well be hard on other languages too. Would you be willing to do some web searches to look for solutions in (say) Java, or C#, or Ruby? In theory, it shouldn't take long (and if it does, you can conclude that the solution is obscure to the same extent that it is with Python). Even better, if those other languages do have a simple solution, it may suggest an approach that would be appropriate for Python. Paul.

On 06/07/2012 12:27 AM, Paul Moore wrote:
Yes, that is a good idea. If I decide to reraise this suggestion at some point, I will try to do as you suggest.
I have no experience to contribute but my intuition says that this could well be hard on other languages too.
Again, I have yet to be convinced this is hard. I am very sceptical it is hard in the case of streams before they've been written or read. Replacing sys.stdout with a wrapper that encodes with the alternate encoding clearly works -- it just needs to be encapsulated so the user doesn't need to figure out all the details in order to use it.

The interpreter uses the standard streams internally, and they're one of the first things created during interpreter startup. User provided code doesn't start running until well after they're initialised. If user level code doesn't want those streams, it needs to replace them with something else. Cheers, Nick. -- Sent from my phone, thus the relative brevity :) On Jun 8, 2012 7:03 AM, "Rurpy" <rurpy@yahoo.com> wrote:

Rurpy writes:
I don't think I said the need was illegitimate, if I did I apologize, and I certainly don't believe it is (I'm an economist by trade -- de gustibus non est disputandum). I just don't think it's necessary for Python to try to address the problem, because the problem is somebody else's bad design at root. And I don't think it would be wise to try to do it in a very general way, because it's very hard to do that at the general level of the language.
You're wrong. There is *some* support for that. It just has to be done safely, and that means that a generic .set_encoding() method that can be called after I/O has been performed probably isn't going to happen. And it might not happen at the core level; since a 3-line function can do the job, it might make just as much sense to put up a package on PyPI.

On 06/07/2012 01:12 AM, Stephen J. Turnbull wrote:
I don't understand that argument. The world is full of bad design that Python has to address: daylight savings time, calendars, floating-point (according to some). Good/bad design is not even constant and changes with time. There is still a telnetlib module in stdlib despite the existence of ssh. I suspect the vast majority of programmers are interested in a language that allows them to *effectively* get done what they need to, whether they are working on the latest agile TDD REST server or modifying some legacy text files. What I for one *don't* need is to have my programming language enforcing its idea of CS political correctness on me.

Secondly, the disparity in ease of use of an alternate encoding on sys.stdout is not really between utf8 and non-utf8; it is between a default encoding (which may be non-utf8) and the encoding I wish to use. So one can't really attribute it to a desire to improve the world by making non-utf8 harder to use! And even were I to accept your argument, Python is inconsistent: when I open a file explicitly there is only a slight penalty for opening a non-default-encoded file (the need to explicitly give an encoding):

    f = open("myfile", "w")  # my default utf8 encoding
    print("text string", file=f)

vs

    f = open("myfile", "w", encoding="sjis")  # non-utf8
    print("text string", file=f)

But for sys.stdout, the penalty for using an alternate encoding is to google around for a solution (which may not be optimal, as Victor Stinner pointed out) and then read about codecs and the StreamWriter wrapper, textio wrappers and the .buffer attribute. And the reading part is then repeated by all those (at the same level of python expertise) who read the program.

All I can do is repeat what I said before: non-utf8 encodings exist and are widely used. That's a simple fact. Sample some .jp web sites and look at the ratio of shift-jis web pages to utf-8 web pages, for example. utf-8 is an encoding. shift-jis is an encoding. Sure, I understand that utf-8 is preferable and I will use it when possible. The fact that I am writing shift-jis means that utf-8 *isn't* possible in this case. Since utf-8 and shift-jis are both encodings and are equivalent from a coding viewpoint (a simple choice of which codec to use), the discrepancy in ease of use between the two in the case of writing to the standard streams is not justifiable and should be corrected if possible.
But is it? Or are you referring to switching encoding on-the-fly? (see below).
There are two sub-threads in this discussion:
1) Providing a more convenient and discoverable way to programmatically change the encoding of std* streams before first use.
2) Changing the encoding used on the std* stream or any textio stream on the fly, as a generalization of (1).
I thought I made clear I was advocating for (1) and not (2) when I earlier wrote in reply to you:
As for (2), you have pointed out some potential issues with switching encodings midstream. I don't understand how codecs work in Python sufficiently yet to either agree or disagree with you. I have however questioned some of the statements made regarding its difficulty (and am holding my opinion open until I understand the issues better), but I am not (as I've stated) advocating for it now. Sorry if I failed to make the distinction clearer. My use of .set_encoding() as a placeholder for both ideas probably contributed to the confusion.
I wasn't suggesting a change to the core level (if by that you mean to the interpreter). I was asking if some way could be provided (easier and more reliable than googling around for a magic incantation) to change the encoding of one or more of the already-open-when-my-program-starts sys.std* streams. I presume that would be a standard library change (in either the io or sys modules) and offered a .set_encoding() method as a placeholder for discussion. I hardly think it is worth the effort, for either the producer or consumers, of putting a 3-line function on PyPI. Nor would such a solution address the discoverability and ease-of-use problems I am complaining about. An inferior and bare-minimum way to address this would be to at least add a note about how to change the encoding to the sys.std* documentation. That encourages cargo-cult programming and doesn't address the WTF effect but it is at least better than the current state of affairs.

On Thu, Jun 7, 2012 at 4:48 PM, Rurpy <rurpy@yahoo.com> wrote:
Others have raised the question this begs to have answered: how do other programming languages deal with wanting to change the encoding of the standard IO streams? Can you show us how they do things that's so much easier than what Python does?
The proper encoding for the standard IO streams is generally a property of the environment, and hence is set in the environment. You have a use case where that's not the case. The argument is that your use case isn't common enough to justify changing the standard library. Can you provide evidence to the contrary? Other languages that make setting the encoding on the standard streams easy, or applications outside of those built for your system that have a "--encoding" type flag?
Why presume that this needs a change in the library? The method is straightforward, if somewhat ugly. Is there any reason it can't just be documented, instead of added to the library? Changing the library would require a similar documentation change. <mike

On 8.06.2012 00:00, Mike Meyer wrote:
Mercurial:

    ...
    --debug              enable debugging output
    --debugger           start debugger
    --encoding ENCODE    set the charset encoding (default: UTF-8)
    --encodingmode MODE  set the charset encoding mode (default: strict)
    --traceback          always print a traceback on exception
    ...

Niki

On Thu, Jun 7, 2012 at 5:00 PM, Mike Meyer <mwm@mired.org> wrote:
Agreed. The problem is that your use case gets hit by several special cases at once.

Usually, you don't need to worry about encodings at all; the default is sufficient. Obviously not the case for you.

Usually, the answer is just to open a file (or stream) the way you want to. sys.stdout is special because you don't open it.

If you do want to change sys.stdout, usually the answer is to replace it with a different object. Apparently (though I missed the reason why) that doesn't work for you, and you need to keep using the same underlying stream.

So at that point, replacing it with a wrapped version of itself probably *is* the simplest solution. The remaining problem is how to find the least bad way of doing that. Your solution does work. Adding it as an example to the docs would probably be reasonable, but someone seems to have worked pretty hard at keeping the sys module documentation short. I could personally support a wrap function on the sys.std* streams that took care of flushing before wrapping, but ... there is a cost, in that the API gets longer, and therefore harder to learn.
There are plenty of applications with an encoding flag; I'm not sure how often it applies to sys.std*, as opposed to named files. -jJ
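
A rough sketch of the kind of wrap function described above, under the assumption that it flushes the existing text layer and then wraps the underlying binary buffer with a codecs reader or writer (the name wrap_std_stream and its placement are illustrative only):

    import codecs
    import sys

    def wrap_std_stream(name, encoding, errors='strict'):
        # name is 'stdin', 'stdout' or 'stderr'
        stream = getattr(sys, name)
        if name == 'stdin':
            wrapped = codecs.getreader(encoding)(stream.buffer, errors)
        else:
            stream.flush()  # flush pending text before wrapping the buffer
            wrapped = codecs.getwriter(encoding)(stream.buffer, errors)
        setattr(sys, name, wrapped)

    # wrap_std_stream('stdout', 'shift_jis')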

On 11 June 2012 15:21, Jim Jewett <jimjjewett@gmail.com> wrote:
I also think I missed something in this thread. At the beginning of the original thread it seemed that everyone was agreed that

    writer = codecs.getwriter(desired_encoding)
    sys.stdout = writer(sys.stdout.buffer)

was a reasonable solution (with the caveat that it should happen before any output is written). Is there some reason why this is not a good approach? The only problem I know of is that under Python 2.x it becomes an error to print _already_ encoded strings (they get decoded as ascii before being encoded) but that's probably not a problem for an application that takes a disciplined approach to unicode.

Oscar Benjamin writes:
It's undocumented and unobvious, but it's needed for standard stream filtering in some environments -- where a lot of coding is done by people who otherwise never need to understand streams at anything but a superficial level -- and the analogous case of a newly opened file, pipe, or socket is documented and obvious, and usable by novices. It's a damn shame that we can't say the same about the stdin, stdout, and stderr streams (even if I too have been at pains to explain why that's hard to fix).

On Tue, Jun 12, 2012 at 9:58 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm probably missing something, but in all my naivete I have what feels like a simple solution, and I can't seem to see what's wrong with it. In C there used to be a function to set the buffer size on an open stream that could only be called when the stream hadn't been used yet. ISTM the OP's use case would be covered by a similar function on an open TextIOWrapper to set the encoding, one that can only be used when the stream hasn't been used to write (or read) anything yet. When called under any other circumstances it should raise an error. The TextIOWrapper should maintain a "used" flag so that it can raise this exception reliably.

This ought to work for stdin and stdout when used at the start of the program, assuming nothing is written by code run before main starts. (This should normally be fine, otherwise you couldn't use a Python program as a filter at all.) It won't work for stderr if connected to a tty-ish device (since the version stuff is written there) but that should be okay, and it should still be okay with stderr if it's not a tty, since then it starts silent. (But I don't think the use case is very strong for stderr anyway.)

I'm not sure about a name, but it might well be called set_encoding(). The error message when misused should clarify to people who misunderstand the name that it can only be called when the stream hasn't been used yet; I don't think it's necessary to encode that information in the name. (C's setbuf() wasn't called set_buffer_on_virgin_stream() either. :-)

I don't care about the integrity of the underlying binary stream. It's a binary stream, you can write whatever bytes you want to it. But if a TextIOWrapper is used properly, it won't write a mixture of encodings to the underlying binary stream, since you can only set the encoding before reading/writing a single byte. (And the TextIOWrapper is careful not to use the binary stream before the first actual read() or write() call -- it just tries to call tell(), if it's seekable, which should be safe.)

-- --Guido van Rossum (python.org/~guido)
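
As a pure-Python illustration of the semantics described above (not the real TextIOWrapper, which would carry the "used" flag internally), a toy writer with that behaviour might look like:

    import sys

    class EncodedWriter:
        # Toy stand-in: the encoding may be changed only before first write.
        def __init__(self, buffer, encoding, errors='strict'):
            self.buffer = buffer
            self.encoding = encoding
            self.errors = errors
            self._used = False

        def set_encoding(self, encoding, errors='strict'):
            if self._used:
                raise ValueError("stream already used; cannot set encoding")
            self.encoding = encoding
            self.errors = errors

        def write(self, s):
            self._used = True
            return self.buffer.write(s.encode(self.encoding, self.errors))

        def flush(self):
            self.buffer.flush()

    # sys.stdout = EncodedWriter(sys.stdout.buffer, sys.stdout.encoding)
    # sys.stdout.set_encoding('shift_jis')   # allowed: nothing written yet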

On Wed, Jun 13, 2012 at 3:21 PM, Guido van Rossum <guido@python.org> wrote:
I think you're right, and such a method in combination with stream.buffer.peek() should actually handle a lot of encoding detection cases, too. The alternative approaches (calling TextIOWrapper on stream.detach(), or open on stream.fileno()) either break any references to the old stream or else create two independent IO stacks on top of a single underlying file descriptor, which may create some odd behaviour. Being able to set the encoding on a previously unused stream would also interact better with the existing subprocess PIPE API. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
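
For reference, the detach-based alternative mentioned above looks roughly like this (a sketch only; as noted, references to the old sys.stdout object, including sys.__stdout__, become unusable after detach()):

    import io
    import sys

    sys.stdout.flush()
    sys.stdout = io.TextIOWrapper(sys.stdout.detach(),
                                  encoding='shift_jis',  # illustrative choice
                                  line_buffering=True)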

Guido van Rossum writes:
I'm not sure about a name, but it might well be called set_encoding().
I would still prefer "initialize_encoding" or something like that, but the main thing I was worried about was a "consenting adults" function that shouldn't be called after I/O, but *could* be.

On Wed, Jun 13, 2012 at 5:35 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I still don't understand why Python can't support using it after I/O. Is this code wrong? https://gist.github.com/3280063 -- INADA Naoki <songofacandy@gmail.com>

The write buffer can be flushed, so I don't see the problem of changing the encoding of stdout and stderr (except potential mojibake). For stdin, TextIOWrapper has a readahead algorithm, so changing the encoding may seek backward. It cannot be done if stdin is not seekable (e.g. if stdin is a pipe). I wrote a Python implementation of set_encoding; see my patch attached to the issue: http://bugs.python.org/15216
Victor
On 7 Aug 2012 02:52, "INADA Naoki" <songofacandy@gmail.com> wrote:

Rurpy writes:
Python is inconsistent:
Yup, and I said there is support for dealing with that inconsistency. At least I'm +1 and Nick's +0.5. So let's talk about what to do about it. Nick has a pretty good channel on the BDFL, and since he doesn't seem to like an addition to the stdlib here, it may not go far. But I don't see a reason to rule out stdlib changes yet. As far as I'm concerned, there are three reasonable proposals:
[S]ince a 3-line function can do the job, it might make just as much sense to put up a package on PyPI.
Agreed that it's pretty weak, but it's not clear that other solutions will be much better in practice. Discoverability depends on documentation, which can be written and improved. I think "ease of use" is way off-target.
Changing the stdlib is not a panacea. In particular, it can't be applied to older Pythons. I'm also not convinced (cf. Nick's post) that there's enough value-added and a good name for the restricted functionality we know we can provide.
IMO, this may be the best, but again I doubt it can be added to older versions. As for the "cargo cult" and "WTF" issues, I have little sympathy for either. The real WTF problem is that multi-encoding environments are inherently complex and irregular (ie, a WTF waiting to happen), and Python can't fix that. It's very unlikely that typical programmers will bother to understand what happens "under the hood" of a stdlib function/method, so that is no better than cargo-cult programming (and cargo-cult at least has the advantage that what is being done is explicit, allowing programmers who understand textio but not encodings to figure out what's happening).

On 06/08/2012 05:11 AM, Stephen J. Turnbull wrote:
Which were (summarizing, please correct if wrong):

1) A package on PyPI containing a function like

    import codecs
    def rewrap_stream_with_new_encoding(old_stream, encoding):
        new_stream = codecs.getwriter(encoding)(old_stream.buffer)
        return new_stream

   (or maybe three functions for each of the std* streams, without the 'old_stream' parameter?)

2) Modify standard lib. Add something like a .reset_encoding() method to io.TextIOWrapper? (Name and functionality to be bikeshedded to death.)

3) Modify the standard lib documentation (I assume for sys.std* as described below)

Also
4?) Nathan Schneider suggested a hybrid of (1) and (2): put the function in the codecs module.
If (and when) I had the problem of figuring out how to change sys.stdout encoding, PyPI would be (and was) the last place I'd look. It is just not the kind of problem one looks to a package to solve. Rather like looking in PyPI if you want to capitalize a string. Where I would look is where I did:

* The Python docs io module.
* Then the sys module docs for std*. They say how to change the buffering and how to change to binary. They also say how the default encoding is determined. For this reason, this is where I would put any note about changing the encoding.
* Finally the internet.
* Had I not found an answer there I would have posted to c.l.p.

I don't think I'd have looked on PyPI unless something explicitly pointed me there.
Discoverability depends on documentation, which can be written and improved.
Documentation where?
I think "ease of use" is way off-target.
I would think ease of use would always be a consideration in any API change users were exposed to. Or are you saying some APIs should be discouraged and making them hard to use is better than a "not recommended" note in the documentation? If so I suspect we'll just have to agree to disagree on that. And in this case I don't even see any reason to recommend against it -- writing to sys.stdout is the best answer in the circumstances I've described.
Nothing is ever a panacea. It seems like it could be the cleanest, nicest (long term) solution but clearly the most difficult.
Does it need to be? I'd have thought this would just be a doc issue on the tracker (although perhaps getting agreement of the wording would be hard?)
But the WTF comes not from multi-encoding (in which case it would have occurred when the problem requirements were received) but from observing that doing the necessary output to a file is easy as pie, but doing the same to stdout (another file) isn't. Python can avoid making a less than ideal situation (multi-encoding) worse by not making it harder than necessary to do what needs to be done.
The point though is that programmers don't need to look under the hood -- the fact that something is in stdlib means (at least ideally) it is documented as a black box. What goes in, what comes out, the relationship between the two, and any side effects are all concisely, fully and accurately described (again, in an ideal world). But with a code snippet and a comment that says "use this to change the encoding of sys.stdout", the programmer has to figure out everything himself. (Of course that's not totally bad -- I know a lot more about text IO streams than I did 3 days ago. :-) Sure, you could document the code snippet as well as a packaged function, but that's stretching our ideal world well past the breaking point -- it doesn't happen. :-)
True it's a double edged sword but I prefer to use code packaged in stdlib. If I didn't I would cut and paste from there and I don't :-) Also, there are programmers who understand encoding but not textio (I'm one) but I'll concede we are probably a minority.

On 06/05/2012 01:37 PM, Stephen J. Turnbull wrote:
I'm not sure why stateful matters. When you change encoding you discard whatever state exists and start with the new encoder in its initial state. If there is a partially en/decoded character then wouldn't you do the same thing you'd do if the same condition arose at EOF?
You are correct that my current concern is reinitializing the encoding(s) of the sys.std* streams prior to doing any operations with them. I thought that changing the encoding at any point would be a straight-forward generalization. However I have in the past encountered mixed-encoding-outputting programs in two contexts: generating test data (I think it was for automatic detection and extraction of information), and bundling multiple differently-encoded data sets in one package that were pulled apart again downstream. That both uses probably could have been designed better is irrelevant; a hypothetical python programmer's job would have been to produce a python program that would fit into the existing processes. However I don't want to dwell on this because it is not my main concern now; I thought I would just mention it for the record.
I agree that use of PYTHONIOENCODING is pretty awkward.

Rurpy writes:
I'm not sure why stateful matters. When you change encoding you discard whatever state exists
How do you know what *I* want to do? Silently discarding buffer contents would suck.
If there is a partially en/decoded character then wouldn't you do the same thing you'd do if the same condition arose at EOF?
Again speaking for *myself*, almost certainly not. On input, if it happens *before* EOF it's incomplete input, and I should wait for it to be completed. If it happens on output, there's a bug somewhere, and I probably want to do some kind of error recovery.
No, it's not irrelevant that it's bad design. Python should not go out of its way to cater to bad design, if bad design can be worked around with existing facilities. Here there are at least two ways to do it: the method of changing sys.std*'s text encoding that you posted, and switching sys.std* to binary and doing explicit encoding and decoding of strings to be input or output. I have also encountered mixed encoding, in my students' filesystems (it was not uncommon to see /home/j.r.exchangestudent/KOI8-R/SHIFT_JIS and similar). That doesn't mean it should be made easier to generate!

2012/6/5 Rurpy <rurpy@yahoo.com>:
What happens if the specified encoding is different than the encoding of the console? Mojibake? If the output is used as the input of another program, does the other program use the same encoding? In my experience, using an encoding different than the locale encoding for input/output (stdout, environment variables, command line arguments, etc.) causes various issues. So I'm curious about your use cases.
In Python 3, you should use io.TextIOWrapper instead of codecs.StreamWriter. It's more efficient and has less bugs.
What I want to be able to put there instead is:
sys.stdout.set_encoding (opts.encoding)
I don't think that your use case merits a new method on io.TextIOWrapper: replacing sys.stdout does work and should be used instead. TextIOWrapper is generic and your use case is specific to sys.std* streams. It would be surprising to change the encoding of an arbitrary file after it is opened. At least, I don't see the use case. For example, tokenize.open() opens a Python source code file with the right encoding. It starts by reading the file in binary mode to detect the encoding, and then uses TextIOWrapper to get a text file without having to reopen the file. It would be possible to start with a text file and then change the encoding, but it would be less elegant.
sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
You should also flush sys.stdout (and maybe also sys.stdout.buffer) before replacing it.
It's maybe difficult to change the encoding of sys.stdout at runtime because it is NOT a good idea :-)
Replacing sys.std* works but has issues: output written before the replacement is encoded to a different encoding for example. The best way is to change your locale encoding (using LC_ALL, LC_CTYPE or LANG environment variable on UNIX), or simply to set PYTHONIOENCODING environment variable.
Ah? Detect if PYTHONIOENCODING is present (or if sys.stdout.encoding is the requested encoding), if not: restart the program with PYTHONIOENCODING=encoding.
Overloading print() is obscure because it requires reader to notice print was overloaded.
Why not write the output into a file, instead of stdout? Victor
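
The restart approach Victor sketches above might look something like this (illustrative only; the encoding comparison is naive and a real version would normalize codec names):

    import os
    import sys

    def ensure_io_encoding(encoding):
        # Re-exec the interpreter with PYTHONIOENCODING set if the std*
        # streams are not already using the requested encoding.
        if sys.stdout.encoding.lower() == encoding.lower():
            return
        os.environ['PYTHONIOENCODING'] = encoding
        os.execv(sys.executable, [sys.executable] + sys.argv)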

Rurpy writes:
It is excessively complex for what is conceptually a simple and straight-forward operation.
The operation is not conceptually straightforward. The problem is that you can't just change the encoding of an open stream, encodings are generally stateful. The straightforward way to deal with this issue is to close the stream and reinitialize it. Your proposed .set_encoding() method implies something completely different about what's going on. I wouldn't object to a method with the semantics of reinitialization, but it should have a name implying reinitialization. It probably should also error if the stream is open and has been written to.
I suspect needing to *change* the encoding of an open stream is generally quite rare. Needing to *initialize* the std* streams with an appropriate codec is common. That's why it doesn't so much matter that PYTHONIOENCODING can't be changed within a program. I agree that use of PYTHONIOENCODING is pretty awkward.

2012/6/5 Stephen J. Turnbull <stephen@xemacs.org>
What do you think of the following method TextIOWrapper.reset_encoding? (the assert statements should certainly be replaced by some IOError) :: def reset_encoding(self, encoding, errors='strict'): if self._decoder: # No decoded chars awaiting read assert self._decoded_chars_used == len(self._decoded_chars) # Nothing in the input buffer buf, flag = self._decoder.getstate() assert buf == b'' if self._encoder: # Nothing in the output buffer buf = self._encoder.encode('', final=True) assert buf == b'' # Reset the decoders self._decoder = None self._encoder = None # Now change the encoding self._encoding = encoding self._errors = errors -- Amaury Forgeot d'Arc

Amaury Forgeot d'Arc writes:
I think that it's an attractive nuisance because it doesn't close the stream, and therefore permits changing the encoding without any warning partway through the stream. There are two reasonable (for a very generous definition of "reasonable"<wink/>) ways to handle multiple scripts in one stream: Unicode and ISO 2022. Simply changing encodings in the middle is a recipe for disaster in the absence of a higher-level protocol for signaling this change (that's the role ISO 2022 fulfils, but it is detested by almost everybody...). If you want to do that kind of thing, the "import codecs; sys.stdout = ..." idiom is available, but I don't see a need to make it convenient. But the OP's request is pretty clearly not for a generic .set_encoding(), it's for a more convenient way to initialize the stream for users. Aside to Victor: at least on Mac OS X, I find that Python 3.2 (current MacPorts, I can investigate further if you need it) doesn't respect the language environment as I would expect it to. "LC_ALL=ja_JP.UTF8 python32" will give me an out-of-range Unicode error if I try to input Japanese using "import sys; sys.stdin.readline()" -- I have to use "PYTHONIOENCODING=UTF8" to get useful behavior. There may also be cases where multiple users with different language needs are working at the same workstation. For both of these cases a command-line option to initialize the encoding would be convenient.

On Wed, Jun 6, 2012 at 1:28 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
For both of these cases a command-line option to initialize the encoding would be convenient.
Before adding yet-another-command-line-option, the cases where the existing environment variable support can't be used from the command line, but a new option could be, should be clearly enumerated. $ python3 Python 3.2.1 (default, Jul 11 2011, 18:54:42) [GCC 4.6.1 20110627 (Red Hat 4.6.1-1)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Tue, Jun 5, 2012 at 11:14 PM, Rurpy <rurpy@yahoo.com> wrote:
You just need to use the "set" command/built-in (http://ss64.com/nt/set.html ; or the PowerShell equivalent) to set the environment variable. It's 1 extra line. Blame Windows for not being POSIXy enough. Cheers, Chris

On 06/06/2012 12:32 AM, Chris Rebert wrote:
There's a lot more than that I blame Windows for. :-) There's another extra line to restore the environment to its original setting too. And when you forget to do that remember to straighten out the output of the next python program you run. Also, does not PYTHONIOENCODING affect all three streams? That would rule it out of consideration in my use case. But even if not, I'm sorry, compared with running a single command with an encoding option, I think messing with environment variables is not really a workable solution. About the closest I see to do this in practice would be to wrap each python program up in a .bat script. This is really case of the Python tail wagging the application dog.

Rurpy writes:
You have a workable 2-line solution, which you posted. It's ugly and hard to find, and it should be, to discourage people from thinking it's something they might *want* to do. But they shouldn't; people in multilingual environments should be using UTF-8 externally unless they have really really special needs (and even then they should probably be using UTF-8 embedded in markup that serves those needs).
This is really case of the Python tail wagging the application dog.
If you need to do it often, just make a function out of it. It doesn't need to be a built-in.

On 06/06/2012 02:39 AM, Stephen J. Turnbull wrote:
Please don't misunderstand why I posted... as you say, my code now works fine and I understand how to handle this problem when I encounter it in the future. I took the time to post here because it took an inordinate amount of effort to find a solution to a legitimate need (your opinion to the contrary not withstanding) and the resulting code which should have been trivially simple and obvious, wasn't. It is a minor issue but the end result of experiences like this, although infrequent, is often "WTF, why is this simple and reasonable thing so hard to do?". And after a few times some programmers will start to wonder if maybe Python is not really an industrial-strength language -- one that they can be effective all the time, even when the problem falls outside the 95% demographic. (And I am not talking about things totally out of python's scope like high performance computing or systems programming.)
I wanted to do it because it was the correct design choice. The suggestion that to redesign an entire existing technical and personnel infrastructure to use utf-8, is a better choice is, well, never mind. It is not the place of language designers to intentionally make it hard to solve legitimate problems. There *are* other encodings in the world, there will be for sometime to come, and some programmers will sometimes have to deal with that. Non-utf-8 encodings are not so evil (except in the minds of some zealots) that working with them conveniently should be made difficult. (I am reminded of the Unix zealots of days past who refused to deal with Windows line endings.) The way I chose to deal with the encoding requirements I had was the correct way. It's unfortunate that Python makes it uglier than it should be. The discussion seems to be going off topic for this list. I understand there is no support here for providing a non- obscure, programmatic way of changing the encoding of the standard streams at program startup and that's fine, it was a suggestion. Thank you all for the feedback.

On 7 June 2012 03:34, Rurpy <rurpy@yahoo.com> wrote:
One suggestion, which would probably shed some light on whether this should be viewed as something "simple and reasonable", would be to do some research on how the same task would be achieved in other languages. I have no experience to contribute but my intuition says that this could well be hard on other languages too. Would you be willing to do some web searches to look for solutions in (say) Java, or C#, or Ruby? In theory, it shouldn't take long (as otherwise you can conclude that the solution is obscure to the same extent that it is with Python). Even better, if those other languages do have a simple solution, it may suggest an approach that would be appropriate for Python. Paul.

On 06/07/2012 12:27 AM, Paul Moore wrote:
Yes, that is a good idea. If I decide to reraise this suggestion at some point, I will try to do as you suggest.
I have no experience to contribute but my intuition says that this could well be hard on other languages too.
Again, I have yet to be convinced this is hard. I am very sceptical it is hard in the case of streams before they've been written or read. Replacing sys.stdout with a wrapper that encodes with the alternate encoding clearly works -- it just needs to be encapsulated so the user doesn't need to figure out all the details in order to use it.

The interpreter uses the standard streams internally, and they're one of the first things created during interpreter startup. User provided code doesn't start running until well after they're initialised. If user level code doesn't want those streams, it needs to replace them with something else. Cheers, Nick. -- Sent from my phone, thus the relative brevity :) On Jun 8, 2012 7:03 AM, "Rurpy" <rurpy@yahoo.com> wrote:

Rurpy writes:
I don't think I said the need was illegitimate, if I did I apologize, and I certainly don't believe it is (I'm an economist by trade -- de gustibus non est disputandum). I just don't think it's necessary for Python to try to address the problem, because the problem is somebody else's bad design at root. And I don't think it would be wise to try to do it in a very general way, because it's very hard to do that at the general level of the language.
You're wrong. There is *some* support for that. It just has to be done safely, and that means that a generic .set_encoding() method that can be called after I/O has been performed probably isn't going to happen. And it might not happen at the core level, since a 3-line function can do the job, it might make just as much sense to put up a package on PyPI.

On 06/07/2012 01:12 AM, Stephen J. Turnbull wrote:
I don't understand that argument. The world is full of bad design that Python has to address: daylight savings time, calendars, floating-point (according to some). Good/bad design is not even constant and changes with time. There is still a telnetlib module in stdlib despite the existence of ssh. I suspect the vast majority of programmers are interested in a language that allows them to *effectively* get done what they need to, whether they are working of the latest agile TTD REST server, or modifying some legacy text files. What I for one *don't* need is to have my programming language enforcing its idea of CS political correctness on me. Secondly, the disparity in ease of use of an alternate encoding on sts.stdout is not really between utf8 and non-utf8, it is between a default encoding (which may be non-utf8), and the encoding I wish to use. So one can't really attribute it to a desire to improve the world by making non-utf8 harder to use! And even were I to accept your argument, Python is inconsistent: when I open a file explicitly there is only a slight penalty for opening a non-default-encoded file (the need the explicitly give an encoding): f = open ("myfile", "w") # my default utf8 encoding print ("text string", file=f) vs f = open ("myfile", "w", encoding="sjis") # non-utf8 print ("text string", file=f) But for sys.stdout, the penalty for using an alternate encoding is to google around for a solution (which may not be optimal as Victor Stinner pointed out) and then read about codecs and the StreamWriter wrapper, textio wrappers and the .buffer() method. And the reading part is then repeated by all those (at the same level of python expertise) who read the program. All I can do is repeat what I said before: non-utf8 codings exist and are widely used. That's a simple fact. Sample some .jp web sites and look at the ratio of shift-jis web pages to utf-8 web pages for example. utf-8 is an encoding. shift-jis is an encoding. Sure, I understand that utf-8 is preferable and I will use it when possible. The fact that I am writing shift-jis means that utf-8 *isn't* possible in this case. Since utf-8 and shift-jis are both encodings and are equivalent from a coding viewpoint (a simple choice of which codec to use) the discrepancy in ease of use between the two in the case of writing to the standard streams is not justifiable and should be corrected if possible.
But is it? Or are you referring to switching encoding on-the-fly? (see below).
There are two sub-threads in this discussion 1) Providing a more convenient and discoverable way to programmatically change the encoding of std* streams before first use. 2) Changing the encoding used on the std* stream or any textio stream on the fly as a generalization of (1). I thought I made clear I was advocating for (1) and not (2) when I earlier wrote in reply to you:
As for (2), you have pointed out some potential issues with switching encodings midstream. I don't understand how codecs work in Python sufficiently yet to either agree or disagree with you. I have however questioned some of the statements made regarding its difficulty (and am holding my opinion open until I understand the issues better), but I am not (as I've stated) advocating for it now. Sorry if I failed to make the distinction clearer. My use of .set_encoding() as a placeholder for both ideas probably contributed to the confusion.
I wasn't suggesting a change to the core level (if by that you mean to the interpreter). I was asking if some way could be provided that is easier and more reliable than googling around for a magic incantation) to change the encoding of one or more of the already-open-when-my-program-starts sys.std* streams. I presume that would be a standard library change (in either the io or sys modules) and offered a .set_encoding() method as a placeholder for discussion. I hardly think it is worth the effort, for either the producer or consumers, of putting a 3-line function on PyPI. Nor would such a solution address the discoverability and ease-of-use problems I am complaining about. An inferior and bare minimum way to address this would be to at least add a note about how to change the encoding to the sys.std* documentation. That encourages cargo-cult programming and doesn't address the WTF effect but it is at least better than the current state of affairs.

On Thu, Jun 7, 2012 at 4:48 PM, Rurpy <rurpy@yahoo.com> wrote:
Others have raised the question this begs to have answered: how do other programming languages deal with wanting to change the encoding of the standard IO streams? Can you show us how they do things that's so much easier than what Python does?
The proper encoding for the standard IO streams is generally a property of the environment, and hence is set in the environment. You have a use case where that's not the case. The argument is that your use case isn't common enough to justify changing the standard library. Can you provide evidence to the contrary? Other languages that make setting the encoding on the standard streams easy, or applications outside of those built for your system that have a "--encoding" type flag?
Why presume that this needs a change in the library? The method is straightforward, if somewhat ugly. Is there any reason it can't just be documented, instead of added to the library? Changing the library would require a similar documentation change. <mike

On 8.06.2012 00:00, Mike Meyer wrote:
Mercurial: ... --debug enable debugging output --debugger start debugger --encoding ENCODE set the charset encoding (default: UTF-8) --encodingmode MODE set the charset encoding mode (default: strict) --traceback always print a traceback on exception ... Niki

On Thu, Jun 7, 2012 at 5:00 PM, Mike Meyer <mwm@mired.org> wrote:
Agreed. The problem is that your use case gets hit by several special cases at once. Usually, you don't need to worry about encodings at all; the default is sufficient. Obviously not the case for you. Usually, the answer is just to open a file (or stream) the way you want to. sys.stdout is special because you don't open it. If you do want to change sys.stdout, usually the answer is to replace it with a different object. Apparently (though I missed the reason why) that doesn't work for you, and you need to keep using the same underlying stream. So at that point, replacing it with a wrapped version of itself probably *is* the simplest solution. The remaining problem is how to find the least bad way of doing that. Your solution does work. Adding it as an example to the docs would probably be reasonable, but someone seems to have worked pretty hard at keeping the sys module documentation short. I could personally support a wrap function on the sys.std* streams that took care of flushing before wrapping, but ... there is a cost, in that the API gets longer, and therefore harder to learn.
There are plenty of applications with an encoding flag; I'm not sure how often it applies to sys.std*, as opposed to named files. -jJ

On 11 June 2012 15:21, Jim Jewett <jimjjewett@gmail.com> wrote:
I also think I missed something in this thread. At the beginning of the original thread it seemed that everyone was agreed that writer = codecs.getwriter(desired_encoding) sys.stdout = writer(sys.stdout.buffer) was a reasonable solution (with the caveat that it should happen before any output is written). Is there some reason why this is not a good approach? The only problem I know of is that under Python 2.x it becomes an error to print _already_ encoded strings (they get decoded as ascii before being encoded) but that's probably not a problem for an application that takes a disciplined approach to unicode.

Oscar Benjamin writes:
It's undocumented and unobvious, but it's needed for standard stream filtering in some environments -- where a lot of coding is done by people who otherwise never need to understand streams at anything but a superficial level -- and the analogous case of a newly opened file, pipe, or socket is documented and obvious, and usable by novices. It's damn shame that we can't say the same about the stdin, stdout, and stderr streams (even if I too have been at pains to explain why that's hard to fix).

On Tue, Jun 12, 2012 at 9:58 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I'm probably missing something, but in all my naivete I have what feels like a simple solution, and I can't seem to see what's wrong with it. In C there used to be a function to set the buffer size on an open stream that could only be called when the stream hadn't been used yet. ISTM the OP's use case would be covered by a similar function on an open TextIOWrapper to set the encoding that can only be used when it hasn't been used to write (or read) anything yet? When called under any other circumstances it should raise an error. The TextIOWrapper should maintain a "used" flag so that it can raise this exception reliably. This ought to work for stdin and stdout when used at the start of the program, assuming nothing is written by code run before main starts. (This should normally be fine, otherwise you couldn't use a Python program as a filter at all.) It won't work for stderr if connected to a tty-ish device (since the version stuff is written there) but that should be okay, and it should still be okay with stderr if it's not a tty, since then it starts silent. (But I don't think the use case is very strong for stderr anyway.) I'm not sure about a name, but it might well be called set_encoding(). The error message when misused should clarify to people who misunderstand the name that it can only be called when the stream hasn't been used yet; I don't think it's necessary to encode that information in the name. (C's setbuf() wasn't called set_buffer_on_virgin_stream() either. :-) I don't care about the integrity of the underlying binary stream. It's a binary stream, you can write whatever bytes you want to it. But if a TextIOWrapper is used properly, it won't write a mixture of encodings to the underlying binary stream, since you can only set the encoding before reading/writing a single byte. (And the TextIOWrapper is careful not to use the binary stream before the first actual read() or write() call -- it just tries to calls tell(), if it's seekable, which should be safe.) -- --Guido van Rossum (python.org/~guido)

On Wed, Jun 13, 2012 at 3:21 PM, Guido van Rossum <guido@python.org> wrote:
I think you're right, and such a method in combination with stream.buffer.peek() should actually handle a lot of encoding detection cases, too. The alternative approaches (calling TextIOWrapper on stream.detach(), or open on stream.fileno()) either break any references to the old stream or else create two independent IO stacks on top of a single underlying file descriptor, which may create some odd behaviour. Being able to set the encoding on a previously unused stream would also interact better with the existing subprocess PIPE API. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Guido van Rossum writes:
I'm not sure about a name, but it might well be called set_encoding().
I would still prefer "initialize_encoding" or something like that, but the main thing I was worried about was a "consenting adults" function that shouldn't be called after I/O, but *could* be.

On Wed, Jun 13, 2012 at 5:35 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I still don't understand why Python can't support using it after I/O. Is this code wrong? https://gist.github.com/3280063 -- INADA Naoki <songofacandy@gmail.com>

The write buffer can be flushed, so I don't see the problem of changing the encoding of stdout and stderr (except potential mojibake). For stdin, TextIOWrapper has a readahead algorithm, so changing the encoding may seek backward. It cannot be done if stdin is not seekable (ex: if stdin is a pipe). I wrote a Python implementation of set_encoding, see my patch attached to the issue. http://bugs.python.org/15216 Victor Le 7 août 2012 02:52, "INADA Naoki" <songofacandy@gmail.com> a écrit :

Rurpy writes:
Python is inconsistent:
Yup, and I said there is support for dealing with that inconsistency. At least I'm +1 and Nick's +0.5. So let's talk about what to do about it. Nick has a pretty good channel on the BFDL, and since he doesn't seem to like an addition to the stdlib here, it may not go far. But I don't see a reason to rule out stdlib changes yet. As far as I'm concerned, there are three reasonable proposals:
[S]ince a 3-line function can do the job, it might make just as much sense to put up a package on PyPI.
Agreed that it's pretty weak, but it's not clear that other solutions will be much better in practice. Discoverability depends on documentation, which can be written and improved. I think "ease of use" is way off-target.
Changing the stdlib is not a panacea. In particular, it can't be applied to older Pythons. I'm also not convinced (cf. Nick's post) that there's enough value-added and a good name for the restricted functionality we know we can provide.
IMO, this may be the best, but again I doubt it can be added to older versions. As for the "cargo cult" and "WTF" issues, I have little sympathy for either. The real WTF problem is that multi-encoding environments are inherently complex and irregular (ie, a WTF waiting to happen), and Python can't fix that. It's very unlikely that typical programmers will bother to understand what happens "under the hood" of a stdlib function/method, so that is no better than cargo-cult programming (and cargo-cult at least has the advantage that what is being done is explicit, allowing programmers who understand textio but not encodings to figure out what's happening).

On 06/08/2012 05:11 AM, Stephen J. Turnbull wrote:
Which were (summarizing, please correct if wrong):

1) A package on PyPI containing a function like

    import codecs

    def rewrap_stream_with_new_encoding(old_stream, encoding):
        new_stream = codecs.getwriter(encoding)(old_stream.buffer)
        return new_stream

   (or maybe three functions, one for each of the std* streams, without the 'old_stream' parameter?) A usage sketch follows after this list.

2) Modify the standard lib: add something like a .reset_encoding() method to io.TextIOWrapper? (Name and functionality to be bikeshedded to death.)

3) Modify the standard lib documentation (I assume for sys.std* as described below).

Also:

4?) Nathan Schneider suggested a hybrid of (1) and (2): put the function in the codecs module.
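[Editorial sketch of how proposal (1) would be used; the module name 'stdio_encoding' is made up, and the literal encoding stands in for the tool's --encoding option value.]

    import sys
    # 'stdio_encoding' is a hypothetical name for the PyPI package.
    from stdio_encoding import rewrap_stream_with_new_encoding

    # Near the top of the tool, before anything has been written:
    sys.stdout = rewrap_stream_with_new_encoding(sys.stdout, 'iso-8859-1')
    print('köttbullar')   # now encoded as ISO-8859-1 on the way out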
If (and when) I had the problem of figuring out how to change sys.stdout's encoding, PyPI would be (and was) the last place I'd look. It is just not the kind of problem one looks to a package to solve -- rather like looking on PyPI if you want to capitalize a string. Where I would look is where I did:

* The Python docs for the io module.
* Then the sys module docs for std*. They say how to change the buffering and how to change to binary. They also say how the default encoding is determined. For this reason, this is where I would put any note about changing the encoding.
* Finally, the internet.
* Had I not found an answer there, I would have posted to c.l.p.

I don't think I'd have looked on PyPI unless something explicitly pointed me there.
Discoverability depends on documentation, which can be written and improved.
Documentation where?
I think "ease of use" is way off-target.
I would think ease of use would always be a consideration in any API change users are exposed to. Or are you saying some APIs should be discouraged, and that making them hard to use is better than a "not recommended" note in the documentation? If so, I suspect we'll just have to agree to disagree on that. And in this case I don't even see any reason to recommend against it -- writing to sys.stdout is the best answer in the circumstances I've described.
Nothing is ever a panacea. It seems like it could be the cleanest, nicest long-term solution, but clearly the most difficult.
Does it need to be? I'd have thought this would just be a doc issue on the tracker (although perhaps getting agreement on the wording would be hard?)
But the WTF comes not from multi-encoding (in which case it would have occurred when the problem requirements were received) but from observing that doing the necessary output to a file is easy as pie, while doing the same to stdout (another file) isn't. Python can avoid making a less-than-ideal situation (multi-encoding) worse by not making it harder than necessary to do what needs to be done.
The point though is that programmers don't need to look under the hood -- the fact that something is in the stdlib means (at least ideally) it is documented as a black box. What goes in, what comes out, the relationship between the two, and any side effects are all concisely, fully and accurately described (again, in an ideal world). But with a code snippet and a comment that says, "use this to change the encoding of sys.stdout", the programmer has to figure out everything himself. (Of course that's not totally bad -- I know a lot more about text IO streams than I did 3 days ago. :-) Sure, you could document the code snippet as well as a packaged function, but that's stretching our ideal world well past the breaking point -- it doesn't happen. :-)
True, it's a double-edged sword, but I prefer to use code packaged in the stdlib. If I didn't, I would cut and paste from there, and I don't. :-) Also, there are programmers who understand encodings but not textio (I'm one), but I'll concede we are probably a minority.

On 06/05/2012 01:37 PM, Stephen J. Turnbull wrote:
I'm not sure why stateful matters. When you change encoding you discard whatever state exists and start with the new encoder in its initial state. If there is a partially en/decoded character, then wouldn't you do the same thing you'd do if the same condition arose at EOF?
You are correct that my current concern is reinitializing the encoding(s) of the sys.std* streams prior to doing any operations with them. I thought that changing the encoding at any point would be a straightforward generalization. However, I have in the past encountered programs that output mixed encodings in two contexts: generating test data (I think it was for automatic detection and extraction of information), and bundling multiple differently-encoded data sets into one package that were pulled apart again downstream. That both uses probably could have been designed better is irrelevant; a hypothetical Python programmer's job would have been to produce a Python program that fit into the existing processes. However, I don't want to dwell on this because it is not my main concern now; I thought I would just mention it for the record.

Rurpy writes:
I'm not sure why stateful matters. When you change encoding you discard whatever state exists
How do you know what *I* want to do? Silently discarding buffer contents would suck.
If there is a partially en/decoded character, then wouldn't you do the same thing you'd do if the same condition arose at EOF?
Again speaking for *myself*, almost certainly not. On input, if it happens *before* EOF it's incomplete input, and I should wait for it to be completed. If it happens on output, there's a bug somewhere, and I probably want to do some kind of error recovery.
No, it's not irrelevant that it's bad design. Python should not go out of its way to cater to bad design, if bad design can be worked around with existing facilities. Here there are at least two ways to do it: the method of changing sys.std*'s text encoding that you posted, and switching sys.std* to binary and doing explicit encoding and decoding of the strings to be input or output. I have also encountered mixed encodings in my students' filesystems (it was not uncommon to see /home/j.r.exchangestudent/KOI8-R/SHIFT_JIS and similar). That doesn't mean it should be made easier to generate!
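[Editorial sketch, not part of the thread: the second workaround Stephen mentions -- bypass the text layer and encode explicitly at each write site, as footnote [*1] of the original post also notes. The encoding names are just examples.]

    import sys

    data = 'köttbullar\n'
    # Write through the binary interface, encoding explicitly each time.
    sys.stdout.buffer.write(data.encode('iso-8859-1'))
    sys.stdout.buffer.flush()

    # Reading works the same way in reverse:
    # line = sys.stdin.buffer.readline().decode('shift_jis')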

2012/6/5 Rurpy <rurpy@yahoo.com>:
What happens if the specified encoding is different than the encoding of the console? Mojibake? If the output is used as the input of another program, does the other program use the same encoding? In my experience, using an encoding different than the locale encoding for input/output (stdout, environment variables, command line arguments, etc.) causes various issues. So I'm curious about your use cases.
In Python 3, you should use io.TextIOWrapper instead of codecs.StreamWriter. It's more efficient and has fewer bugs.
What I want to be able to put there instead is:
sys.stdout.set_encoding (opts.encoding)
I don't think that your use case merits a new method on io.TextIOWrapper: replacing sys.stdout does work and should be used instead. TextIOWrapper is generic, and your use case is specific to the sys.std* streams. It would be surprising to change the encoding of an arbitrary file after it is opened. At least, I don't see the use case. For example, tokenize.open() opens a Python source code file with the right encoding. It starts by reading the file in binary mode to detect the encoding, and then uses TextIOWrapper to get a text file without having to reopen the file. It would be possible to start with a text file and then change the encoding, but it would be less elegant.
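[Editorial sketch: roughly the pattern Victor describes, as a simplified illustration of what tokenize.open() does rather than its actual source.]

    import io
    import tokenize

    def open_python_source(path):
        buffer = open(path, 'rb')
        try:
            # Read the start of the file in binary mode to find the coding
            # declaration (or BOM), then rewind and wrap the same buffer in
            # a text layer -- no second open() needed.
            encoding, _ = tokenize.detect_encoding(buffer.readline)
            buffer.seek(0)
            return io.TextIOWrapper(buffer, encoding=encoding,
                                    line_buffering=True)
        except Exception:
            buffer.close()
            raise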
sys.stdout = codecs.getwriter(opts.encoding)(sys.stdout.buffer)
You should also flush sys.stdout (and maybe also sys.stdout.buffer) before replacing it.
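[Editorial sketch putting Victor's two suggestions together; 'utf-16' stands in for the opts.encoding value from the original post.]

    import io
    import sys

    # Flush both the text layer and, for good measure, the underlying
    # binary buffer, so everything written so far goes out in the old
    # encoding before the switch.
    sys.stdout.flush()
    sys.stdout.buffer.flush()

    # Rewrap the same buffer with io.TextIOWrapper rather than
    # codecs.StreamWriter.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-16',
                                  line_buffering=True)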
It's maybe difficult to change the encoding of sys.stdout at runtime because it is NOT a good idea :-)
Replacing sys.std* works but has issues: for example, output written before the replacement is encoded with a different encoding. The best way is to change your locale encoding (using the LC_ALL, LC_CTYPE or LANG environment variable on UNIX), or simply to set the PYTHONIOENCODING environment variable.
Ah? Detect if PYTHONIOENCODING is present (or if sys.stdout.encoding is the requested encoding), if not: restart the program with PYTHONIOENCODING=encoding.
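[Editorial sketch of that relaunch trick; the helper name and the codecs.lookup() comparison are just one way to write the check, not anything from the thread.]

    import codecs
    import os
    import sys

    def ensure_io_encoding(wanted):
        # Compare canonical codec names so 'UTF-8' and 'utf8' match.
        if codecs.lookup(sys.stdout.encoding).name == codecs.lookup(wanted).name:
            return
        # Relaunch the program with PYTHONIOENCODING set so the interpreter
        # builds the sys.std* streams with the requested encoding.
        os.environ['PYTHONIOENCODING'] = wanted
        os.execv(sys.executable, [sys.executable] + sys.argv)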
Overloading print() is obscure because it requires reader to notice print was overloaded.
Why not write the output to a file instead of stdout?

Victor
participants (16)
- Amaury Forgeot d'Arc
- Chris Rebert
- Guido van Rossum
- INADA Naoki
- Jim Jewett
- Masklinn
- Mike Meyer
- MRAB
- Nick Coghlan
- Niki Spahiev
- Oscar Benjamin
- Paul Moore
- Rurpy
- Simon Sapin
- Stephen J. Turnbull
- Victor Stinner