
Greetings,

Today, while trying to internationalize a program I'm working on, I found an interesting side-effect of how we deal with the encoding of unicode strings when they are written to files. Consider the following example:

    # -*- encoding: iso-8859-1 -*-
    print u"á"

This correctly prints the string 'á', as expected. What surprises me is that the following code won't work in an equivalent way (unless sys.setdefaultencoding() is used):

    # -*- encoding: iso-8859-1 -*-
    import sys
    sys.stdout.write(u"á\n")

This raises the following error:

    Traceback (most recent call last):
      File "asd.py", line 3, in ?
        sys.stdout.write(u"á")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

This difference can become a really annoying problem when internationalizing programs, since it's common for third-party code to write to sys.stdout instead of using 'print'. The standard optparse module, for instance, holds a reference to sys.stdout which it uses in the default --help handling mechanism.

Given that files have an 'encoding' attribute, and that any unicode string with characters outside the 0-127 range raises an exception when written to a file, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?

The workaround for the problem is either to use the evil-considered sys.setdefaultencoding(), or to wrap sys.stdout. IMO, both options seem unreasonable for such a common idiom.

-- Gustavo Niemeyer http://niemeyer.net

On Nov 29, 2004, at 2:04 PM, Gustavo Niemeyer wrote:
That doesn't work here, where sys.getdefaultencoding() is 'ascii', as expected.
That's expected.
No, because you don't know it's a file. You're calling a function with a unicode object. The function doesn't know that the object was some unicode object that came from a source file of some particular encoding.
There's no guaranteed correlation whatsoever between the claimed encoding of your source document and the encoding of the user's terminal, so why do you want there to be one? What if you have some source files in 'foo' encoding and others in 'bar' encoding? What about ascii-encoded source documents that use escape sequences to represent non-ascii characters? What you want doesn't make any sense so long as Python strings and file objects deal in bytes, not characters :)

Wrapping sys.stdout is the ONLY reasonable solution. This is the idiom that I use; it's painless and works quite well:

    import sys
    import codecs
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

-bob
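Bob's idiom can be exercised without touching the real sys.stdout by wrapping an in-memory byte stream instead. A minimal sketch (the BytesIO stand-in for the terminal is an assumption for demonstration purposes):

```python
import codecs
import io

# Stand-in for the terminal's byte stream (sys.stdout in Bob's idiom).
buf = io.BytesIO()

# Wrap it exactly as Bob wraps sys.stdout: the StreamWriter encodes
# unicode input to UTF-8 bytes before passing it to the wrapped stream.
out = codecs.getwriter('utf-8')(buf)
out.write(u"á\n")

print(buf.getvalue())  # b'\xc3\xa1\n'
```

The same wrapper applied to the real sys.stdout makes every downstream write() of a unicode string encode transparently, which is why it also fixes code like optparse that writes to sys.stdout directly.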

Hello Bob, [...]
I don't understand what you're saying here. The file knows it is a file. The write function knows the parameter is unicode.
I don't. I want the write() function of file objects to respect the encoding attribute of these objects. This is already being done when print is used. I'm proposing to extend that behavior to the write function. That's all.
Please, take a long breath, and read my message again. :-)
Wrapping sys.stdout is the ONLY reasonable solution. [...]
No, it's not. But I'm glad to know other people are also working around that problem. -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
In general, files don't have an encoding parameter - sys.stdout is an exception. The reason why this works for print and not for write is that I considered "print unicodeobject" important, and wanted to implement that. file.write is an entirely different code path, so it doesn't currently consider Unicode objects; instead, it only supports strings (or, more generally, buffers).
Apparently, it wasn't important enough for somebody to have analysed this and offered a patch. In any case, it would be quite unreliable to pass unicode strings to .write even *if* .write supported .encoding, since most files don't have .encoding. Even sys.stdout does not always have .encoding - only when it is a terminal, and only if we managed to find out what the encoding of the terminal is. Regards, Martin
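Martin's caveat is easy to probe: ordinary byte streams carry no encoding attribute at all, and what sys.stdout reports depends on the platform and on how the interpreter was started. A small sketch (the getattr probe is illustrative; exact stdout behaviour varies):

```python
import io
import sys

# A plain byte stream has no notion of a text encoding:
print(getattr(io.BytesIO(), 'encoding', None))  # None

# sys.stdout may or may not report one, depending on whether it is
# attached to a terminal and whether its encoding could be detected:
print(getattr(sys.stdout, 'encoding', None))
```

This is exactly why a .write that silently consulted .encoding would behave differently from file to file: on most streams there is simply no encoding to consult.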

That's the only case I'd like to solve. If there are platforms that don't know how to set it, we could make the encoding attribute writable, allowing people to easily set it to whatever encoding is deemed correct on their systems.
I understand your reasoning behind it, and would like to extend your idea to the write function, allowing anyone to use the common sys.stdout idiom to implement print-like functionality (as optparse and many others do). For normal files, the absence of the encoding parameter would preserve the current behavior.
That's what I'm doing here! :-)
I think that's acceptable. The encoding parameter is meant for output streams, and Python does its best to try to find a reasonable value for showing output strings. Thanks for your answer and clarifications, -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
You are mixing things up here: the source encoding is meant for the parser and defines the way Unicode literals are converted into Unicode objects. The encoding used on the stdout stream doesn't have anything to do with the source code encoding and has to be handled differently.

The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamWriter. Using sys.setdefaultencoding() is *not* the right solution to the problem.

In general, when writing programs targeted for i18n, you should use Unicode for all text data and convert from Unicode to 8-bit only at the IO/UI layer. The various wrappers in the codecs module make this rather easy.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 30 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

[...]
Sorry. I probably wasn't clear enough in my message. I understand the issue, and I'm not discussing source encoding at all. The only problem I'd like to solve is that of output streams not being able to have unicode strings written.
The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamEncoder.
I don't see that as a good solution, since every piece of Python software that is internationalized will have to figure out this wrapping, introducing unnecessary extra overhead.
Using sys.setdefaultencoding() is *not* the right solution to the problem.
I understand.
That's what I think as well. I would just expect Python to be kind enough to let me say which output encoding I want, instead of making me wrap the sys.stdout object with a non-native file. IOW, since it is so widely needed, handling internationalization without wrapping sys.stdout every time seems like a good step for a language like Python.
The various wrappers in the codecs module make this rather easy.
Thanks for the suggestion! -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
This wrapping is probably necessary for stateful encodings. If sys.stdout.encoding were "utf-16", print would probably add the BOM every time a unicode object is printed. This doesn't happen if you wrap sys.stdout in a StreamWriter.
You can't have stateful encodings without something that keeps state. The only thing that does keep state in Python is a StreamReader/StreamWriter. Bye, Walter Dörwald
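Walter's point about stateful encodings can be illustrated with UTF-16: encoding each chunk independently restarts the codec and repeats the byte order mark, while a StreamWriter keeps encoder state across write() calls. A sketch against an in-memory stream:

```python
import codecs
import io

# Encoding each chunk independently restarts the codec, so every
# chunk carries its own 2-byte BOM: (2 + 4) + (2 + 4) = 12 bytes.
separate = u"ab".encode('utf-16') + u"cd".encode('utf-16')

# A StreamWriter keeps state between writes, so the BOM is emitted
# only once, at the start of the stream: 2 + 8 = 10 bytes.
buf = io.BytesIO()
writer = codecs.getwriter('utf-16')(buf)
writer.write(u"ab")
writer.write(u"cd")
streamed = buf.getvalue()

print(len(separate), len(streamed))  # 12 10
```

This is exactly the state that a naive per-call `unicodeobject.encode(file.encoding)` inside file.write could not keep, which is Walter's objection to doing the encoding there.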

Hello Walter,
I'm not sure this is an issue for a terminal output stream, which is the case I'm trying to find a solution for. Otherwise, Python would already be in trouble for using this scheme in the print statement. Can you show an example of the print statement not working? -- Gustavo Niemeyer http://niemeyer.net

Gustavo Niemeyer wrote:
I don't see any unnecessary overhead, and using the wrappers is really easy, e.g.:

    #
    # Application uses Latin-1 for I/O, terminal uses UTF-8
    #
    import codecs, sys

    # Make stdout translate Latin-1 output into UTF-8 output
    sys.stdout = codecs.EncodedFile(sys.stdout, 'latin-1', 'utf-8')

    # Make stdin translate UTF-8 input into Latin-1 input
    sys.stdin = codecs.EncodedFile(sys.stdin, 'latin-1', 'utf-8')

We should probably extend the support in StreamRecoder (which is used by the above EncodedFile helper) to also support Unicode input to .write(), and add a special codec 'unicode' that converts Unicode to Unicode, so that you can request the EncodedFile object to return Unicode from .read().

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 01 2004)
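The EncodedFile recoding can be seen end to end by substituting an in-memory byte stream for the terminal (the BytesIO stand-in is an assumption for demonstration; EncodedFile's signature is (stream, data_encoding, file_encoding)):

```python
import codecs
import io

# Stand-in for a terminal whose native encoding is UTF-8.
terminal = io.BytesIO()

# The application writes Latin-1 bytes; EncodedFile decodes them as
# Latin-1 and re-encodes to UTF-8 before they reach the "terminal".
stream = codecs.EncodedFile(terminal, 'latin-1', 'utf-8')
stream.write(u'á'.encode('latin-1'))  # one Latin-1 byte: 0xE1
stream.flush()

print(terminal.getvalue())  # b'\xc3\xa1' - the UTF-8 form of 'á'
```

Note that .write() here still takes bytes in the data encoding, not unicode objects - which is precisely the gap MAL suggests closing by extending StreamRecoder.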

participants (5)
- "Martin v. Löwis"
- Bob Ippolito
- Gustavo Niemeyer
- M.-A. Lemburg
- Walter Dörwald