
Greetings,
Today, while trying to internationalize a program I'm working on, I found an interesting side effect of how we handle the encoding of unicode strings when they are written to files.
Suppose the following example:
# -*- encoding: iso-8859-1 -*-
print u"á"
This will correctly print the string 'á', as expected. Now, what surprises me is that the following code won't work in an equivalent way (unless sys.setdefaultencoding() is used):
# -*- encoding: iso-8859-1 -*-
import sys
sys.stdout.write(u"á\n")
This will raise the following error:
Traceback (most recent call last):
  File "asd.py", line 3, in ?
    sys.stdout.write(u"á")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
This difference may become a really annoying problem when trying to internationalize programs, since it's usual to see third-party code dealing with sys.stdout, instead of using 'print'. The standard optparse module, for instance, has a reference to sys.stdout which is used in the default --help handling mechanism.
Given the fact that files have an 'encoding' parameter, and that any unicode strings with characters not in the 0-127 range will raise an exception if being written to files, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?
The workaround for that problem is either to use sys.setdefaultencoding(), which is considered evil, or to wrap sys.stdout. IMO, both options seem unreasonable for such a common idiom.
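For reference, the setdefaultencoding() route looks roughly like this on Python 2.x (a sketch, not a recommendation; the encoding name is just an assumption about the terminal):

import sys
reload(sys)                            # site.py removes setdefaultencoding() at startup
sys.setdefaultencoding('iso-8859-1')   # assumed terminal encoding
sys.stdout.write(u"á\n")               # now implicitly encoded with the new default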

On Nov 29, 2004, at 2:04 PM, Gustavo Niemeyer wrote:
Today, while trying to internationalize a program I'm working on, I found an interesting side effect of how we handle the encoding of unicode strings when they are written to files.
Suppose the following example:
# -*- encoding: iso-8859-1 -*-
print u"á"
This will correctly print the string 'á', as expected. Now, what surprises me is that the following code won't work in an equivalent way (unless sys.setdefaultencoding() is used):
That doesn't work here, where sys.getdefaultencoding() is 'ascii', as expected.
# -*- encoding: iso-8859-1 -*-
import sys
sys.stdout.write(u"á\n")
This will raise the following error:
Traceback (most recent call last):
  File "asd.py", line 3, in ?
    sys.stdout.write(u"á")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
That's expected.
This difference may become a really annoying problem when trying to internationalize programs, since it's usual to see third-party code dealing with sys.stdout, instead of using 'print'. The standard optparse module, for instance, has a reference to sys.stdout which is used in the default --help handling mechanism.
Given the fact that files have an 'encoding' parameter, and that any unicode strings with characters not in the 0-127 range will raise an exception if being written to files, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?
No, because you don't know it's a file. You're calling a function with a unicode object. The function doesn't know that the object was some unicode object that came from a source file of some particular encoding.
The workaround for that problem is either to use sys.setdefaultencoding(), which is considered evil, or to wrap sys.stdout. IMO, both options seem unreasonable for such a common idiom.
There's no guaranteed correlation whatsoever between the claimed encoding of your source document and the encoding of the user's terminal, so why do you want there to be? What if you have some source files with 'foo' encoding and others with 'bar' encoding? What about ascii-encoded source documents that use escape sequences to represent non-ascii characters? What you want doesn't make any sense so long as Python strings and file objects deal in bytes, not characters :)
Wrapping sys.stdout is the ONLY reasonable solution.
This is the idiom that I use. It's painless and works quite well:
import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
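With that in place (assuming a UTF-8 terminal), the failing example from the original message works unchanged:

import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdout.write(u"á\n")   # the StreamWriter encodes the unicode object to UTF-8 bytes
print u"á"                 # print keeps working through the wrapper as well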
-bob

Hello Bob,
[...]
Given the fact that files have an 'encoding' parameter, and that any unicode strings with characters not in the 0-127 range will raise an exception if being written to files, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?
No, because you don't know it's a file. You're calling a function with a unicode object. The function doesn't know that the object was some unicode object that came from a source file of some particular encoding.
I don't understand what you're saying here. The file knows it is a file. The write function knows the parameter is unicode.
The workaround for that problem is either to use sys.setdefaultencoding(), which is considered evil, or to wrap sys.stdout. IMO, both options seem unreasonable for such a common idiom.
There's no guaranteed correlation whatsoever between the claimed encoding of your source document and the encoding of the user's terminal, so why do you want there to be? What if you have some source
I don't. I want the write() function of file objects to respect the encoding attribute of these objects. This is already being done when print is used. I'm proposing to extend that behavior to the write function. That's all.
files with 'foo' encoding and others with 'bar' encoding? What about ascii-encoded source documents that use escape sequences to represent non-ascii characters? What you want doesn't make any sense so long as Python strings and file objects deal in bytes, not characters :)
Please, take a long breath, and read my message again. :-)
Wrapping sys.stdout is the ONLY reasonable solution.
[...]
No, it's not. But I'm glad to know other people are also doing workarounds for that problem.

Gustavo Niemeyer wrote:
Given the fact that files have an 'encoding' parameter, and that any unicode strings with characters not in the 0-127 range will raise an exception if being written to files, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?
In general, files don't have an encoding parameter - sys.stdout is an exception.
The reason why this works for print and not for write is that I considered "print unicodeobject" important, and wanted to implement that. file.write is an entirely different code path, so it doesn't currently consider Unicode objects; instead, it only supports strings (or, more generally, buffers).
This difference may become a really annoying problem when trying to internationalize programs, since it's usual to see third-party code dealing with sys.stdout, instead of using 'print'.
Apparently, it isn't important enough for somebody to have analysed this and offered a patch. In any case, it would be quite unreliable to pass unicode strings to .write even *if* .write supported .encoding, since most files don't have .encoding. Even sys.stdout does not always have .encoding - only when it is a terminal, and only if we managed to find out what the encoding of the terminal is.
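Code that wants to rely on it today therefore has to check for the attribute itself, along these lines (a sketch; the 'ascii' fallback is only an assumption):

import sys

def write_unicode(stream, text):
    # Use the stream's declared encoding if it exists and is set,
    # otherwise fall back to ASCII (i.e. today's behaviour).
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    stream.write(text.encode(encoding))

write_unicode(sys.stdout, u"á\n")   # only succeeds if the terminal encoding was detected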
Regards, Martin

Gustavo Niemeyer wrote:
Given the fact that files have an 'encoding' parameter, and that any unicode strings with characters not in the 0-127 range will raise an exception if being written to files, isn't it reasonable to respect the 'encoding' attribute whenever writing data to a file?
In general, files don't have an encoding parameter - sys.stdout is an exception.
That's the only case I'd like to solve.
If there are platforms that don't know how to set it, we could make the encoding attribute writable, and that would allow people to easily set it to the encoding which is deemed correct in their systems.
The reason why this works for print and not for write is that I considered "print unicodeobject" important, and wanted to implement that. file.write is an entirely different code path, so it doesn't currently consider Unicode objects; instead, it only supports strings (or, more generally, buffers).
I understand your reasoning behind it, and would like to extend your idea to the write function, allowing anyone to use the common sys.stdout idiom to implement print-like functionality (like optparse and many others). For normal files, the absence of the encoding parameter would ensure the current behavior.
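In current Python that behaviour can only be approximated from the outside, e.g. with a wrapper such as this hypothetical sketch of the proposed semantics:

import sys

class EncodingAwareStream:
    # Encode unicode with the wrapped stream's .encoding when present;
    # keep the current behaviour otherwise.
    def __init__(self, stream):
        self._stream = stream
    def write(self, data):
        encoding = getattr(self._stream, 'encoding', None)
        if isinstance(data, unicode) and encoding:
            data = data.encode(encoding)
        self._stream.write(data)
    def __getattr__(self, name):
        return getattr(self._stream, name)

sys.stdout = EncodingAwareStream(sys.stdout)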
This difference may become a really annoying problem when trying to internationalize programs, since it's usual to see third-party code dealing with sys.stdout, instead of using 'print'.
Apparently, it isn't important enough for somebody to have analysed this and offered a patch. In any case, it would be quite unreliable to
That's what I'm doing here! :-)
pass unicode strings to .write even *if* .write supported .encoding, since most files don't have .encoding. Even sys.stdout does not always have .encoding - only when it is a terminal, and only if we managed to find out what the encoding of the terminal is.
I think that's acceptable. The encoding parameter is meant for output streams, and Python does its best to find a reasonable value for showing output strings.
Thanks for your answer and clarifications,

Gustavo Niemeyer wrote:
Greetings,
Today, while trying to internationalize a program I'm working on, I found an interesting side effect of how we handle the encoding of unicode strings when they are written to files.
Suppose the following example:
# -*- encoding: iso-8859-1 -*-
print u"á"
This will correctly print the string 'á', as expected. Now, what surprises me is that the following code won't work in an equivalent way (unless sys.setdefaultencoding() is used):
# -*- encoding: iso-8859-1 -*-
import sys
sys.stdout.write(u"á\n")
This will raise the following error:
Traceback (most recent call last):
  File "asd.py", line 3, in ?
    sys.stdout.write(u"á")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
This difference may become a really annoying problem when trying to internationalize programs, since it's usual to see third-party code dealing with sys.stdout, instead of using 'print'. The standard optparse module, for instance, has a reference to sys.stdout which is used in the default --help handling mechanism.
You are mixing things here:
The source encoding is meant for the parser and defines the way Unicode literals are converted into Unicode objects.
The encoding used on the stdout stream doesn't have anything to do with the source code encoding and has to be handled differently.
The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamWriter.
Using sys.setdefaultencoding() is *not* the right solution to the problem.
In general, when writing programs that are targeted for i18n, you should use Unicode for all text data and convert from Unicode to 8-bit only at the IO/UI layer.
The various wrappers in the codecs module make this rather easy.
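In practice that boils down to something like this sketch (the file name and encodings below are just placeholders):

import codecs, sys

# Encode only at the I/O boundary; everything in between stays unicode.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

greeting = u"á"          # internal text data is unicode throughout
print greeting

out = codecs.open('greeting.txt', 'w', encoding='iso-8859-1')
out.write(greeting)      # encoded to Latin-1 only when it hits the file
out.close()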

[...]
You are mixing things here:
The source encoding is meant for the parser and defines the way Unicode literals are converted into Unicode objects.
The encoding used on the stdout stream doesn't have anything to do with the source code encoding and has to be handled differently.
Sorry. I probably wasn't clear enough in my message. I understand the issue, and I'm not discussing source encoding at all. The only problem I'd like to solve is that unicode strings cannot be written directly to output streams.
The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamWriter.
I don't see that as a good solution, since every Python program that is internationalized will have to figure out this wrapping, introducing extra overhead unnecessarily.
Using sys.setdefaultencoding() is *not* the right solution to the problem.
I understand.
In general, when writing programs that are targeted for i18n, you should use Unicode for all text data and convert from Unicode to 8-bit only at the IO/UI layer.
That's what I think as well. I would just expect Python to be kind enough to let me tell it which output encoding I want, instead of wrapping the sys.stdout object in a non-native file object.
IOW, since this is so widely needed, handling internationalization without wrapping sys.stdout every time seems like a good step for a language like Python.
The various wrappers in the codecs module make this rather easy.
Thanks for the suggestion!

Gustavo Niemeyer wrote:
[...]
You are mixing things here:
The source encoding is meant for the parser and defines the way Unicode literals are converted into Unicode objects.
The encoding used on the stdout stream doesn't have anything to do with the source code encoding and has to be handled differently.
Sorry. I probably wasn't clear enough in my message. I understand the issue, and I'm not discussing source encoding at all. The only problem I'd like to solve is that unicode strings cannot be written directly to output streams.
The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamWriter.
I don't see that as a good solution, since every Python program that is internationalized will have to figure out this wrapping, introducing extra overhead unnecessarily.
This wrapping is probably necessary for stateful encodings. If you had a sys.stdout.encoding=="utf-16", print would probably add the BOM every time a unicode object is printed. This doesn't happen if you wrap sys.stdout in a StreamWriter.
[...] That's what I think as well. I would just expect Python to be kind enough to let me tell it which output encoding I want, instead of wrapping the sys.stdout object in a non-native file object.
IOW, since this is so widely needed, handling internationalization without wrapping sys.stdout every time seems like a good step for a language like Python.
You can't have stateful encodings without something that keeps state. The only thing that does keep state in Python is a StreamReader/StreamWriter.
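A small sketch of the difference, using an in-memory stream instead of a terminal:

import codecs, StringIO

buf = StringIO.StringIO()
writer = codecs.getwriter('utf-16')(buf)
writer.write(u"a")
writer.write(u"b")
# The StreamWriter remembers that the BOM has already been written,
# so the buffer holds a single BOM followed by both characters.
print repr(buf.getvalue())

# Encoding each string separately keeps no state:
# every call produces its own BOM.
print repr(u"a".encode('utf-16') + u"b".encode('utf-16'))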
Bye, Walter Dörwald

Hello Walter,
I don't see that as a good solution, since every Python program that is internationalized will have to figure out this wrapping, introducing extra overhead unnecessarily.
This wrapping is probably necessary for stateful encodings. If you had a sys.stdout.encoding=="utf-16", print would probably add the BOM every time a unicode object is printed. This doesn't happen if you wrap sys.stdout in a StreamWriter.
I'm not sure this is an issue for a terminal output stream, which is the case I'm trying to find a solution for. Otherwise, Python would already be in trouble for using this scheme in the print statement. Can you show an example of the print statement not working?

Gustavo Niemeyer wrote:
Hello Walter,
I don't see that as a good solution, since every Python program that is internationalized will have to figure out this wrapping, introducing extra overhead unnecessarily.
This wrapping is probably necessary for stateful encodings. If you had a sys.stdout.encoding=="utf-16", print would probably add the BOM every time a unicode object is printed. This doesn't happen if you wrap sys.stdout in a StreamWriter.
I'm not sure this is an issue for a terminal output stream, which is the case I'm trying to find a solution for. Otherwise, Python would already be in trouble for using this scheme in the print statement. Can you show an example of the print statement not working?
No, I can't. Python doesn't accept UTF-16 as encoding.
This works:
$ LANG=de_DE.UTF-8 python2.4
Python 2.4 (#1, Nov 30 2004, 14:16:24)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-113)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
This doesn't:
$ LANG=de_DE.UTF-16 python2.4
Python 2.4 (#1, Nov 30 2004, 14:16:24)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-113)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
Bye, Walter Dörwald

Gustavo Niemeyer wrote:
[...]
The idiom presented by Bob is the right way to go: wrap sys.stdout with a StreamWriter.
I don't see that as a good solution, since every Python program that is internationalized will have to figure out this wrapping, introducing extra overhead unnecessarily.
I don't see any unnecessary overhead and using the wrappers is really easy, e.g.:
#
# Application uses Latin-1 for I/O, terminal uses UTF-8
#
import codecs, sys
# Make stdout translate Latin-1 output into UTF-8 output
sys.stdout = codecs.EncodedFile(sys.stdout, 'latin-1', 'utf-8')
# Have stdin translate Latin-1 input into UTF-8 input
sys.stdin = codecs.EncodedFile(sys.stdin, 'utf-8', 'latin-1')
We should probably extend the support in StreamRecoder (which is used by the above EncodedFile helper) to also accept Unicode input to .write(), and add a special codec 'unicode' that converts Unicode to Unicode, so that you can request the EncodedFile object to return Unicode from .read().
participants (5):
- "Martin v. Löwis"
- Bob Ippolito
- Gustavo Niemeyer
- M.-A. Lemburg
- Walter Dörwald