Gustavo Niemeyer wrote:
Today, while trying to internationalize a program I'm working on, I found an interesting side effect of how unicode strings are encoded when written to files.
Consider the following example:
    # -*- encoding: iso-8859-1 -*-
    print u"á"
This will correctly print the string 'á', as expected. Now, what surprises me is that the following code won't work in an equivalent way (unless sys.setdefaultencoding() is used):
    # -*- encoding: iso-8859-1 -*-
    import sys
    sys.stdout.write(u"á\n")
This will raise the following error:
    Traceback (most recent call last):
      File "asd.py", line 3, in ?
        sys.stdout.write(u"á")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
This difference may become a really annoying problem when trying to internationalize programs, since it's common for third-party code to write to sys.stdout directly instead of using 'print'. The standard optparse module, for instance, holds a reference to sys.stdout which is used in the default --help handling mechanism.
You are mixing things here:
The source encoding is meant for the parser and defines the way Unicode literals are converted into Unicode objects.
The encoding used on the stdout stream doesn't have anything to do with the source code encoding and has to be handled differently.
The idiom presented by Bob is the right way to go: wrap sys.stdout with a codecs StreamWriter.
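A minimal sketch of that wrapping idiom, using codecs.getwriter() (which returns a StreamWriter class for a given encoding). An in-memory byte stream stands in for sys.stdout here so the snippet behaves the same regardless of terminal setup; in a real program you would wrap sys.stdout itself:

```python
import codecs
import io

# Stand-in for a byte-oriented sys.stdout.
byte_stream = io.BytesIO()

# getwriter() returns a StreamWriter class for the encoding; instantiating
# it around the byte stream yields a file-like object that accepts Unicode
# and encodes each write transparently.
writer = codecs.getwriter('iso-8859-1')(byte_stream)
writer.write(u'\xe1\n')  # u"á" plus a newline

# The underlying stream received the single Latin-1 byte 0xE1.
assert byte_stream.getvalue() == b'\xe1\n'
```

With sys.stdout wrapped this way, code such as optparse's --help output, which writes to sys.stdout behind your back, gets the correct encoding for free.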
Using sys.setdefaultencoding() is *not* the right solution to the problem.
In general, when writing programs targeted for i18n, you should use Unicode for all text data and convert from Unicode to 8-bit only at the IO/UI layer.
The various wrappers in the codecs module make this rather easy.
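For example, codecs.open() applies the same pattern to files: the program deals only in Unicode, and encoding and decoding happen at the I/O boundary (the temporary file path here is just for illustration):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Keep text as Unicode internally; encode only when it hits the file.
out = codecs.open(path, 'w', encoding='iso-8859-1')
out.write(u'\xe1\n')   # stored as the single Latin-1 byte 0xE1
out.close()

# Reading through the same wrapper decodes back to Unicode.
inp = codecs.open(path, 'r', encoding='iso-8859-1')
text = inp.read()
inp.close()

assert text == u'\xe1\n'
```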