[Python-Dev] File encodings
Bob Ippolito
bob at redivi.com
Tue Nov 30 04:45:26 CET 2004
On Nov 29, 2004, at 2:04 PM, Gustavo Niemeyer wrote:
> Today, while trying to internationalize a program I'm working on,
> I found an interesting side-effect of how we're dealing with
> encoding of unicode strings while being written to files.
>
> Suppose the following example:
>
> # -*- encoding: iso-8859-1 -*-
> print u"á"
>
> This will correctly print the string 'á', as expected. Now, what
> surprises me, is that the following code won't work in an equivalent
> way (unless using sys.setdefaultencoding()):
That doesn't work here, where sys.getdefaultencoding() is 'ascii', as
expected.
> # -*- encoding: iso-8859-1 -*-
> import sys
> sys.stdout.write(u"á\n")
>
> This will raise the following error:
>
> Traceback (most recent call last):
> File "asd.py", line 3, in ?
> sys.stdout.write(u"á")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
> in position 0: ordinal not in range(128)
That's expected.
> This difference may become a really annoying problem when trying to
> internationalize programs, since it's usual to see third-party code
> dealing with sys.stdout, instead of using 'print'. The standard
> optparse module, for instance, has a reference to sys.stdout which
> is used in the default --help handling mechanism.
>
> Given the fact that files have an 'encoding' parameter, and that
> any unicode strings with characters not in the 0-127 range will
> raise an exception if being written to files, isn't it reasonable
> to respect the 'encoding' attribute whenever writing data to a
> file?
No, because you don't know it's a file. You're calling a function with
a unicode object. The function doesn't know that the object was some
unicode object that came from a source file of some particular
encoding.
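To make that concrete, here is a small sketch (the variable names are only illustrative). It builds the same character two ways, once by decoding an ISO-8859-1 byte, roughly as happens for a literal in a latin-1 source file, and once from a pure-ASCII escape. The two objects compare equal and neither carries any record of an encoding; bytes only appear once an output encoding is chosen:

```python
# The same text obtained two ways: by decoding an ISO-8859-1 byte
# (roughly what happens to a literal in a latin-1 source file) and
# by writing an ASCII-only escape sequence.
from_latin1_source = b"\xe1".decode("iso-8859-1")
from_escape = u"\xe1"

# The resulting objects are indistinguishable; neither remembers
# which encoding, if any, it originally came from.
same = from_latin1_source == from_escape

# Bytes only come into existence when an output encoding is picked,
# and different choices give different bytes for the same character.
as_latin1 = from_escape.encode("iso-8859-1")
as_utf8 = from_escape.encode("utf-8")

# ASCII cannot represent the character at all.
try:
    from_escape.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```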
> The workaround for that problem is to either use the evil-considered
> sys.setdefaultencoding(), or to wrap sys.stdout. IMO, both options
> seem unreasonable for such a common idiom.
There's no guaranteed correlation whatsoever between the claimed
encoding of your source document and the encoding of the user's
terminal. Why do you want there to be? What if you have some source
files with 'foo' encoding and others with 'bar' encoding? What about
ASCII-encoded source documents that use escape sequences to represent
non-ASCII characters? What you want doesn't make any sense so long as
Python strings and file objects deal in bytes, not characters :)
Wrapping sys.stdout is the ONLY reasonable solution.
This is the idiom that I use. It's painless and works quite well:
import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
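A minimal way to see what the wrapper does. This sketch swaps in io.BytesIO as a stand-in for a byte-oriented stdout, purely so the written bytes can be inspected; that substitution and the variable names are assumptions for illustration, and the idiom above wraps the real sys.stdout:

```python
import codecs
import io

# Stand-in for a byte-oriented output stream (assumption for this sketch).
raw = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class; instantiating it around
# a stream gives an object whose write() encodes unicode text to bytes.
out = codecs.getwriter("utf-8")(raw)

# Write the character from the example, spelled as an ASCII-only escape.
out.write(u"\xe1\n")

# The underlying stream received UTF-8 bytes, not a unicode object.
encoded = raw.getvalue()
```

The 'utf-8' here is simply the encoding chosen for the output; a terminal that expects something else would need that encoding's name passed to codecs.getwriter() instead.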
-bob