[Python-Dev] File encodings
Gustavo Niemeyer
niemeyer at conectiva.com
Mon Nov 29 20:04:48 CET 2004
Greetings,
Today, while internationalizing a program I'm working on, I found
an interesting side effect of how we handle the encoding of unicode
strings when they are written to files.
Consider the following example:

    # -*- encoding: iso-8859-1 -*-
    print u"á"
This will correctly print the string 'á', as expected. Now, what
surprises me is that the following code won't work in an equivalent
way (unless sys.setdefaultencoding() is used):
    # -*- encoding: iso-8859-1 -*-
    import sys
    sys.stdout.write(u"á\n")
This will raise the following error:
    Traceback (most recent call last):
      File "asd.py", line 3, in ?
        sys.stdout.write(u"á")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    in position 0: ordinal not in range(128)
This difference can become a really annoying problem when
internationalizing programs, since third-party code commonly writes
to sys.stdout directly instead of using 'print'. The standard
optparse module, for instance, holds a reference to sys.stdout which
is used in the default --help handling mechanism.
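To illustrate (a hypothetical sketch: the option and its translated
help text are made up, but the failure mode is the one above, since
optparse writes the help text through its sys.stdout reference):

    # -*- encoding: iso-8859-1 -*-
    import optparse

    parser = optparse.OptionParser()
    # A translated help string containing a non-ASCII character
    # turns the formatted help into a unicode string.
    parser.add_option("--value", help=u"opção")
    # --help makes optparse write that unicode help text to
    # sys.stdout, raising the same UnicodeEncodeError as above.
    parser.parse_args(["--help"])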
Given that file objects have an 'encoding' attribute, and that any
unicode string with characters outside the 0-127 range raises an
exception when written to a file, isn't it reasonable to respect
the 'encoding' attribute whenever writing data to a file?
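When stdout is a terminal, for instance, the detected encoding is
already exposed there, and encoding explicitly does work (a minimal
sketch; the 'ascii' fallback for redirected output is my assumption):

    # -*- encoding: iso-8859-1 -*-
    import sys

    # sys.stdout.encoding is set when stdout is a terminal, but may
    # be None when output is redirected, hence the explicit fallback.
    encoding = sys.stdout.encoding or "ascii"
    sys.stdout.write(u"á\n".encode(encoding))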
The workaround for this problem is to either use the
widely-considered-evil sys.setdefaultencoding(), or to wrap
sys.stdout (sketched below). IMO, both options seem unreasonable
for such a common idiom.
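For reference, the wrapping variant looks roughly like this (a
sketch using codecs.getwriter, assuming iso-8859-1 is the desired
output encoding):

    # -*- encoding: iso-8859-1 -*-
    import sys
    import codecs

    # Replace sys.stdout with a StreamWriter that encodes unicode
    # strings to iso-8859-1 before passing them to the real stdout.
    sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout)
    sys.stdout.write(u"á\n")  # now works without an exception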
--
Gustavo Niemeyer
http://niemeyer.net