Yet another unicode WTF

Ned Deily nad at
Fri Jun 5 03:03:30 EDT 2009

In article <8763fbmk5a.fsf at>,
 Ben Finney <ben+python at> wrote:
> Ned Deily <nad at> writes:
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >  sys.stdout.isatty()'
> > UTF-8 True
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> >  sys.stdout.isatty()' > foo ; cat foo
> > None False
> So shouldn't the second case also detect UTF-8? The filesystem knows
> it's UTF-8, the shell knows it too. Why doesn't Python know it?

The filesystem knows what is UTF-8?  While the setting of the locale
environment variables may influence how the filesystem interprets the
*name* of a file, it has no direct influence on what the *contents* of a
file are or are supposed to be.  Remember, in Python 2.x a file is just a
sequence of bytes.  If you want to write encoded Unicode to the file, you
need to wrap the file object with the proper streamwriter encoder.
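
As a rough sketch of that wrapping, assuming the standard codecs module
and a throwaway temporary file (the path and variable names here are
only illustrative):

```python
import codecs
import os
import tempfile

text = u'\u0430\u0431\u0432'

# Illustrative scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'foo')

# Open the file as raw bytes, then wrap it so that Unicode written to
# the wrapper is encoded as UTF-8 on its way to the file.
with open(path, 'wb') as raw:
    writer = codecs.getwriter('utf-8')(raw)
    writer.write(text)

# Read the bytes back to see what actually landed on disk.
with open(path, 'rb') as f:
    data = f.read()
```

The file itself still holds only bytes; the wrapper is what knows
about UTF-8.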

What confuses matters in 2.x is the print statement's under-the-covers 
implicit Unicode encoding for files connected to a terminal:

>>> import sys
>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 
0-2: ordinal not in range(128)
>>> sys.stdout.encoding
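
One way to avoid depending on that implicit behaviour is to encode
explicitly before writing.  A minimal sketch follows; the helper
encode_for_stream, the FakeTerminal class, and the 'utf-8' fallback are
illustrative assumptions, not anything Python provides:

```python
import io


def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode text using the stream's declared encoding, else a fallback."""
    # A stream may report encoding None or have no such attribute at
    # all; in either case use the (assumed) fallback.
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding)


class FakeTerminal(object):
    """Hypothetical stand-in for a stream that declares an encoding."""
    encoding = 'latin-1'


no_encoding = io.BytesIO()  # a byte stream with no encoding attribute
from_fallback = encode_for_stream(u'\u0430\u0431\u0432', no_encoding)
from_declared = encode_for_stream(u'\xe9', FakeTerminal())
```

The resulting bytes can then be written to the stream without any
guessing on Python's part.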

In Python 3.x, of course, the encoding happens automatically, but you
still have to tell Python, via the "encoding" argument to open(), what
the encoding of the file's content is (or accept Python's default, which
may not be very useful):

>>> open('foo1','w').encoding

WTF, indeed.
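
For the Python 3.x case, passing the encoding explicitly avoids
depending on the default.  A small sketch, again using an illustrative
temporary path:

```python
import os
import tempfile

text = '\u0430\u0431\u0432'

# Illustrative scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'foo1')

# State the encoding explicitly rather than accepting the default.
with open(path, 'w', encoding='utf-8') as f:
    used_encoding = f.encoding
    f.write(text)

# Reading in binary mode shows the encoded bytes on disk.
with open(path, 'rb') as f:
    raw = f.read()
```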

 Ned Deily,
 nad at
