Yet another unicode WTF
Ned Deily
nad at acm.org
Fri Jun 5 03:03:30 EDT 2009
In article <8763fbmk5a.fsf at benfinney.id.au>,
Ben Finney <ben+python at benfinney.id.au> wrote:
> Ned Deily <nad at acm.org> writes:
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> > sys.stdout.isatty()'
> > UTF-8 True
> > $ python2.6 -c 'import sys; print sys.stdout.encoding, \
> > sys.stdout.isatty()' > foo ; cat foo
> > None False
>
> So shouldn't the second case also detect UTF-8? The filesystem knows
> it's UTF-8, the shell knows it too. Why doesn't Python know it?
The filesystem knows what is UTF-8? While the setting of the locale
environment variables may influence how the file system interprets the
*name* of a file, it has no direct influence on what the *contents* of a
file is or is supposed to be. Remember in python 2.x, a file is just a
sequence of bytes. If you want to encode Unicode and write it to a file,
you need to use something like codecs.open to wrap the file object with
the proper StreamWriter encoder.
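A minimal sketch of that pattern (the filename is just illustrative; this
works the same in 2.x and 3.x):

```python
# -*- coding: utf-8 -*-
import codecs

# codecs.open wraps the raw file object in a StreamWriter that
# encodes unicode strings to UTF-8 bytes on every write.
f = codecs.open('foo.txt', 'w', encoding='utf-8')
f.write(u'\u0430\u0431\u0432')  # Cyrillic a, b, v
f.close()
```

Without the codecs wrapper, writing that string to a plain file object in
2.x falls back to the ASCII codec and raises UnicodeEncodeError.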
What confuses matters in 2.x is the print statement's under-the-covers
implicit Unicode encoding for files connected to a terminal:
http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails
>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> sys.stdout.encoding
'UTF-8'
In python 3.x, of course, the encoding happens automatically but you
still have to tell python, via the "encoding" argument to open, what the
encoding of the file's content is (or accept python's default which may
not be very useful):
>>> open('foo1','w').encoding
'mac-roman'
WTF, indeed.
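The fix in 3.x is simply to pass the encoding explicitly rather than
trusting the locale-derived default; a sketch (filename illustrative):

```python
# Name the encoding instead of relying on the locale-dependent
# default (which produced 'mac-roman' in the session above).
f = open('foo1', 'w', encoding='utf-8')
print(f.encoding)            # the encoding we asked for: utf-8
f.write('\u0430\u0431\u0432')  # encoded to UTF-8 bytes on write
f.close()
```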
--
Ned Deily,
nad at acm.org