[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Peter Otten __peter__ at web.de
Wed Sep 5 12:33:46 CEST 2012

Ray Jones wrote:

> I have directory names that contain Russian characters, Romanian
> characters, French characters, et al. When I search for a file using
> glob.glob(), I end up with stuff like \x93\x8c\xd1 in place of the
> directory names. I thought simply identifying them as Unicode would
> clear that up. Nope. Now I have stuff like \u0456\u0439\u043e.

That's the repr() form, which is guaranteed to be all-ascii. Python 
automatically applies repr() to a unicode string when it is part of a list:

>>> files = [u"\u0456\u0439\u043e"] # files = glob.glob(unicode_pattern)
>>> print files
[u'\u0456\u0439\u043e']

To see the actual characters print the unicode strings individually:

>>> for file in files:
...     print file
...
ійо

> These representations of directory names are eventually going to be
> passed to Dolphin (my file manager). Will they pass to Dolphin properly?

How exactly do you "pass" these names?

> Do I need to run a conversion? 

When you write them to a file you need to pick an encoding.
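As a sketch (the file name and sample strings here are invented, not from the thread), utf-8 is a safe pick because it can represent Russian, Romanian and French characters alike:

```python
# -*- coding: utf-8 -*-
# Sketch: write mixed-script unicode names to a text file with one
# explicit encoding. File name and sample names are invented.
import io

names = [u"\u0456\u0439\u043e", u"\u00e9t\u00e9"]  # "ійо", "été"
with io.open("names.txt", "w", encoding="utf-8") as f:
    for name in names:
        f.write(name + u"\n")

# Reading back with the same encoding restores the unicode strings.
with io.open("names.txt", "r", encoding="utf-8") as f:
    assert f.read().splitlines() == names
```

io.open takes an explicit encoding in Python 2.6+ and is the built-in open() in Python 3.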

> Can that happen automatically within the
> script considering that the various types of characters are all mixed
> together in the same directory (i.e. # coding: Latin-1 at the top of the
> script is not going to address all the different types of characters).

The coding cookie tells Python how to interpret the bytes in the source 
file, so the two scripts

# -*- coding: utf-8 -*-
s = u"äöü"


# -*- coding: latin1 -*-
s = u"äöü"

contain different byte sequences on disc, but once imported the two strings 
are equal (and have the same in-memory layout):

>>> import codecs
>>> for encoding in "latin-1", "utf-8":
...     with codecs.open("tmp_%s.py" % encoding.replace("-", ""), "w",
...                      encoding=encoding) as f:
...         f.write(u'# -*- coding: %s\ns = u"äöü"' % encoding)
...
>>> for encoding in "latin1", "utf8":
...     open("tmp_%s.py" % encoding).read()
...
'# -*- coding: latin-1\ns = u"\xe4\xf6\xfc"'
'# -*- coding: utf-8\ns = u"\xc3\xa4\xc3\xb6\xc3\xbc"'
>>> from tmp_latin1 import s
>>> from tmp_utf8 import s as t
>>> s == t
True

> While on the subject, I just read through the Unicode info for Python
> 2.7.3. The history was interesting, but the implementation portion was
> beyond me. I was looking for a way for a Russian 'backward R' to look
> like a Russian 'backward R' - not for a bunch of \xxx and \uxxxxx stuff.

>>> ya = u"Я"
>>> ya
u'\u042f'
>>> print ya
Я

This only works because Python correctly guesses the terminal encoding. If 
you are piping the output to a file or another program, Python assumes 
ascii and you will see an encoding error:

$ cat tmp.py
# -*- coding: utf-8 -*-
print u"Я"
$ python tmp.py
Я
$ python tmp.py | cat
Traceback (most recent call last):
  File "tmp.py", line 2, in <module>
    print u"Я"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u042f' in 
position 0: ordinal not in range(128)
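Python's guess is visible as sys.stdout.encoding, which makes the difference between the two runs easy to inspect (a sketch; the None-in-a-pipe behaviour described is Python 2's):

```python
# Sketch: inspect the encoding Python guessed for standard output.
# On a terminal this is typically something like "UTF-8"; in Python 2
# it is None when stdout is a pipe, so printing unicode falls back to
# the ascii codec and raises the UnicodeEncodeError shown above.
import sys

print(sys.stdout.encoding)
```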

You can work around that by specifying the appropriate encoding explicitly:

$ python tmp2.py iso-8859-5 | cat
$ python tmp2.py latin1 | cat
Traceback (most recent call last):
  File "tmp2.py", line 4, in <module>
    print u"Я".encode(encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in 
position 0: ordinal not in range(256)
