[Tutor] UTF-8 filenames encountered in os.walk

Wed Jul 4 20:26:42 CEST 2007

William O'Higgins Witteman wrote:

> The problem is that the Windows filesystem uses UTF-8 as the encoding
> for filenames,

That's not what I get. For example, I made a file called "Tést.txt" and 
looked at what os.listdir() gives me. (os.listdir() is what os.walk() 
uses to get the file and directory names.) If I pass a byte string as 
the directory name, I get byte strings back, not in utf-8, but 
apparently in cp1252 (or latin-1, but this is Windows so it's probably 
cp1252):
 >>> os.listdir('C:\Documents and Settings')
['Administrator', 'All Users', 'Default User', 'LocalService', 
'NetworkService', 'T\xe9st.txt']

Note the \xe9 which is the cp1252 representation of é.
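
To make the difference concrete, here is a small sketch (my own illustration, not taken from your program) that encodes the same character three ways. The variable names are made up for the example:

```python
# The character é as a unicode string.
e_acute = u'\xe9'

# cp1252 and latin-1 both store é as the single byte 0xe9.
cp1252_bytes = e_acute.encode('cp1252')
latin1_bytes = e_acute.encode('latin-1')

# utf-8 needs two bytes for the same character: 0xc3 0xa9.
utf8_bytes = e_acute.encode('utf-8')

print(repr(cp1252_bytes), repr(latin1_bytes), repr(utf8_bytes))
```

So a utf-8 file name would have shown up in the listing as 'T\xc3\xa9st.txt', not 'T\xe9st.txt'.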

If I give the directory as a unicode string, the results are all unicode 
strings as well:
 >>> os.listdir(u'C:\Documents and Settings')
[u'Administrator', u'All Users', u'Default User', u'LocalService', 
u'NetworkService', u'T\xe9st.txt']

In neither case does it give me utf-8.
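
If utf-8 is what you need in the end, one approach is to walk with a unicode directory name and encode each name yourself. What follows is only a sketch under that assumption: utf8_names() is a hypothetical helper, and the scratch directory and its two ASCII-named files exist only so the example can run anywhere.

```python
import os
import shutil
import sys
import tempfile


def utf8_names(root):
    """Walk root (a unicode string) and return each file name encoded as utf-8."""
    encoded = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # A unicode root yields unicode names, so encode() is safe here.
            encoded.append(name.encode('utf-8'))
    return sorted(encoded)


# Hypothetical scratch directory with two made-up files, just for the demo.
scratch = tempfile.mkdtemp()
if isinstance(scratch, bytes):
    # Python 2 hands back a byte string; turn it into unicode first.
    scratch = scratch.decode(sys.getfilesystemencoding() or 'ascii')
try:
    os.mkdir(os.path.join(scratch, u'sub'))
    for relative in (u'a.txt', os.path.join(u'sub', u'b.txt')):
        open(os.path.join(scratch, relative), 'w').close()
    names = utf8_names(scratch)
finally:
    shutil.rmtree(scratch)

print(names)
```

With a file called Tést.txt in the tree, the corresponding entry would be the utf-8 byte string 'T\xc3\xa9st.txt'.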

> but os doesn't seem to have a UTF-8 mode, just an ascii
> mode and a Unicode mode.

It has a unicode string mode and a byte string mode.

> The code is quite complex for not-relevant-to-this-problem reasons.  The
> gist is that I walk the FS, get filenames, some of which get written to
> an XML file.  If I leave the output alone I get errors on reading the
> XML file.  

What kind of errors? Be specific! Show the code that generates the error.

I'll hazard a guess that you are writing the cp1252 characters to the 
XML file but not specifying the charset of the file, or specifying it as 
utf-8, and the reader croaks on the cp1252.
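
If that guess is right, one way out is to keep the names as unicode strings and let the XML library do the encoding, with the encoding declared in the file. A minimal sketch, assuming xml.etree.ElementTree and a Python recent enough for write() to accept xml_declaration. The element names 'files' and 'file' and the sample list are made up, since I haven't seen your code:

```python
import io
import xml.etree.ElementTree as ET

# Unicode names, as os.listdir() returns them for a unicode directory name.
names = [u'Administrator', u'T\xe9st.txt']

root = ET.Element('files')
for name in names:
    ET.SubElement(root, 'file').text = name

# Write utf-8 bytes and say so in the XML declaration.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding='utf-8', xml_declaration=True)
xml_bytes = buf.getvalue()

# A reader that honours the declaration gets the same names back.
parsed = ET.fromstring(xml_bytes)
round_tripped = [element.text for element in parsed.findall('file')]
print(round_tripped == names)
```

The point is that the bytes in the file and the declared charset agree, so the reader has nothing to croak on.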

> If I try to change the output so that it is all Unicode, I
> get errors because my UTF-8 data sometimes looks like ascii,

How do you change the output? What do you mean, the utf-8 data looks 
like ascii? Ascii data *is* valid utf-8; the two are byte-for-byte identical.

> I don't
> see a UTF-8-to-Unicode converter in the docs.

If s is a byte string containing utf-8, then s.decode('utf-8') is the 
equivalent unicode string.
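
Here is a short sketch of that, plus what happens when the codec does not match the bytes. The sample byte strings are my own, built from the Tést.txt example above:

```python
utf8_bytes = b'T\xc3\xa9st.txt'    # the name encoded as utf-8
cp1252_bytes = b'T\xe9st.txt'      # the same name encoded as cp1252

# Decoding with the matching codec gives the same unicode string both times.
from_utf8 = utf8_bytes.decode('utf-8')
from_cp1252 = cp1252_bytes.decode('cp1252')
same_text = (from_utf8 == from_cp1252 == u'T\xe9st.txt')

# Decoding cp1252 bytes as if they were utf-8 fails.
try:
    cp1252_bytes.decode('utf-8')
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = exc

print(same_text, wrong_codec_error)
```

So before decoding you have to know which encoding the bytes are actually in; decode() cannot guess it for you.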

>>> I suspect that my program will have to make sure to recast all
>>> equivalent-to-ascii strings as UTF-8 while leaving the ones that are
>>> already extended alone.
>> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an 
>> ascii string is *already* UTF-8 because the representation of the 
>> characters is identical. OTOH it makes sense to talk about converting an 
>> ascii string to a unicode string.
> 
> Then what does mystring.encode("UTF-8") do?

It depends on what mystring is. If it is a unicode string, it converts 
it to a plain (byte) string containing the utf-8 representation of 
mystring. For example,
In [8]: s=u'\xe9'  # Note the leading "u" - this is a unicode string
In [9]: s.encode('utf-8')
Out[9]: '\xc3\xa9'

If mystring is a plain (byte) string, it is first decoded to a unicode 
string using the default encoding (ascii unless you have changed it), 
then that unicode string is encoded to utf-8. This can work out two ways:
- if mystring originally contained only ascii characters, the result is 
identical to the original:
In [1]: s='abc'
In [2]: s.encode('utf-8')
Out[2]: 'abc'
In [4]: s.encode('utf-8') == s
Out[4]: True

- if mystring contains non-ascii characters, then the implicit *decode* 
using the ascii codec will fail with an exception:
In [5]: s = '\303\251'  # octal escapes; the same two bytes as '\xc3\xa9'
In [6]: s.encode('utf-8')
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Note this is exactly the same error you would get if you explicitly 
tried to convert to unicode using the ascii codec, because that is what 
is happening under the hood:

In [11]: s.decode('ascii')
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
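
The two hidden steps can be written out explicitly. This is only a sketch: implicit_encode() is a hypothetical helper that mimics the behaviour described above, not something from the standard library.

```python
def implicit_encode(byte_string, target='utf-8', default='ascii'):
    """Hypothetical helper: spell out the two steps behind byte_string.encode(target)."""
    as_unicode = byte_string.decode(default)   # the hidden step that can fail
    return as_unicode.encode(target)


# Pure ascii survives the round trip unchanged.
ascii_result = implicit_encode(b'abc')

# Non-ascii bytes fail in the decode step, before any encoding happens.
try:
    implicit_encode(b'\xc3\xa9')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The fix is to decode with the codec the bytes are really in.
fixed = b'\xc3\xa9'.decode('utf-8')

print(ascii_result, failure, repr(fixed))
```

In other words, calling encode() on a byte string is rarely what you want; decode it explicitly with the right codec and keep unicode strings inside the program.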

Again, it would really help if you would
- show some code
- show some data
- learn more about unicode, utf-8, character encodings and python strings.

Kent