[Tutor] UTF-8 filenames encountered in os.walk
Kent Johnson
kent37 at tds.net
Wed Jul 4 20:26:42 CEST 2007
William O'Higgins Witteman wrote:
> The problem is that the Windows filesystem uses UTF-8 as the encoding
> for filenames,
That's not what I get. For example, I made a file called "Tést.txt" and
looked at what os.listdir() gives me. (os.listdir() is what os.walk()
uses to get the file and directory names.) If I pass a byte string as
the directory name, I get byte strings back, not in utf-8, but
apparently in cp1252 (or latin-1, but this is Windows so it's probably
cp1252):
>>> os.listdir('C:\Documents and Settings')
['Administrator', 'All Users', 'Default User', 'LocalService',
'NetworkService', 'T\xe9st.txt']
Note the \xe9 which is the cp1252 representation of é.
If I give the directory as a unicode string, the results are all unicode
strings as well:
>>> os.listdir(u'C:\Documents and Settings')
[u'Administrator', u'All Users', u'Default User', u'LocalService',
u'NetworkService', u'T\xe9st.txt']
In neither case does it give me utf-8.
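To make the difference concrete, here is a small sketch that encodes the same name under both codecs. The variable names are mine, and the literals are written so they also run on Python 3, where a byte string is bytes and a unicode string is str.

```python
# The file name from the example above, as a unicode string.
name = u'T\xe9st.txt'

# One byte (0xe9) for the accented character in cp1252 ...
cp1252_bytes = name.encode('cp1252')

# ... but two bytes (0xc3 0xa9) in utf-8.
utf8_bytes = name.encode('utf-8')

# The two byte strings differ, which is why guessing the wrong codec hurts.
same = (cp1252_bytes == utf8_bytes)
```

If os.listdir() were handing back utf-8, the byte-string listing would show 'T\xc3\xa9st.txt' rather than 'T\xe9st.txt'.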
> but os doesn't seem to have a UTF-8 mode, just an ascii
> mode and a Unicode mode.
It has a unicode string mode and a byte string mode.
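A rough sketch of the two modes with os.walk(), assuming Python 3 (os.fsencode() does not exist on Python 2). The helper walk_names() and the throwaway directory are only for illustration, and the file name is kept ASCII so the demo does not depend on the filesystem encoding.

```python
import os
import shutil
import tempfile


def walk_names(top):
    """Collect every file name that os.walk() reports under top."""
    names = []
    for dirpath, dirnames, filenames in os.walk(top):
        names.extend(filenames)
    return names


# Demo in a temporary directory that is removed afterwards.
top = tempfile.mkdtemp()
try:
    open(os.path.join(top, 'example.txt'), 'w').close()
    text_names = walk_names(top)               # text path in, text names out
    byte_names = walk_names(os.fsencode(top))  # bytes path in, bytes names out
finally:
    shutil.rmtree(top)
```

The type of the argument picks the mode: you get back the same kind of string you passed in.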
> The code is quite complex for not-relevant-to-this-problem reasons. The
> gist is that I walk the FS, get filenames, some of which get written to
> an XML file. If I leave the output alone I get errors on reading the
> XML file.
What kind of errors? Be specific! Show the code that generates the error.
I'll hazard a guess that you are writing the cp1252 characters to the
XML file but not specifying the charset of the file, or specifying it as
utf-8, and the reader croaks on the cp1252.
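Here is a sketch of the failure I am guessing at; it is not your code, just an assumption about what might be happening. It builds two tiny XML documents that both claim to be utf-8, one containing cp1252 bytes and one containing real utf-8 bytes, and tries to parse each with xml.etree.ElementTree. The element name and the parses() helper are made up for the example.

```python
import xml.etree.ElementTree as ET

name = u'T\xe9st.txt'
header = b'<?xml version="1.0" encoding="utf-8"?>'

# Declared as utf-8 but actually holding cp1252 bytes: the suspected bug.
bad_doc = header + b'<file>' + name.encode('cp1252') + b'</file>'

# Declared as utf-8 and really utf-8.
good_doc = header + b'<file>' + name.encode('utf-8') + b'</file>'


def parses(doc):
    """Return True if the parser accepts doc, False if it rejects it."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


bad_ok = parses(bad_doc)
good_ok = parses(good_doc)
recovered = ET.fromstring(good_doc).text
```

The lone 0xe9 byte is not a valid utf-8 sequence, so a conforming reader has to reject the first document.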
> If I try to change the output so that it is all Unicode, I
> get errors because my UTF-8 data sometimes looks like ascii,
How do you change the output? What do you mean, the utf-8 data looks
like ascii? Ascii data *is* utf-8; the bytes are identical.
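A quick check of that claim, using one of the directory names from the listing above (variable names are mine):

```python
text = u'Administrator'  # only ascii characters

ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')

# For pure-ascii text the two encodings produce exactly the same bytes,
# and either codec decodes them back to the same unicode string.
identical = (ascii_bytes == utf8_bytes)
round_trip = (ascii_bytes.decode('utf-8') == utf8_bytes.decode('ascii') == text)
```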
> I don't
> see a UTF-8-to-Unicode converter in the docs.
If s is a byte string containing utf-8, then s.decode('utf-8') is the
equivalent unicode string.
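For example (a minimal sketch; the b'' and u'' prefixes are there so it also runs on Python 3):

```python
utf8_bytes = b'\xc3\xa9'           # two bytes: the utf-8 encoding of one character
text = utf8_bytes.decode('utf-8')  # one unicode character, U+00E9

byte_count = len(utf8_bytes)
char_count = len(text)

# Encoding again gets the original bytes back.
round_trip = text.encode('utf-8')
```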
>>> I suspect that my program will have to make sure to recast all
>>> equivalent-to-ascii strings as UTF-8 while leaving the ones that are
>>> already extended alone.
>> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an
>> ascii string is *already* UTF-8 because the representation of the
>> characters is identical. OTOH it makes sense to talk about converting an
>> ascii string to a unicode string.
>
> Then what does mystring.encode("UTF-8") do?
It depends on what mystring is. If it is a unicode string, it converts
it to a plain (byte) string containing the utf-8 representation of
mystring. For example,
In [8]: s=u'\xe9' # Note the leading "u" - this is a unicode string
In [9]: s.encode('utf-8')
Out[9]: '\xc3\xa9'
If mystring is a plain (byte) string, it is first converted to a unicode
string using the default encoding (ascii unless you have changed it), and
then that unicode string is converted to utf-8. This can work out two ways:
- if mystring originally contained only ascii characters, the result is
identical to the original:
In [1]: s='abc'
In [2]: s.encode('utf-8')
Out[2]: 'abc'
In [4]: s.encode('utf-8') == s
Out[4]: True
- if mystring contains non-ascii characters, then the implicit *decode*
using the ascii codec will fail with an exception:
In [5]: s = '\303\251'
In [6]: s.encode('utf-8')
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
0xc3 in position 0: ordinal not in range(128)
Note this is exactly the same error you would get if you explicitly
tried to convert to unicode using the ascii codec, because that is what
is happening under the hood:
In [11]: s.decode('ascii')
------------------------------------------------------------
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte
0xc3 in position 0: ordinal not in range(128)
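One way around the hidden ascii step is to spell out the decode yourself, with the encoding the bytes are really in. The helper below is a hypothetical sketch, not something from the standard library, and the caller has to know the source encoding (for example 'cp1252' for the byte strings os.listdir() gave me above). It is written so it also runs on Python 3, where bytes has no encode() method and the implicit step does not exist at all.

```python
def to_utf8(value, source_encoding):
    """Return value as a utf-8 byte string.

    value may be a unicode string or a byte string. Byte strings are
    decoded with source_encoding first, instead of the default ascii codec.
    """
    if isinstance(value, bytes):
        value = value.decode(source_encoding)
    return value.encode('utf-8')


from_cp1252 = to_utf8(b'T\xe9st.txt', 'cp1252')
from_unicode = to_utf8(u'T\xe9st.txt', 'cp1252')
from_ascii = to_utf8(b'abc', 'ascii')
```

Naming the wrong source encoding still fails: to_utf8(b'\xc3\xa9', 'ascii') raises the same UnicodeDecodeError as in the tracebacks above.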
Again, it would really help if you would
- show some code
- show some data
- learn more about unicode, utf-8, character encodings and python strings.
Kent