[Tutor] UTF-8 filenames encountered in os.walk

Kent Johnson kent37 at tds.net
Wed Jul 4 22:25:33 CEST 2007


William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
> 
>> encode() really wants a unicode string not a byte string. If you call 
>> encode() on a byte string, the string is first converted to unicode 
>> using the default encoding (usually ascii), then converted with the 
>> given encoding.
> 
> Aha!  That helps.  Something else that helps is that my Python code is
> generating output that is received by several other tools.  Interesting
> facts:
> 
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs
> say that filenames are in UTF-8.
> 
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With filenames of Tést.txt 
and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me
 >>> os.listdir('.')
['Administrator', 'All Users', 'Default User', 'LocalService', 
'NetworkService', 'T\xe9st.txt', '?.txt']
 >>> os.listdir(u'.')
[u'Administrator', u'All Users', u'Default User', u'LocalService', 
u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte string directory the Arabic name is lost (it comes back 
as '?.txt'), and with a unicode directory os.listdir() gives unicode 
strings, not utf-8.
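
The '?' in the byte-string listing is consistent with the name having been 
encoded to a single-byte code page that has no Arabic letters. A small 
sketch of that effect, assuming cp1252 is the code page involved (an 
assumption; the listing alone does not prove it). The helper name 
to_code_page is made up for illustration, and the code is written so that 
it also runs on current Python:

```python
def to_code_page(name, code_page='cp1252'):
    """Encode a unicode file name to a legacy code page.

    Characters the code page cannot represent become '?'.
    cp1252 is an assumed default, not something read from the system.
    """
    return name.encode(code_page, 'replace')


for name in (u'T\xe9st.txt', u'\u0642.txt'):
    # Show the code-page bytes next to the UTF-8 bytes for comparison.
    print(repr(to_code_page(name)), repr(name.encode('utf-8')))
```

The accented name survives as the single byte \xe9, which is not UTF-8 
(UTF-8 would be \xc3\xa9), and the Arabic letter cannot be recovered from 
the '?' afterwards.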

> What I have to do is to check the encoding of the filename as received
> by os.walk (and thus os.listdir) and convert them to Unicode, continue
> to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to 
determine the encoding of a byte string by looking at it. The most 
common approach is to try candidate encodings until one decodes the 
string successfully; more sophisticated variations look at the 
distribution of character codes.
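
The "try encodings until one decodes" approach can be sketched as below. 
The function name guess_decode and the candidate list ('utf-8', then 
'cp1252') are illustrative assumptions, not anything from the original 
code under discussion. The order matters: UTF-8 is strict, so it is tried 
first, while a single-byte code page accepts almost any input.

```python
def guess_decode(raw, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes raw.

    Returns (None, None) if no candidate works. This is a guess, not a
    detection: a successful decode does not prove the encoding is right.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# Valid UTF-8 is accepted by the first candidate.
print(guess_decode(b'T\xc3\xa9st.txt'))
# A lone \xe9 is not valid UTF-8, so the cp1252 fallback is used.
print(guess_decode(b'T\xe9st.txt'))
```

The unreliability is easy to see: the UTF-8 bytes above also decode 
without error as cp1252, just to the wrong text, so putting cp1252 first 
in the list would silently give a garbled name.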

Anyway, if you pass a unicode directory to os.walk() and work with the 
Unicode file names it returns, you shouldn't have to worry about this.
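
A minimal sketch of that workflow: walk with a unicode top directory, keep 
the names as unicode while processing, and encode to UTF-8 only when 
writing the XML. The helper names (unicode_names, names_to_xml) and the 
<files>/<file> element layout are invented for illustration; the real 
output format is not shown in this thread.

```python
import os
import xml.etree.ElementTree as ET


def unicode_names(top):
    """Yield file names found under top.

    Pass a unicode string for top (e.g. u'.') so that os.walk returns
    unicode names rather than byte strings.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            yield name


def names_to_xml(names):
    """Build a small XML document from unicode names, encoded as UTF-8."""
    root = ET.Element('files')  # hypothetical element names
    for name in names:
        ET.SubElement(root, 'file').text = name
    # Encoding happens once, at the output boundary.
    return ET.tostring(root, encoding='utf-8')
```

Usage would look something like names_to_xml(unicode_names(u'.')), with 
the result written to a file opened in binary mode.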

Kent

