[Tutor] UTF-8 filenames encountered in os.walk

Wed Jul 4 22:25:33 CEST 2007

William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
> 
>> encode() really wants a unicode string not a byte string. If you call 
>> encode() on a byte string, the string is first converted to unicode 
>> using the default encoding (usually ascii), then converted with the 
>> given encoding.
> 
> Aha!  That helps.  Something else that helps is that my Python code is
> generating output that is received by several other tools.  Interesting
> facts:
> 
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs
> say that filenames are in UTF-8.
> 
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With filenames of Tést.txt 
and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me
 >>> os.listdir('.')
['Administrator', 'All Users', 'Default User', 'LocalService', 
'NetworkService', 'T\xe9st.txt', '?.txt']
 >>> os.listdir(u'.')
[u'Administrator', u'All Users', u'Default User', u'LocalService', 
u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte string directory the Arabic name is lost (it comes back 
as '?.txt'), and with a unicode directory os.listdir() gives unicode 
strings, not UTF-8 byte strings.
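
To make the unicode vs. UTF-8 distinction concrete, here is a small 
sketch of how one unicode name turns into different byte strings 
depending on the encoding. The variable names are mine, and I am 
assuming the Windows ANSI code page is cp1252; if so, the lossy 
conversion at the end would account for the '?' seen in the byte-string 
listing.

```python
# Illustrative names mirroring the two test files in the listing.
latin_name = u'T\xe9st.txt'    # e with acute accent, U+00E9
arabic_name = u'\u0642.txt'    # Arabic letter qaf, U+0642

# A unicode string is a sequence of code points. Bytes only appear
# once an encoding is chosen, and each encoding gives different bytes.
latin_cp1252 = latin_name.encode('cp1252')   # one byte for U+00E9
latin_utf8 = latin_name.encode('utf-8')      # two bytes for U+00E9
arabic_utf8 = arabic_name.encode('utf-8')    # two bytes for U+0642

# cp1252 has no Arabic letters, so a lossy conversion substitutes '?'.
# Assumption: this resembles what happened in the byte-string listing.
arabic_cp1252 = arabic_name.encode('cp1252', 'replace')

for encoded in (latin_cp1252, latin_utf8, arabic_utf8, arabic_cp1252):
    print(repr(encoded))
```

Once the name has been reduced to '?.txt' the original character cannot 
be recovered, which is why the unicode listing is the one to work from.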

> What I have to do is to check the encoding of the filename as received
> by os.walk (and thus os.listdir) and convert them to Unicode, continue
> to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to 
determine the encoding of a byte string by looking at it. The most 
common approach is to try candidate encodings until one successfully 
decodes the string; more sophisticated variations look at the 
distribution of character codes.
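
A minimal sketch of the try-each-encoding approach follows. 
guess_decode() is a hypothetical helper, not a library function, and 
the candidate list and its order are assumptions on my part. UTF-8 goes 
first because it rejects most non-UTF-8 input, while cp1252 accepts 
almost any byte sequence.

```python
def guess_decode(raw, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first encoding that decodes raw.

    Hypothetical helper: a successful decode does not prove that the
    guess is right, only that the bytes are valid in that encoding.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError('none of %r could decode %r' % (encodings, raw))


# b'T\xe9st.txt' is not valid UTF-8, so the cp1252 guess is used.
text, encoding = guess_decode(b'T\xe9st.txt')
print(repr(text), encoding)

# Valid UTF-8 is picked up by the first candidate.
text, encoding = guess_decode(b'\xd9\x82.txt')
print(repr(text), encoding)
```

The order matters: with cp1252 tried first, the UTF-8 bytes for the 
Arabic name would decode without error into the wrong characters, which 
is the sense in which this is not completely reliable.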

Anyway, if you pass a unicode directory to os.walk() and work with the 
unicode file names it returns, you shouldn't have to worry about this.
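
Something along these lines, as a rough sketch: walk_utf8_names() is a 
hypothetical helper name, and I am assuming the UTF-8 encoding step 
happens only at the point where you write the XML.

```python
import os


def walk_utf8_names(top):
    """Yield the path of every file under top as a UTF-8 byte string.

    Hypothetical helper. top should be a unicode string (for example
    u'.') so that os.walk() hands back unicode names.
    """
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)  # still unicode here
            yield path.encode('utf-8')          # bytes, for output only


if __name__ == '__main__':
    for encoded_path in walk_utf8_names(u'.'):
        print(repr(encoded_path))
```

All the processing between the walk and the output stays in unicode, so 
there is never a byte string whose encoding has to be guessed.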

Kent