[Tutor] UTF-8 filenames encountered in os.walk
Kent Johnson
kent37 at tds.net
Wed Jul 4 22:25:33 CEST 2007
William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
>
>> encode() really wants a unicode string not a byte string. If you call
>> encode() on a byte string, the string is first converted to unicode
>> using the default encoding (usually ascii), then converted with the
>> given encoding.
>
> Aha! That helps. Something else that helps is that my Python code is
> generating output that is received by several other tools. Interesting
> facts:
>
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs
> say that filenames are in UTF-8.
>
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With filenames of Tést.txt
and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me:

>>> os.listdir('.')
['Administrator', 'All Users', 'Default User', 'LocalService',
'NetworkService', 'T\xe9st.txt', '?.txt']
>>> os.listdir(u'.')
[u'Administrator', u'All Users', u'Default User', u'LocalService',
u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte string directory it fails (the Arabic name comes back as
'?.txt'), while with a unicode directory it gives unicode strings, not
UTF-8.
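To see why the byte-string listing looks the way it does, here is a small
sketch that encodes the two unicode names by hand. It assumes the machine's
ANSI code page is cp1252 (as the 'T\xe9st.txt' result suggests), and it uses
the 'replace' error handler only as a rough stand-in for whatever Windows
does with characters the code page cannot represent. The helper name
byte_forms is made up for illustration.

```python
def byte_forms(name):
    """Return (utf-8 bytes, cp1252 bytes) for a unicode file name.

    Characters that cp1252 cannot represent are replaced with '?'.
    This only approximates the lossy conversion seen in os.listdir('.').
    """
    as_utf8 = name.encode('utf-8')
    as_cp1252 = name.encode('cp1252', 'replace')
    return as_utf8, as_cp1252


for name in [u'T\xe9st.txt', u'\u0642.txt']:
    print(repr(byte_forms(name)))
```

The cp1252 form of the first name is 'T\xe9st.txt' and its UTF-8 form is
'T\xc3\xa9st.txt'. The Arabic name is '\xd9\x82.txt' in UTF-8 but collapses
to '?.txt' in cp1252, which matches the byte-string listing above.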
> What I have to do is to check the encoding of the filename as received
> by os.walk (and thus os.listdir) and convert them to Unicode, continue
> to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to
determine the encoding of a byte string by looking at it. The most
common approach is to try candidate encodings until one successfully
decodes the string; more sophisticated variations look at the
distribution of character codes.
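A minimal sketch of the "try encodings until one decodes" approach follows.
The function name decode_filename and the candidate list ('utf-8', then
'cp1252') are assumptions chosen for this thread, not a general recipe. The
order matters: UTF-8 is tried first because it rejects most byte strings
that are not really UTF-8, while cp1252 accepts almost anything.

```python
def decode_filename(raw, encodings=('utf-8', 'cp1252')):
    """Guess-decode a byte-string file name to unicode.

    Tries each candidate encoding in order and returns the first result
    that decodes without error. This is a heuristic: a successful decode
    does not prove that the guess was right.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    return raw.decode('latin-1')


print(repr(decode_filename(b'T\xe9st.txt')))        # not valid UTF-8, falls to cp1252
print(repr(decode_filename(b'\xd9\x82.txt')))       # valid UTF-8
```

Note the weakness: the UTF-8 bytes '\xd9\x82' would also decode without
error as cp1252 (giving two unrelated characters), so with the candidates
in the other order the Arabic name would be silently mangled.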
Anyway, if you use the Unicode file names (pass a unicode directory to
os.walk() or os.listdir()), you shouldn't have to worry about this.

Kent
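As a sketch of that approach: walk the tree with a unicode top directory so
every name arrives as unicode, and encode to UTF-8 only once, when the XML
is written out. The function name filenames_as_xml and the <files>/<file>
element names are invented for this example; the real output format is
whatever the downstream tools expect.

```python
import os
from xml.sax.saxutils import escape


def filenames_as_xml(top):
    """Return a UTF-8 encoded XML listing of the files under top.

    top should be a unicode string (e.g. u'.') so that os.walk() yields
    unicode names and no encoding guesswork is needed.
    """
    lines = [u'<?xml version="1.0" encoding="UTF-8"?>', u'<files>']
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # escape() handles &, < and > in file names.
            lines.append(u'  <file>%s</file>' % escape(path))
    lines.append(u'</files>')
    # Encode exactly once, at the output boundary.
    return u'\n'.join(lines).encode('utf-8')
```

Calling filenames_as_xml(u'.') returns a byte string that can be written
to a file opened in binary mode.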