[Tutor] UTF-8 filenames encountered in os.walk

William O'Higgins Witteman hmm at woolgathering.cx
Wed Jul 4 21:07:40 CEST 2007


On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:

>encode() really wants a unicode string not a byte string. If you call 
>encode() on a byte string, the string is first converted to unicode 
>using the default encoding (usually ascii), then converted with the 
>given encoding.
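
A minimal sketch of the two-step behaviour Kent describes, written in
Python 3 syntax, where the implicit decode has to be spelled out.  The
helper name encode_like_python2 is hypothetical, and the "ascii" default
is an assumption taken from the "usually ascii" remark above.

```python
def encode_like_python2(byte_string, target="utf-8", default="ascii"):
    """Hypothetical helper: mimic calling encode() on a Python 2 byte string."""
    # Step 1 (implicit in Python 2): decode the bytes with the default encoding.
    text = byte_string.decode(default)
    # Step 2: encode the resulting unicode string with the requested encoding.
    return text.encode(target)


# Pure ASCII bytes survive both steps unchanged.
print(encode_like_python2(b"plain.txt"))

try:
    # "\u00e9" (e with acute accent) is the single byte 0xE9 in cp1252.
    encode_like_python2("\u00e9".encode("cp1252"))
except UnicodeDecodeError as exc:
    # The failure comes from step 1, the decode, not from the encode.
    print("decode step failed:", exc.reason)
```

This is why an encode() call can fail with a decode error: the byte
string never gets past the first, hidden step.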

Aha!  That helps.  Another relevant detail is that my Python code
generates output that several other tools consume.  Some interesting
facts:

Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

I am indeed seeing filenames in cp1252, even though the Microsoft docs
say that filenames are in UTF-8.

Filenames in Arabic are in UTF-8.

What I have to do is check the encoding of each filename as received
from os.walk (and thus os.listdir), convert it to Unicode, continue
processing it, and then encode it as UTF-8 for output to XML.
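
The steps above could be sketched roughly as follows, in Python 3
syntax.  This is only one possible approach under stated assumptions:
filenames arrive as byte strings, every name is either UTF-8 or cp1252,
and UTF-8 is tried first.  The function names are hypothetical.

```python
import os


def filename_to_unicode(raw_name):
    """Decode one filename, assuming it is either UTF-8 or cp1252."""
    try:
        # Try UTF-8 first: cp1252 bytes for accented letters are rarely valid UTF-8.
        return raw_name.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to cp1252; bytes it leaves undefined become U+FFFD.
        return raw_name.decode("cp1252", errors="replace")


def walk_unicode_names(root):
    """Yield every filename under root as a unicode string."""
    # Passing a bytes path makes os.walk return bytes filenames.
    for _dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for raw_name in filenames:
            yield filename_to_unicode(raw_name)


def to_xml_bytes(name):
    """Encode a unicode filename as UTF-8 for writing into the XML output."""
    return name.encode("utf-8")
```

For example, to_xml_bytes(filename_to_unicode(raw)) normalises a single
name.  The guess is not foolproof: a cp1252 name whose bytes happen to
form valid UTF-8 would be decoded as UTF-8.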

In trying to work around bad third-party tools and inconsistent data, I
introduced errors into my Python code.  The problem was that I treated
all filenames the same way, even though the filesystem had not created
them all the same way.

Thanks for all the help and suggestions.
-- 

yours,

William

