[Tutor] UTF-8 filenames encountered in os.walk

Kent Johnson kent37 at tds.net
Wed Jul 4 17:28:53 CEST 2007


William O'Higgins Witteman wrote:
>> for thing in os.walk(u'.'):
>>
>> instead of:
>>
>> for thing in os.walk('.'): 
> 
> This is a good thought, and the crux of the problem.  I pull the
> starting directories from an XML file which is UTF-8, but by the time it
> hits my program, because there are no extended characters in the
> starting path, os.walk assumes ascii.  So, I recast the string as UTF-8,
> and I get UTF-8 output.  The problem happens further down the line.
> 
> I get a list of paths from the results of os.walk, all in UTF-8, but not
> identified as such.  If I just pass my list to other parts of the
> program it seems to assume either ascii or UTF-8, based on the
> individual list elements.  If I try to cast the whole list as UTF-8, I
> get an exception because it is assuming ascii and receiving UTF-8 for
> some list elements.

FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8
strings; they are not the same thing. A Unicode string is a sequence of
Unicode characters (stored internally as 16 or 32 bits each, depending
on how Python was built). It is a distinct data type from a 'regular'
string. Regular Python strings are byte strings with an implicit
encoding. One possible encoding is UTF-8, which uses one to four bytes
to represent each character.
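
To make the distinction concrete, here is a minimal sketch. The variable
names are only illustrative, and the u'' literal is used so the same
lines should behave the same way on Python 2 and on recent Python 3:

```python
# Sketch: a Unicode string versus the UTF-8 byte string that encodes it.
text = u'caf\xe9'              # Unicode string: 4 characters, the last is e-acute
data = text.encode('utf-8')    # byte string: the same text encoded as UTF-8

print(len(text))   # counts characters
print(len(data))   # counts bytes; e-acute needs two bytes in UTF-8

# Decoding with the right encoding recovers the original Unicode string.
assert data.decode('utf-8') == text

# The two objects have different types and hold different things.
assert type(text) is not type(data)
```

The point is that encode() and decode() convert between the two types;
neither one relabels a string in place.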

Some good reading on Unicode and UTF-8:
http://www.joelonsoftware.com/articles/Unicode.html
http://effbot.org/zone/unicode-objects.htm

If you pass a unicode string (not a UTF-8 byte string) to os.walk(), the
directory and file names in the resulting lists will also be unicode
strings.
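
One way to check this on your own directory tree is sketched below.
walk_names and all_same_type_as are hypothetical helper names made up
for this example; they are not part of the os module:

```python
import os

def walk_names(top):
    """Collect every directory path, directory name and file name that
    os.walk() yields when started at top."""
    names = []
    for dirpath, dirnames, filenames in os.walk(top):
        names.append(dirpath)
        names.extend(dirnames)
        names.extend(filenames)
    return names

def all_same_type_as(top, names):
    """Return True if every collected name has the same string type as top."""
    return all(type(name) is type(top) for name in names)
```

Calling all_same_type_as(u'.', walk_names(u'.')) should normally give
True. One caveat: on Python 2, a file name that cannot be decoded with
the filesystem encoding may still come back as a byte string, so a False
result there points at an undecodable name rather than at os.walk itself.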

Again, it would be helpful to see the code that is getting the error.

> I suspect that my program will have to make sure to recast all
> equivalent-to-ascii strings as UTF-8 while leaving the ones that are
> already extended alone.

It is nonsense to talk about 'recasting' an ascii string as UTF-8; an 
ascii string is *already* UTF-8 because UTF-8 encodes every ascii 
character as the same single byte. OTOH it makes sense to talk about 
converting (decoding) an ascii string to a unicode string.
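
A rough sketch of both points follows. It assumes the byte strings
really are UTF-8 and that b'' literals are available (Python 2.6 or
later); to_unicode is a hypothetical helper written for this example,
not a standard function:

```python
ascii_bytes = b'report.txt'        # only ascii characters
utf8_bytes = b'caf\xc3\xa9.txt'    # UTF-8 bytes for a name containing e-acute

# An ascii byte string decodes to the same unicode string either way.
assert ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')

# Non-ascii UTF-8 bytes cannot be decoded as ascii; this is the kind of
# exception you get when Python assumes ascii for a UTF-8 byte string.
try:
    utf8_bytes.decode('ascii')
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False

def to_unicode(name, encoding='utf-8'):
    """Decode byte strings with the given encoding; return unicode
    strings unchanged."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name

# A mixed list becomes a list of unicode strings in one pass.
names = [to_unicode(n) for n in (ascii_bytes, utf8_bytes, u'already unicode')]
```

Decoding everything once, at the point where the names enter the
program, means there is no need to treat ascii-only names as a special
case later on.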

Kent

