[Tutor] UTF-8 filenames encountered in os.walk

Wed Jul 4 18:48:40 CEST 2007

On Wed, 2007-07-04 at 12:00 -0400, William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote:
> 
> >FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8
> >strings; they are not the same thing. A Unicode string represents
> >each character as a code point (stored in 16 or 32 bits, depending
> >on how Python was built). It is a distinct data type from a 'regular'
> >string. Regular Python strings are byte strings with an implicit
> >encoding. One possible encoding is UTF-8, which uses one or more bytes
> >to represent each character.
> >
> >Some good reading on Unicode and utf-8:
> >http://www.joelonsoftware.com/articles/Unicode.html
> >http://effbot.org/zone/unicode-objects.htm
> 
> The problem is that the Windows filesystem uses UTF-8 as the encoding
> for filenames, but os doesn't seem to have a UTF-8 mode, just an ascii
> mode and a Unicode mode.

Are you converting your utf-8 strings to unicode?

unicode_file_name = utf8_file_name.decode('UTF-8')
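
A slightly fuller sketch of that one-liner, using a made-up filename
(the byte values below are only an illustration, not taken from your
data).  It shows that the byte string and the decoded unicode string
differ in length whenever a non-ascii character is present:

```python
# Hypothetical UTF-8 encoded filename: "cafe" with an accented e.
utf8_file_name = b'caf\xc3\xa9.txt'

# Decode the bytes into a unicode string.
unicode_file_name = utf8_file_name.decode('UTF-8')

# The accented character takes two bytes in UTF-8 but is one character.
byte_length = len(utf8_file_name)
char_length = len(unicode_file_name)
```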

> >If you pass a unicode string (not utf-8) to os.walk(), the resulting 
> >lists will also be unicode.
> >
> >Again, it would be helpful to see the code that is getting the error.
> 
> The code is quite complex for not-relevant-to-this-problem reasons.  The
> gist is that I walk the FS, get filenames, some of which get written to
> an XML file.  If I leave the output alone I get errors on reading the
> XML file.  If I try to change the output so that it is all Unicode, I
> get errors because my UTF-8 data sometimes looks like ascii, and I don't
> see a UTF-8-to-Unicode converter in the docs.
> 

It is probably worth the effort to put together a simpler piece of code
that can illustrate the problem.
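
Something along these lines might be a starting point.  It is only a
sketch of the walk-and-write-XML gist you describe, not your actual
program: write_file_list, the <files>/<file> element names and the
sample filename are all invented for illustration.

```python
import io
import os
import sys
import tempfile
from xml.sax.saxutils import escape


def write_file_list(top, xml_path):
    """Walk `top` and write every file path to `xml_path` as UTF-8 XML."""
    with io.open(xml_path, 'w', encoding='utf-8') as out:
        out.write(u'<?xml version="1.0" encoding="UTF-8"?>\n<files>\n')
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                # escape() protects &, < and > in the path.
                out.write(u'  <file>%s</file>\n' % escape(full_path))
        out.write(u'</files>\n')


# Throwaway directory holding one file with a non-ascii name.
top = tempfile.mkdtemp()
if isinstance(top, bytes):
    # Python 2 returns a byte string here; make it unicode so that
    # os.walk() yields unicode names as well.
    top = top.decode(sys.getfilesystemencoding())
io.open(os.path.join(top, u'caf\xe9.txt'), 'w', encoding='utf-8').close()

# Write the listing outside the walked directory.
xml_path = os.path.join(tempfile.mkdtemp(), 'files.xml')
write_file_list(top, xml_path)
```

The point of the sketch is that everything stays unicode from os.walk
until the moment it is written, and the encoding happens in exactly one
place (the file object opened with encoding='utf-8').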

> >>I suspect that my program will have to make sure to recast all
> >>equivalent-to-ascii strings as UTF-8 while leaving the ones that are
> >>already extended alone.
> >
> >It is nonsense to talk about 'recasting' an ascii string as UTF-8; an 
> >ascii string is *already* UTF-8 because the representation of the 
> >characters is identical. OTOH it makes sense to talk about converting an 
> >ascii string to a unicode string.
> 
> Then what does mystring.encode("UTF-8") do?

It uses the UTF-8 encoding rules to convert mystring FROM unicode to a
byte string.  If mystring is *NOT* unicode but a regular string, Python
first decodes it with the default encoding (normally ascii) and then
encodes the result, i.e. a round trip.  For a pure ascii string the
result is the same as what you started with; a string containing
non-ascii bytes, such as UTF-8 data, fails with a UnicodeDecodeError.
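
The round trip can be spelled out explicitly.  This is a sketch that
assumes the implicit decode step uses ascii (the usual default);
roundtrip_encode is a hypothetical helper, not something in the
standard library, and the sample strings are made up:

```python
def roundtrip_encode(byte_string, default_encoding='ascii'):
    """Decode with the default encoding, then encode as UTF-8."""
    return byte_string.decode(default_encoding).encode('UTF-8')


# Pure ascii bytes survive the round trip unchanged.
plain = b'report.txt'
plain_result = roundtrip_encode(plain)

# UTF-8 bytes outside the ascii range fail at the decode step.
extended = b'caf\xc3\xa9.txt'
try:
    roundtrip_encode(extended)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

So calling .encode("UTF-8") on a byte string never converts anything;
at best it hands back the same bytes.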

The data in a file is a stream of bytes that encodes unicode
characters.  The bytes must be decoded to recover the underlying
unicode, and the unicode must be encoded again when it is written to a
file.  UTF-8 is just one of many possible encoding schemes.
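
As a small illustration of that decode-on-read, encode-on-write cycle,
here is a sketch using io.open (codecs.open plays the same role on
older Python versions).  The file name and contents are made up:

```python
import io
import os
import tempfile

name = u'caf\xe9.txt'
path = os.path.join(tempfile.mkdtemp(), 'names.txt')

# Writing: the file object encodes unicode to UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(name)

# The file itself holds only bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Reading: the file object decodes the bytes back to unicode.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
```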

-- 
Lloyd Kvam
Venix Corp