[Tutor] UTF-8 filenames encountered in os.walk

Wed Jul 4 20:47:45 CEST 2007

Terry Carroll wrote:
> I'm pretty iffy on this stuff myself, but as I see it, you basically have 
> three kinds of things here.
> 
> First, an ascii string:
> 
>   s = 'abc'
> 
> In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'.
> 
> Second, a unicode string:
> 
>   u = u'abc' 
> 
> I can't say what this is "in hex" because that's not meaningful.  A 
> Unicode character is a code point, which can be represented in a variety 
> of ways, depending on the encoding used.  So, moving on....
> 
> Finally, you can have a sequence of bytes, which are stored in a string as 
> a buffer, that shows the particular encoding of a particular string:
> 
>   e8 = s.encode("UTF-8")
>   e16 = s.encode("UTF-16") 
> 
> Now, e8 and e16 are each strings (of bytes), the content of which tells
> you how the string of characters that was encoded is represented in that 
> particular encoding.

I would say that there are two kinds of strings, byte strings and 
unicode strings. Byte strings have an implicit encoding. If the contents 
of the byte string are all ascii characters, you can generally get away 
with ignoring that they are in an encoding, because most of the common 
8-bit character encodings include plain ascii as a subset (all the 
latin-x encodings, all the Windows cp12xx encodings, and utf-8 all have 
ascii as a subset), so an ascii string can be interpreted as any of 
those encodings without error. As soon as you get away from ascii, you 
have to be aware of the encoding of the string.
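
A minimal sketch of the ascii-subset point. It is written for Python 3 (an assumption; the sessions in this thread are Python 2), where byte strings are bytes and unicode strings are str. The sample values are made up for illustration.

```python
# Python 3 sketch: bytes plays the role of the byte string, str the unicode string.
ascii_bytes = b'abc'

# Pure-ascii bytes decode to the same text under any ascii-superset encoding.
decoded = [ascii_bytes.decode(enc)
           for enc in ('ascii', 'latin-1', 'cp1252', 'utf-8')]
print(decoded)

# Once a non-ascii byte appears, the choice of encoding matters.
non_ascii = b'caf\xe9'   # "cafe" with an acute e, encoded as latin-1
print(ascii(non_ascii.decode('latin-1')))

try:
    non_ascii.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as err:
    # The lone 0xe9 byte is not a complete utf-8 sequence.
    utf8_ok = False
    print('not valid utf-8:', err.reason)
```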

encode() really wants a unicode string, not a byte string. If you call 
encode() on a byte string, the string is first converted to unicode 
using the default encoding (usually ascii), then converted with the 
given encoding.
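
The two implicit steps can be spelled out explicitly. This is a sketch in Python 3 (an assumption; there bytes has no encode() method, so the steps must be written by hand), with ascii standing in for the default encoding.

```python
# Python 3 sketch of the two steps Python 2 performs when encode()
# is called on a byte string.
byte_string = b'abc'

# Step 1: byte string -> unicode, using the (assumed) default encoding.
as_unicode = byte_string.decode('ascii')

# Step 2: unicode -> bytes in the requested encoding.
e8 = as_unicode.encode('utf-8')
e16 = as_unicode.encode('utf-16')   # includes a BOM in native byte order

print(repr(as_unicode), e8, e16)
```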
> 
> In hex, these look like this.
> 
>   e8: 616263 (61 for 'a'; 62 for 'b', 63 for 'c')
>   e16: FFFE6100 62006300
>      (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')
> 
> Now, superficially, s and e8 are equal, because for plain old ascii 
> characters (which is all I've used in this example), UTF-8 is equivalent 
> to ascii.  And they compare the same:
> 
> >>> s == e8
> True

They are equal in every sense; I don't know why you consider this 
superficial. And if your original byte string contained any non-ascii 
characters, encode() would fail with a UnicodeDecodeError.
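
A rough illustration of that failure, again in Python 3 (an assumption), with the implicit ascii decode written out. The byte string is a made-up example.

```python
# Python 3 sketch: decoding with ascii first is what Python 2 does
# implicitly when encode() is called on a byte string.
non_ascii = b'caf\xe9'   # latin-1 bytes; 0xe9 is outside ascii

try:
    non_ascii.decode('ascii').encode('utf-8')
    failed = False
except UnicodeDecodeError as err:
    # The decode step fails before any encoding happens.
    failed = True
    print(type(err).__name__, 'at byte position', err.start)
```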
> 
> But that's not true of the UTF-16:
> 
> >>> s == e16
> False
> >>> e8 == e16
> False
> 
> So (and I'm open to correction on this), I think of the encode() method as 
> returning a string of bytes that represents the particular encoding of a 
> string value -- and it can't be used as the string value itself.

The idea that there is somehow some kind of string value that doesn't 
have an encoding will bring you a world of hurt as soon as you venture 
out of the realm of pure ascii. Every string is a particular encoding of 
character values. It's not any different from "the string value itself".
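
To make this concrete, here is a small Python 3 sketch (an assumption; the thread itself uses Python 2) showing the same characters as different bytes under different encodings, and what happens when bytes are read with the wrong one. The sample text is arbitrary.

```python
text = 'caf\xe9'   # "cafe" with an acute e, as a unicode string

# The same four characters become different byte sequences.
hex_forms = {enc: text.encode(enc).hex()
             for enc in ('latin-1', 'utf-8', 'utf-16-le')}
for enc, hex_form in hex_forms.items():
    print(enc, hex_form)

# Reading utf-8 bytes as latin-1 raises no error; it silently
# produces the wrong characters.
misread = text.encode('utf-8').decode('latin-1')
print(ascii(misread))
```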
> 
> But you can get that string value back (assuming all the characters map 
> to ascii):
> 
> >>> s8 = e8.decode("UTF-8")
> >>> s16 = e16.decode("UTF-16")
> >>> s == s8 == s16
> True

You can get back to the ascii-encoded representation of the string. 
Though here you are hiding something - s8 and s16 are unicode strings 
while s is a byte string.

In [13]: s = 'abc'
In [14]: e8 = s.encode("UTF-8")
In [15]: e16 = s.encode("UTF-16")
In [16]: s8 = e8.decode("UTF-8")
In [17]: s16 = e16.decode("UTF-16")
In [18]: s8
Out[18]: u'abc'
In [19]: s16
Out[19]: u'abc'
In [20]: s
Out[20]: 'abc'
In [21]: type(s8) == type(s)
Out[21]: False

The way I think of it is, unicode is the "pure" representation of the 
string. (This is nonsense, I know, but I find it a convenient mnemonic.) 
encode() converts from the "pure" representation to an encoded 
representation. The encoding can be ascii, latin-1, utf-8... decode() 
converts from the coded representation back to the "pure" one.
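
The mnemonic can be sketched as a round trip. This is Python 3 (an assumption), where str is the "pure" form and bytes the encoded one; the sample text is made up.

```python
pure = 'caf\xe9'   # unicode string: the "pure" representation

# encode() goes from pure to encoded; decode() with the same
# encoding goes back to the pure form.
round_trips = {}
for enc in ('latin-1', 'utf-8', 'utf-16'):
    encoded = pure.encode(enc)
    round_trips[enc] = (encoded.decode(enc) == pure)
print(round_trips)

# An encoding that cannot represent a character refuses to encode it.
try:
    pure.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print('ascii can encode it:', ascii_ok)
```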

Kent