[Tutor] UTF-8 filenames encountered in os.walk

Wed Jul 4 19:44:44 CEST 2007

On Wed, 4 Jul 2007, William O'Higgins Witteman wrote:

> >It is nonsense to talk about 'recasting' an ascii string as UTF-8; an 
> >ascii string is *already* UTF-8 because the representation of the 
> >characters is identical. OTOH it makes sense to talk about converting an 
> >ascii string to a unicode string.
> 
> Then what does mystring.encode("UTF-8") do?

I'm pretty iffy on this stuff myself, but as I see it, you basically have 
three kinds of things here.

First, an ascii string:

  s = 'abc'

In hex, this is 616263: 61 for 'a', 62 for 'b', 63 for 'c'.

Second, a unicode string:

  u = u'abc' 

I can't say what this is "in hex" because that's not meaningful.  A 
Unicode character is a code point, which can be represented in a variety 
of ways, depending on the encoding used.  So, moving on....
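
As a rough sketch of what "code point" means here: each character is just a number, and the number of bytes it occupies only gets decided once you pick an encoding. The variable names below are purely illustrative, and the snippet sticks to syntax that should be accepted by both Python 2.6+ and Python 3.3+.

```python
u = u'abc'  # a text (unicode) string, not bytes

# Each character is an abstract code point, i.e. an integer.
code_points = [ord(ch) for ch in u]      # 97, 98, 99 (0x61, 0x62, 0x63)

# The same three code points take up a different number of bytes
# depending on which encoding is chosen.
utf8_len = len(u.encode('utf-8'))        # one byte per character here
utf16_len = len(u.encode('utf-16-le'))   # two bytes per character here

print(code_points)
print(utf8_len)
print(utf16_len)
```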

Finally, you can have a sequence of bytes, stored in a string that is used 
as a buffer, which holds one particular encoding of a particular string:

  e8 = s.encode("UTF-8")
  e16 = s.encode("UTF-16") 

Now, e8 and e16 are each strings (of bytes), the content of which tells
you how the string of characters that was encoded is represented in that 
particular encoding.

In hex, these look like this.

  e8: 616263 (61 for 'a', 62 for 'b', 63 for 'c')
  e16: FFFE6100 62006300
     (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')
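
The hex above is what a little-endian machine produces; as far as I know, plain "UTF-16" writes a BOM followed by the platform's native byte order, so the exact bytes can differ elsewhere. A hedged sketch for checking this yourself follows. The `to_hex` helper is a hypothetical convenience, not a standard function, and the code should run on Python 2.6+ as well as Python 3.

```python
import binascii
import codecs


def to_hex(raw):
    """Hypothetical helper: return the hex digits of a byte string as text."""
    return binascii.hexlify(raw).decode('ascii')


u = u'abc'

bom_le = to_hex(codecs.BOM_UTF16_LE)    # 'fffe'
bom_be = to_hex(codecs.BOM_UTF16_BE)    # 'feff'

# With an explicit byte order, no BOM is written.
le = to_hex(u.encode('utf-16-le'))      # '610062006300'
be = to_hex(u.encode('utf-16-be'))      # '006100620063'

# Plain 'utf-16' prepends a BOM and then uses the native byte order,
# so the result depends on the machine it runs on.
with_bom = to_hex(u.encode('utf-16'))

print(with_bom)
```

On a little-endian machine, `with_bom` should match the FFFE6100 62006300 shown above.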

Now, superficially, s and e8 are equal, because for plain old ascii 
characters (which is all I've used in this example), UTF-8 is equivalent 
to ascii.  And they compare the same:

>>> s == e8
True

But that's not true of the UTF-16:

>>> s == e16
False
>>> e8 == e16
False

So (and I'm open to correction on this), I think of the encode() method as 
returning a string of bytes that represents the particular encoding of a 
string value -- and it can't be used as the string value itself.

But you can get that string value back (assuming all the characters map 
to ascii):

>>> s8 = e8.decode("UTF-8")
>>> s16 = e16.decode("UTF-16")
>>> s == s8 == s16
True
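
The "assuming all the characters map to ascii" caveat matters as soon as a filename contains something like an accented letter. Here is a minimal sketch of that case, using a made-up string rather than anything from the original thread; it should behave the same on Python 2.6+ and Python 3.3+.

```python
text = u'caf\xe9'            # 'cafe' with an acute accent on the e

raw = text.encode('utf-8')   # four characters become five bytes

# Decoding with the same encoding gives the original text back.
round_trip_ok = (raw.decode('utf-8') == text)

# Decoding the same bytes as ASCII fails, because the bytes C3 A9
# are outside the ASCII range.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(text))
print(len(raw))
print(round_trip_ok)
print(ascii_ok)
```

So the encoded bytes and the text only look interchangeable while everything stays within ASCII.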