[Tutor] UTF-8 filenames encountered in os.walk
kent37 at tds.net
Wed Jul 4 20:47:45 CEST 2007
Terry Carroll wrote:
> I'm pretty iffy on this stuff myself, but as I see it, you basically have
> three kinds of things here.
> First, an ascii string:
> s = 'abc'
> In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'.
> Second, a unicode string:
> u = u'abc'
> I can't say what this is "in hex" because that's not meaningful. A
> Unicode character is a code point, which can be represented in a variety
> of ways, depending on the encoding used. So, moving on....
> Finally, you can have a sequence of bytes, which are stored in a string as
> a buffer, that shows the particular encoding of a particular string:
> e8 = s.encode("UTF-8")
> e16 = s.encode("UTF-16")
> Now, e8 and e16 are each strings (of bytes), the content of which tells
> you how the string of characters that was encoded is represented in that
> particular encoding.
I would say that there are two kinds of strings, byte strings and
unicode strings. Byte strings have an implicit encoding. If the contents
of the byte string are all ascii characters, you can generally get away
with ignoring that they are in an encoding, because most of the common
8-bit character encodings include plain ascii as a subset (all the
latin-x encodings, all the Windows cp12xx encodings, and utf-8 all have
ascii as a subset), so an ascii string can be interpreted as any of
those encodings without error. As soon as you get away from ascii, you
have to be aware of the encoding of the string.
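A minimal sketch of the ascii-subset point, written in Python 3 syntax (an assumption on my part, since this thread is about Python 2), where bytes plays the role of the byte string and str the role of the unicode string. The list of encodings is only a sample chosen for illustration.

```python
# Python 3 sketch: bytes is the byte string, str is the unicode string.
ascii_bytes = b"abc"

# A sample of encodings that all contain ascii as a subset.
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# Pure-ascii bytes decode to the same text under every one of them.
decoded = {enc: ascii_bytes.decode(enc) for enc in encodings}
all_same_text = len(set(decoded.values())) == 1

# Going the other way, pure-ascii text encodes to identical bytes too.
encoded = {enc: "abc".encode(enc) for enc in encodings}
all_same_bytes = len(set(encoded.values())) == 1

print(all_same_text, all_same_bytes)
```

This only holds while every byte is below 0x80; past that point the choice of encoding changes the result.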
encode() really wants a unicode string, not a byte string. If you call
encode() on a byte string, the string is first converted to unicode
using the default encoding (usually ascii), then converted with the
encoding you requested.
> In hex, these look like this.
> e8: 616263 (61 for 'a'; 62 for 'b', 63 for 'c')
> e16: FFFE6100 62006300
> (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')
> Now, superficially, s and e8 are equal, because for plain old ascii
> characters (which is all I've used in this example), UTF-8 is equivalent
> to ascii. And they compare the same:
> >>> s == e8
> True
They are equal in every sense; I don't know why you consider this
superficial. And if your original string was not ascii, the encode()
would fail with a UnicodeDecodeError.
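Here is a rough sketch of why encode() on a byte string can raise a *decode* error. Python 3 no longer allows bytes.encode(), so the two implicit steps are spelled out in a hypothetical helper, py2_style_encode(), which is my own name and not part of any library. It assumes the default encoding is ascii.

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Hypothetical helper imitating Python 2's str.encode() on a byte string."""
    # Step 1: implicit conversion to unicode using the default encoding.
    text = byte_string.decode(default_encoding)
    # Step 2: conversion of that unicode text to the requested encoding.
    return text.encode(encoding)


# Pure ascii survives step 1, so the UTF-8 result equals the input bytes.
e8 = py2_style_encode(b"abc", "utf-8")

# Non-ascii bytes (the UTF-8 form of U+00E9 here) fail in step 1,
# before the requested encoding is ever used.
try:
    py2_style_encode(b"caf\xc3\xa9", "utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    error_name = type(exc).__name__

print(e8, failed)
```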
> But that's not true of the UTF-16:
> >>> s == e16
> False
> >>> e8 == e16
> False
> So (and I'm open to correction on this), I think of the encode() method as
> returning a string of bytes that represents the particular encoding of a
> string value -- and it can't be used as the string value itself.
The idea that there is somehow some kind of string value that doesn't
have an encoding will bring you a world of hurt as soon as you venture
out of the realm of pure ascii. Every string is a particular encoding of
character values. It's not any different from "the string value itself".
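To make that concrete, here is a small sketch (Python 3 syntax, which is my assumption rather than what the thread uses) showing that the same bytes stand for different characters depending on the encoding you read them with, and that the same character becomes different bytes depending on the encoding you write it with.

```python
raw = b"\xc3\xa9"  # two bytes; what they mean depends on the encoding

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3, U+00A9

# The same bytes give different text.
print(len(as_utf8), len(as_latin1))

# The same character gives different bytes.
e_acute = "\u00e9"
utf8_bytes = e_acute.encode("utf-8")      # two bytes
latin1_bytes = e_acute.encode("latin-1")  # one byte
print(utf8_bytes != latin1_bytes)
```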
> But you can get that string value back (assuming all the characters map
> to ascii):
> >>> s8 = e8.decode("UTF-8")
> >>> s16 = e16.decode("UTF-16")
> >>> s == s8 == s16
> True
You can get back to the ascii-encoded representation of the string.
Though here you are hiding something - s8 and s16 are unicode strings
while s is a byte string.
In [1]: s = 'abc'
In [2]: e8 = s.encode("UTF-8")
In [3]: e16 = s.encode("UTF-16")
In [4]: s8 = e8.decode("UTF-8")
In [5]: s16 = e16.decode("UTF-16")
In [6]: s8
Out[6]: u'abc'
In [7]: s16
Out[7]: u'abc'
In [8]: s
Out[8]: 'abc'
In [9]: type(s8) == type(s)
Out[9]: False
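For comparison, a sketch of the same round trip in Python 3 syntax (my assumption; the session above is Python 2). There the literal 'abc' is already a unicode string, so the type difference is no longer hidden: it shows up between s and the encoded values instead.

```python
s = "abc"                    # text: the Python 3 str is a unicode string
e8 = s.encode("utf-8")       # bytes
e16 = s.encode("utf-16")     # bytes, starting with a byte order mark
s8 = e8.decode("utf-8")      # back to text
s16 = e16.decode("utf-16")   # back to text; the BOM is consumed

# The decoded values are text, like s; the encoded values are bytes.
same_type_after_round_trip = type(s8) is type(s) and type(s16) is type(s)
encoded_is_bytes = isinstance(e8, bytes) and isinstance(e16, bytes)
round_trip_equal = (s == s8 == s16)

# Hex view of the encoded bytes; the UTF-16 byte order depends on the platform.
print(e8.hex())
print(e16.hex())
print(same_type_after_round_trip, encoded_is_bytes, round_trip_equal)
```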
The way I think of it is, unicode is the "pure" representation of the
string. (This is nonsense, I know, but I find it a convenient mnemonic.)
encode() converts from the "pure" representation to an encoded
representation. The encoding can be ascii, latin-1, utf-8... decode()
converts from the coded representation back to the "pure" one.
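A short sketch of that mnemonic with a non-ascii character, again in Python 3 syntax (an assumption, since the thread is about Python 2). The word used is just an example.

```python
text = "caf\u00e9"  # the "pure" unicode representation, ending in U+00E9

# encode(): pure representation -> encoded representation
as_latin1 = text.encode("latin-1")  # one byte per character here
as_utf8 = text.encode("utf-8")      # U+00E9 takes two bytes

# ascii has no code for U+00E9, so this encoding cannot represent the text.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# decode(): encoded representation -> pure representation
back_from_latin1 = as_latin1.decode("latin-1")
back_from_utf8 = as_utf8.decode("utf-8")

print(len(as_latin1), len(as_utf8), ascii_ok)
print(back_from_latin1 == text == back_from_utf8)
```

The encoded forms differ, but decoding each with the encoding that produced it gets you back to the same unicode string.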