[Python-Dev] bytes type discussion

Wed Feb 15 21:53:06 CET 2006

On Tue, 14 Feb 2006 19:41:07 -0500, "Raymond Hettinger" <python at rcn.com> wrote:

>[Guido van Rossum]
>> Somewhat controversial:
>>
>> - bytes("abc") == bytes(map(ord, "abc"))
>
>At first glance, this seems obvious and necessary, so if it's somewhat 
>controversial, then I'm missing something.  What's the issue?
>
ord("x") returns the value that "x" has in the source encoding, but if that
encoding is not unicode or latin-1, it will break when Py 3000 makes "x" unicode.
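
To make the encoding dependence concrete, here is a small sketch in modern
Python 3 (where every str is already unicode, so this is not the Python 2
behaviour under discussion, only an illustration of it). The names ch,
code_point and as_bytes are made up for the example; cp437 stands in for
"some source encoding that is not latin-1".

```python
# One character, three encodings: only latin-1 agrees with the code point.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)  # the unicode code point, independent of any encoding

# The byte(s) a source file would contain for this character in each encoding.
as_bytes = {enc: ch.encode(enc) for enc in ("latin-1", "utf-8", "cp437")}

for enc, raw in as_bytes.items():
    print(enc, [hex(b) for b in raw], "code point:", hex(code_point))
```

Only the latin-1 byte equals the code point, which is why taking ord() of a
plain byte string gives different answers depending on the source encoding.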

This means that, until Py 3000, plain str string literals have to use ascii and
escapes in order to preserve their meaning when "x" == u"x".

But the good news is that bytes(map(ord, u"x")) works fine for any source encoding,
now or after Py 3000. You just have to type characters into your editor
between the quotes that look on the screen like any of the first 256 unicode characters
(or use ascii escapes for unshowables). The u"x" literal translates x into unicode according
to the *character* x, whatever the source encoding, so all you have to do is
choose characters from the first 256 unicode code points. These happen to be latin-1, but you can
ignore that unless you are interested in the actual byte values. If they have byte meaning, escapes
are clearer anyway, and they work in a unicode string (where "x".decode(source_encoding) might
fail on an illegal character).

The solution is to use u"x" for now or use ascii-only with escapes, and just
map ord on either kind of string. This should work when u"x"
becomes equivalent to "x". The unicode that comes from a current u"x" string
defines a *character* sequence. If you use legal latin-1 *characters* in
whatever source encoding your editor and coding cookie say, you will get
the *characters* you see inside the quotes in the u"..." literal translated
to unicode, and the first 256 characters of unicode happen to be the latin-1 set,
so map ord just works. With a unicode string you don't have to think about encoding,
just use ord/unichr in range(0,256). Hex escapes within unicode strings work as expected,
so IMO it's pretty clean.
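
A minimal sketch of the idea above, written for Python 3, where "x" and u"x"
are already the same type. The helper name to_bytes is hypothetical and is
not part of any proposal in this thread; it only wraps bytes(map(ord, ...)).

```python
def to_bytes(text):
    """Map each character to its code point and build a bytes object."""
    return bytes(map(ord, text))

# Plain ascii behaves as expected.
abc = to_bytes(u"abc")

# Every one of the first 256 code points maps to the byte of the same value,
# which is exactly what the latin-1 codec produces.
first_256 = "".join(chr(i) for i in range(256))
round_trip = to_bytes(first_256)

# Hex escapes inside a unicode literal give the byte values directly.
escaped = to_bytes(u"\x00\x7f\xff")

# A character outside range(0, 256) cannot become a single byte.
try:
    to_bytes(u"\u20ac")  # EURO SIGN
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

The ValueError in the last step is the Python 3 behaviour of bytes() when
given an integer outside range(0, 256).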

I think I have shown this in a couple of other posts in the original thread
(where I created source code in several encodings, including utf-8,
compiled it with coding cookies, and exec'd the result).
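
The kind of experiment described above might look roughly like the following
in Python 3. This is a reconstruction under assumptions, not the code from the
earlier posts: the TEMPLATE string, the variable names, and the choice of
encodings (latin-1, utf-8, cp437) are all made up for illustration.

```python
# Source text with a PEP 263 coding cookie and one unicode literal.
TEMPLATE = '# -*- coding: {enc} -*-\ntext = u"caf\u00e9"\n'

results = {}
for enc in ("latin-1", "utf-8", "cp437"):
    # Encode the same source text into different on-disk byte sequences.
    raw_source = TEMPLATE.format(enc=enc).encode(enc)

    # compile() honours the coding cookie when given bytes.
    namespace = {}
    exec(compile(raw_source, "<demo-" + enc + ">", "exec"), namespace)

    # The literal yields the same characters, hence the same bytes.
    results[enc] = bytes(map(ord, namespace["text"]))

for enc, value in results.items():
    print(enc, value)
```

The raw source bytes differ between the three runs, but the resulting unicode
string, and therefore bytes(map(ord, ...)), comes out the same each time.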

I could always have overlooked something, but I am hopeful.

Regards,
Bengt Richter