[Python-Dev] bytes.from_hex()

Greg Ewing greg.ewing at canterbury.ac.nz
Wed Feb 22 12:35:39 CET 2006


Stephen J. Turnbull wrote:

> Base64 is a (family of) wire protocol(s).  It's not clear to me that
> it makes sense to say that the alphabets used by "baseNN" encodings
> are composed of characters,

Take a look at

   http://en.wikipedia.org/wiki/Base64

where it says

   ...base64 is a binary to text encoding scheme whereby an
   arbitrary sequence of bytes is converted to a sequence of
   printable ASCII characters.

Also see RFC 2045 (http://www.ietf.org/rfc/rfc2045.txt) which
defines base64 in terms of an encoding from octets to characters,
and also says

   A 65-character subset of US-ASCII is used ... This subset has
   the important property that it is represented identically in
   all versions of ISO 646 ... and all characters in the subset
   are also represented identically in all versions of EBCDIC.

Which seems to make it perfectly clear that the result
of the encoding is to be considered as characters, which
are not necessarily going to be encoded using ASCII.

So base64 on its own is *not* a wire protocol. Only after
encoding the characters do you have a wire protocol.

> I don't see any case for
> "correctness" here, only for convenience,

I'm thinking of convenience, too. Keep in mind that in Py3k,
'unicode' will be called 'str' (or something equally neutral
like 'text') and you will rarely have to deal explicitly with
unicode codings, this being done mostly for you by the I/O
objects. So most of the time, using base64 will be just as
convenient as it is today: base64_encode(my_bytes) and write
the result out somewhere.
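
For concreteness, here's a sketch of what that might look like.
The name base64_encode and its returning text are assumptions
from this proposal, not the current library:

    import base64

    def base64_encode(data):
        # Hypothetical Py3k-style helper: bytes in, *text* out.
        # The base64 alphabet is a subset of ASCII, so decoding
        # the library's byte result as ASCII always succeeds.
        return base64.b64encode(data).decode('ascii')

    my_bytes = b'\x00\xff\x10binary payload'
    text = base64_encode(my_bytes)   # 'AP8QYmluYXJ5IHBheWxvYWQ='
    # ...and write the text out somewhere.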

The reason I say it's *correct* is that if you go straight
from bytes to bytes, you're *assuming* the eventual encoding
is going to be an ASCII superset. The programmer has to know
about this assumption, understand all its consequences,
decide whether it's right, and if not, do something to
change it.

Whereas if the result is text, the right thing happens
automatically whatever the ultimate encoding turns out to
be. You can take the text from your base64 encoding, combine
it with other text from any other source to form a complete
mail message or xml document or whatever, and write it out
through a file object that's using any unicode encoding
at all, and the result will be correct.
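
A small demonstration of the difference, assuming a UTF-16
output encoding (any non-ASCII-superset encoding would do):

    import base64

    payload = base64.b64encode(b'\x01\x02\x03').decode('ascii')  # 'AQID'
    document = '<data encoding="base64">' + payload + '</data>'

    # Text route: the base64 characters are transcoded along
    # with the rest of the document, whatever the encoding is.
    wire = document.encode('utf-16')          # correct

    # Bytes route: splicing raw base64 bytes into the stream
    # silently corrupts it, because UTF-16 uses two bytes per
    # character while the base64 bytes use one.
    header = '<data encoding="base64">'.encode('utf-16')
    broken = header + base64.b64encode(b'\x01\x02\x03')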

> it's also efficient to use bytes<->bytes for XML, since
> conversion of base64 bytes to UTF-8 characters is simply a matter of
> "Simon says, be UTF-8!"

Efficiency is an implementation concern. In Py3k, strings
which contain only ASCII or Latin-1 might be stored as
1 byte per character, in which case this would not be an
issue.
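
Indeed, the "conversion" in question is a no-op: every
character in the base64 alphabet is ASCII, and ASCII bytes
are unchanged under UTF-8. A quick check:

    import base64

    b64 = base64.b64encode(b'\xde\xad\xbe\xef')
    assert b64.decode('ascii').encode('utf-8') == b64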

> And in the classroom, you're just going to confuse students by telling
> them that UTF-8 --[Unicode codec]--> Python string is decoding but
> UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
> telling them that --> Python string is always decoding.

Which is why I think that only *unicode* codings should be
available through the .encode and .decode interface. Or
alternatively there should be something more explicit like
.unicode_encode and .unicode_decode that is thus restricted.
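
A sketch of what such a restricted interface might look like,
using the names proposed above (the functions are hypothetical;
codecs.encode/decode just stand in for whatever mechanism would
sit underneath):

    import codecs

    def unicode_encode(text, encoding):
        # Accept only true unicode codecs: text in, bytes out.
        data = codecs.encode(text, encoding)
        if not isinstance(data, bytes):
            raise TypeError('%s is not a unicode encoding' % encoding)
        return data

    def unicode_decode(data, encoding):
        # The mirror image: bytes in, text out.
        text = codecs.decode(data, encoding)
        if not isinstance(text, str):
            raise TypeError('%s is not a unicode encoding' % encoding)
        return text

    unicode_encode('hello', 'utf-16')     # fine: text -> bytes
    # unicode_encode('hello', 'rot_13')   # raises: text -> text transform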

Also, if most unicode coding is done in the I/O objects, there
will be far less need for programmers to do explicit unicode
coding in the first place, so it will likely become more of
an advanced topic rather than something you need to come to
grips with on day one of using unicode, as it is now.
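
To illustrate the idea with a text-layer file object
(io.TextIOWrapper here, purely as an example of an I/O object
that owns the coding; the surrounding code is a sketch):

    import io

    buf = io.BytesIO()
    f = io.TextIOWrapper(buf, encoding='utf-16')
    f.write('a document containing base64 text: AQID')
    f.flush()
    wire = buf.getvalue()   # UTF-16 bytes, produced by the I/O object

The program only ever handles text; the coding happens once,
at the boundary.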

--
Greg

