[Python-Dev] bytes.from_hex()

Fri Feb 24 01:25:50 CET 2006

Stephen J. Turnbull wrote:

> Please define "character," and explain how its semantics map to
> Python's unicode objects.

One of the 65 abstract entities referred to in the RFC
and represented in that RFC by certain visual glyphs.
There is a subset of the Unicode code points that
are conventionally associated with very similar glyphs,
so that there is an obvious one-to-one mapping between
these entities and those Unicode code points. These
entities therefore have a natural and obvious
representation using Python unicode strings.

> No, base64 isn't a wire protocol.  Rather, it's a schema for a family
> of wire protocols, whose alphabets are heuristically chosen on the
> assumption that code units which happen to correspond to alpha-numeric
> code points in a commonly-used coded character set are more likely to
> pass through a communication channel without corruption.

Yes, and it's up to the programmer to choose those code
units (i.e. pick an encoding for the characters) that
will, in fact, pass through the channel he is using
without corruption. I don't see how any of this is
inconsistent with what I've said.

> Only if you do no transformations that will harm the base64-encoding.
> ...  It doesn't allow any of the
> usual transformations on characters that might be applied globally to
> a mail composition buffer, for example.

I don't understand that. Obviously if you rot13 your
mail message or turn it into pig latin or something,
it's going to mess up any base64 it might contain.
But that would be a silly thing to do to a message
containing base64.

Given any piece of text, there are things it makes
sense to do with it and things it doesn't, depending
entirely on the use to which the text will eventually
be put. I don't see how base64 is any different in
this regard.

> So then you bring it right back in with base64.  Now they need to know
> about bytes<->unicode codecs.

No, they need to know about the characteristics of
the channel over which they're sending the data.

Base64 is designed for situations in which you
have a *text* channel that you know is capable of
transmitting at least a certain subset of characters,
where "character" means whatever is used as input
to that channel.

In Py3k, text will be represented by unicode strings.
So a Py3k text channel should take unicode as its
input, not bytes.

I think we've got a bit sidetracked by talking about
mime. I wasn't actually thinking about mime, but
just a plain text message into which some base64
data was being inserted. That's the way we used to
do things in the old days with uuencode etc, before
mime was invented.

Here, the "channel" is NOT the socket or whatever
that the ultimate transmission takes place over --
it's the interface to your mail sending software
that takes a piece of plain text and sends it off
as a mail message somehow.

In Py3k, if a channel doesn't take unicode as input,
then it's not a text channel, and it's not appropriate
to be using base64 with it directly. It might be
appropriate to to use base64 followed by some encoding,
but the programmer needs to be aware of that and
choose the encoding wisely. It's not possible to
shield him from having to know about encodings in
that situation, even if the encoding is just ascii.
Trying to do so will just lead to more confusion,
in my opinion.

Greg