[Python-Dev] bytes.from_hex()

Mon Feb 27 12:41:25 CET 2006

Stephen J. Turnbull wrote:

>     Greg> I'd be perfectly happy with ascii characters, but in Py3k,
>     Greg> the most natural place to keep ascii characters will be in
>     Greg> character strings, not byte arrays.
> 
> Natural != practical.

That seems to be another thing we disagree about --
to me it seems both natural *and* practical.

The whole business of stuffing binary data down a
text channel is a practicality-beats-purity kind of
thing. You wouldn't do it if you had a real binary
channel available, but if you don't, it's better
than nothing.

> The base64 string is a representation of an object
 > that doesn't have text semantics.

But the base64 string itself *does* have text
semantics. That's the whole point of base64 --
to represent a non-text object *using* text.

To me this is no different than using a string
of decimal digit characters to represent an
integer, or a string of hexadecimal digit
characters to represent a bit pattern. Would
you say that those are not text, either?

What about XML? What would you consider the
proper data type for an XML document to be
inside a Python program -- bytes or text?
I'm genuinely interested in your answer to
that, because I'm trying to understand where
you draw the line between text and non-text.

You seem to want to reserve the term "text" for
data that doesn't ever have to be understood
even a little bit by a computer program, but that
seems far too restrictive to me, and a long
way from established usage.

> Nor do base64 strings have text semantics: they can't even
> be concatenated as text ...  So if you
> wish to concatenate the underlying objects, the base64 strings must be
> decoded, concatenated, and re-encoded in the general case.

You can't add two integers by concatenating
their base-10 character representation, either,
but I wouldn't take that as an argument against
putting decimal numbers into text files.

Also, even if we follow your suggestion and store
our base64-encoded data in byte arrays, we *still*
wouldn't be able to concatenate the original data
just by concatenating those byte arrays. So this
argument makes no sense either way.

> IMO it's not worth preserving the very superficial
 > coincidence of "character representation"

I disagree entirely that it's superficial. On the
contrary, it seems to me to be very essence of
what base64 is all about.

If there's any "coincidence of representation" it's
in the idea of storing the result as ASCII bit patterns
in a byte array, on the assumption that that's probably
how they're going to end up being represented down the
line.

That assumption could be very wrong. What happens if
it turns out they really need to be encoded as UTF-16,
or as EBCDIC? All hell breaks loose, as far as I can
see, unless the programmer has kept very firmly in
mind that there is an implicit ASCII encoding involved.

It's exactly to avoid the need for those kinds of
mental gymnastics that Py3k will have a unified,
encoding-agnostic data type for all character strings.

> I think that fact that favoring the coincidence of representation
> leads you to also deprecate the very natural use of the codec API to
> implement and understand base64 is indicative of a deep problem with
> the idea of implementing base64 as bytes->unicode.

Not sure I'm following you. I don't object to
implementing base64 as a codec, only to exposing
it via the same interface as the "real" unicode
codecs like utf8, etc. I thought we were in
agreement about that.

If you're thinking that the mere fact its input
type is bytes and its output type is characters is
going to lead to its mistakenly appearing via that
interface, that would be a bug or design flaw in
the mechanism that controls which codecs appear
via that interface. It needs to be controlled by
something more than just the input and output
types.

Greg