Stephen J. Turnbull
stephen at xemacs.org
Thu Feb 23 07:05:43 CET 2006
>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:
Greg> Stephen J. Turnbull wrote:
>> Base64 is a (family of) wire protocol(s). It's not clear to me
>> that it makes sense to say that the alphabets used by "baseNN"
>> encodings are composed of characters,
Greg> Take a look at [this that the other]
Those references use "character" in an ambiguous and ill-defined way.
Trying to impose Python unicode object semantics on "vague characters"
is a bad idea IMO.
Greg> Which seems to make it perfectly clear that the result of
Greg> the encoding is to be considered as characters, which are
Greg> not necessarily going to be encoded using ascii.
Please define "character," and explain how its semantics map to
Python's unicode objects.
Greg> So base64 on its own is *not* a wire protocol. Only after
Greg> encoding the characters do you have a wire protocol.
No, base64 isn't a wire protocol. Rather, it's a schema for a family
of wire protocols, whose alphabets are heuristically chosen on the
assumption that code units which happen to correspond to alpha-numeric
code points in a commonly-used coded character set are more likely to
pass through a communication channel without corruption.
Note that I have _precisely_ defined what I mean. You still have the
problem that you haven't defined character, and that is a real
problem; see below.
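To make the "family" point concrete, here is a sketch in Python 3
terms (using the stdlib base64 module): two members of the family
differ only in the code units chosen for values 62 and 63:

    import base64

    data = b"\xfb\xff\xfe"

    # Two members of the "baseNN" family, differing only in alphabet:
    base64.b64encode(data)          # b'+//+'  -- standard alphabet ends '+', '/'
    base64.urlsafe_b64encode(data)  # b'-__-'  -- URL-safe member uses '-', '_'

Note that both functions return bytes, not str: the output is a
sequence of code units chosen for channel safety, not characters in
the Python-string sense.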
>> I don't see any case for "correctness" here, only for convenience.
Greg> I'm thinking of convenience, too. Keep in mind that in Py3k,
Greg> 'unicode' will be called 'str' (or something equally neutral
Greg> like 'text') and you will rarely have to deal explicitly
Greg> with unicode codings, this being done mostly for you by the
Greg> I/O objects. So most of the time, using base64 will be just
Greg> as convenient as it is today: base64_encode(my_bytes) and
Greg> write the result out somewhere.
Convenient, yes, but incorrect. Once you mix those bytes into the
Python string type, they become subject to all the usual operations on
characters, and there's no way for Python to tell you that you didn't
want to do that. I.e.:
Greg> Whereas if the result is text, the right thing happens
Greg> automatically whatever the ultimate encoding turns out to
Greg> be. You can take the text from your base64 encoding, combine
Greg> it with other text from any other source to form a complete
Greg> mail message or xml document or whatever, and write it out
Greg> through a file object that's using any unicode encoding at
Greg> all, and the result will be correct.
Only if you do no transformations that will harm the base64-encoding.
This is why I say base64 is _not_ based on characters, at least not in
the way they are used in Python strings. It doesn't allow any of the
usual transformations on characters that might be applied globally to
a mail composition buffer, for example.
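To make the hazard concrete, a sketch in Python 3 terms (the header
name is made up): lowercasing a composition buffer is harmless for
prose, but silently corrupts an embedded base64 payload, because the
base64 alphabet is case-sensitive:

    import base64

    payload = base64.b64encode(b"\x00\xffhi").decode("ascii")  # 'AP9oaQ=='
    buffer = "X-Demo: 1\n\n" + payload

    mangled = buffer.lower()           # a routine operation on characters
    recovered = base64.b64decode(mangled.split("\n\n")[1])
    assert recovered != b"\x00\xffhi"  # lowercased input is still *valid*
                                       # base64, so it decodes, silently,
                                       # to the wrong bytes

Nothing in the string type can flag this, because as far as characters
are concerned, nothing went wrong.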
In other words, you don't escape the need for the programmer to know
what he's doing. EIBTI, and the setup I advocate forces the
programmer to decide explicitly where to convert base64 objects to a
textual representation. This reminds him that he'd better not touch
the result after that point.
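A minimal sketch of that discipline in Python 3 terms (the function
name is hypothetical): the base64 result stays bytes until the single,
explicit point where the programmer decides it becomes text:

    import base64

    def attach_payload(body: str, blob: bytes) -> str:
        # The explicit conversion point: here, and only here, the
        # programmer declares the base64 bytes to be ASCII text.
        payload = base64.b64encode(blob).decode("ascii")
        return body + "\n\n" + payload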
Greg> The reason I say it's *correct* is that if you go straight
Greg> from bytes to bytes, you're *assuming* the eventual encoding
Greg> is going to be an ascii superset. The programmer is going
Greg> to have to know about this assumption and understand all its
Greg> consequences and decide whether it's right, and if not, do
Greg> something to change it.
I'm not assuming any such thing, except in the context of analyzing
implementation efficiency. And the programmer needs to know the
semantics of text that is actually a base64-encoded object, and that
they differ from string semantics.
This is something that programmers are used to dealing with in the
case of Python 2.x str and C char; the whole point of the unicode
type is to allow the programmer to abstract from that when dealing
with human-readable text. Why confuse the issue?
>> And in the classroom, you're just going to confuse students by
>> telling them that UTF-8 --[Unicode codec]--> Python string is
>> decoding but UTF-8 --[base64 codec]--> Python string is
>> encoding, when MAL is telling them that --> Python string is
>> always decoding.
Greg> Which is why I think that only *unicode* codings should be
Greg> available through the .encode and .decode interface. Or
Greg> alternatively there should be something more explicit like
Greg> .unicode_encode and .unicode_decode that is thus restricted.
Greg> Also, if most unicode coding is done in the I/O objects,
Greg> there will be far less need for programmers to do explicit
Greg> unicode coding in the first place, so likely it will become
Greg> more of an advanced topic, rather than something you need to
Greg> come to grips with on day one of using unicode, like it is now.
So then you bring it right back in with base64. Now they need to know
about bytes<->unicode codecs.
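To spell out the terminology clash, a minimal sketch in Python 3
terms (MAL's rule is the one quoted above):

    import base64

    raw = "naïve".encode("utf-8")  # text -> bytes: "encoding"
    text = raw.decode("utf-8")     # bytes -> text: "decoding" -- so far
                                   # "-> str is always decoding" holds

    # But if base64 output is "characters", this bytes -> str step has
    # to be called *encoding*, and the simple rule is broken:
    b64 = base64.b64encode(raw).decode("ascii")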
Of course it all comes down to a matter of judgment. I do find your
position attractive, but I just don't think it will work for naive
users the way you think it will. It's also possible to make a precise
statement of the rationale for my approach, which I have not been able
to achieve for the "base64 uses characters" approach, and nobody else
has demonstrated one, yet.
On the other hand, I don't think either approach imposes substantially
more burden on the advanced programmer, nor does either proposal
involve a specific restriction on usage (aka "dumbing down the
language").
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.