[Python-Dev] bytes.from_hex()

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 22 10:48:16 CET 2006


>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

    Greg> Stephen J. Turnbull wrote:

    >> What I advocate for Python is to require that the standard
    >> base64 codec be defined only on bytes, and always produce
    >> bytes.

    Greg> I don't understand that. It seems quite clear to me that
    Greg> base64 encoding (in the general sense of encoding, not the
    Greg> unicode sense) takes binary data (bytes) and produces
    Greg> characters.

Base64 is a (family of) wire protocol(s).  It's not clear to me that
it makes sense to say that the alphabets used by "baseNN" encodings
are composed of characters, but suppose we stipulate that.

    Greg> So in Py3k the correct usage would be [bytes<->unicode].

IMHO, as a wire protocol, base64 simply doesn't care what Python's
internal representation of characters is.  I don't see any case for
"correctness" here, only for convenience, both for programmers on the
job and students in the classroom.  We can choose the character set
that works best for us.  I think that's 8-bit US ASCII.
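
A minimal sketch of the point above, assuming the present-day Python 3
standard library (which postdates this message): base64.b64encode
takes bytes and returns bytes, and every byte it emits falls inside
the ASCII range.  The names ALPHABET and payload are illustrative
only.

```python
import base64
import string

# The 64-symbol standard alphabet plus "=" padding, as ASCII byte values.
ALPHABET = set((string.ascii_letters + string.digits + "+/=").encode("ascii"))

payload = bytes(range(256))          # arbitrary binary data
encoded = base64.b64encode(payload)  # bytes in, bytes out

assert isinstance(encoded, bytes)
assert set(encoded) <= ALPHABET      # every output byte is in the alphabet
assert max(encoded) < 128            # so the output is pure 7-bit ASCII

# Turning the result into a string is a separate, optional step, and
# ASCII and UTF-8 agree on these bytes.
assert encoded.decode("ascii") == encoded.decode("utf-8")
assert base64.b64decode(encoded) == payload
```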

My belief is that bytes<->bytes is going to be the dominant use case,
although I don't use binary representations in XML myself.  However,
AFAIK UTF-8 is strongly recommended for XML on the wire, and in that
case bytes<->bytes is efficient for XML too, since converting base64
bytes to UTF-8 characters is simply a matter of
"Simon says, be UTF-8!"

And in the classroom, you're just going to confuse students by telling
them that UTF-8 --[Unicode codec]--> Python string is decoding but
UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
telling them that --> Python string is always decoding.

Sure, it all makes sense if you already know what's going on.  But I
have trouble remembering, especially in cases like UTF-8 vs UTF-16
where Perl and Python have opposite internal representations, and
glibc has a third which isn't either.  If base64 (and gzip, etc.) are
all considered bytes<->bytes, there just isn't an issue any more.  The
simple rule wins: "to Python string" is always decoding.
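
As a rough illustration of that rule, here is a sketch assuming
Python 3, where str is the unicode type and the standard base64 and
gzip modules both work bytes<->bytes.  The variable names are
hypothetical.  The only steps that cross between text and bytes are
the explicit UTF-8 encode and decode.

```python
import base64
import gzip

text = "na\u00efve caf\u00e9"        # a Python (unicode) string

# String -> bytes: the one and only *encoding* step.
utf8 = text.encode("utf-8")

# Everything below is bytes <-> bytes; no character set is involved.
wire = base64.b64encode(gzip.compress(utf8))
assert isinstance(wire, bytes)

# Undo the bytes <-> bytes layers in reverse order.
recovered_bytes = gzip.decompress(base64.b64decode(wire))
assert recovered_bytes == utf8

# Bytes -> string: the one and only *decoding* step.
recovered = recovered_bytes.decode("utf-8")
assert recovered == text
```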

Why fight it when we can run away with efficiency gains?<wink>

(In the above, "Python string" means the unicode type, not str.)

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

