[Python-Dev] bytes.from_hex()

Stephen J. Turnbull stephen at xemacs.org
Fri Feb 24 12:05:55 CET 2006


>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

    Greg> Stephen J. Turnbull wrote:

    >> No, base64 isn't a wire protocol.  It's a family[...].

    Greg> Yes, and it's up to the programmer to choose those code
    Greg> units (i.e. pick an encoding for the characters) that will,
    Greg> in fact, pass through the channel he is using without
    Greg> corruption. I don't see how any of this is inconsistent with
    Greg> what I've said.

It's not.  It just shows that there are other "correct" ways to think
about the issue.

    >> Only if you do no transformations that will harm the
    >> base64-encoding.  ...  It doesn't allow any of the usual
    >> transformations on characters that might be applied globally to
    >> a mail composition buffer, for example.

    Greg> I don't understand that. Obviously if you rot13 your mail
    Greg> message or turn it into pig latin or something, it's going
    Greg> to mess up any base64 it might contain.  But that would be a
    Greg> silly thing to do to a message containing base64.

What "message containing base64"?  "Any base64 in there?"  "Nope,
nobody here but us Unicode characters!"  I certainly hope that in Py3k
bytes objects will have neither ROT13 nor case-changing methods, but
str objects certainly will.  Why give up the safety of that
distinction?
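
A minimal sketch of that hazard, assuming a Python 3 in which str and bytes are distinct types (the variable names are illustrative and not from this thread). Once base64 output is treated as ordinary text, a routine text operation such as upper-casing still leaves something that decodes without complaint, but the recovered bytes are no longer the original payload:

```python
import base64

payload = b"hello world"

# base64 yields bytes; decoding as ASCII turns them into an ordinary str.
armored_text = base64.b64encode(payload).decode("ascii")

# A "harmless" global text transformation, as might be applied to a
# mail composition buffer.
mangled_text = armored_text.upper()

# The mangled text is still syntactically valid base64, so decoding
# succeeds silently, but the result differs from the original payload.
recovered = base64.b64decode(mangled_text)

print(recovered == payload)
```

Nothing in the str type warns that the text was fragile; keeping the base64 form as bytes removes the case-changing method from reach in the first place.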

    Greg> Given any piece of text, there are things it makes sense to
    Greg> do with it and things it doesn't, depending entirely on the
    Greg> use to which the text will eventually be put.  I don't see
    Greg> how base64 is any different in this regard.

If you're going to be binary about it, it's not different.  However,
the kind of "text" for which Unicode was designed is normally produced
and consumed by people, who wll pt up w/ ll knds f nnsns.  Base64
decoders will not put up with the same kinds of nonsense that people
will.

You're basically assuming that the person who implements the code that
processes a Unicode string is the same person who implemented the code
that converts a binary object into base64 and inserts it into a
string.  I think that's a dangerous (and certainly invalid) assumption.

I know I've lost time and data to applications that make assumptions
like that.  In fact, that's why "MULE" is a four-letter word in Emacs
channels.<wink>

    >> So then you bring it right back in with base64.  Now they need
    >> to know about bytes<->unicode codecs.

    Greg> No, they need to know about the characteristics of the
    Greg> channel over which they're sending the data.

I meant it in a trivial sense: "How do you use a bytes<->unicode codec
properly without knowing that it's a bytes<->unicode codec?"

In most environments, it should be possible to hide bytes<->unicode
codecs almost all the time, and I think that's a very good thing.  I
don't think it's a good idea to gratuitously introduce wire protocols
as unicode codecs, even if a class of bit patterns which represent the
integer 65 are denoted "A" in various sources.  Practicality beats
purity (especially when you're talking about the purity of a pregnant
virgin).
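
To make the two layers concrete, here is a hedged sketch, assuming the standard-library base64 module as found in Python 3 (the variable names are hypothetical). The base64 step maps bytes to bytes, and the choice of character encoding for the channel is a separate, explicit step:

```python
import base64

raw = bytes([0, 255, 65])

# Layer 1: base64 is a bytes -> bytes transformation.
armored = base64.b64encode(raw)

# Layer 2: only if the data must live in a text buffer does a
# bytes<->unicode codec enter, and the channel decides which one.
text = armored.decode("ascii")
wire_ascii = text.encode("ascii")
wire_utf16 = text.encode("utf-16-le")

# The same characters become different code units on different channels.
print(wire_ascii == armored, wire_utf16 == armored)

# Handing a str straight to the encoder is rejected rather than guessed at.
try:
    base64.b64encode("not bytes")
    rejected_str = False
except TypeError:
    rejected_str = True
print(rejected_str)
```

The point of the sketch is only that the two steps can be kept apart; it is not a claim about how any particular API should be designed.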

    Greg> It might be appropriate to use base64 followed by some
    Greg> encoding, but the programmer needs to be aware of that and
    Greg> choose the encoding wisely. It's not possible to shield him
    Greg> from having to know about encodings in that situation, even
    Greg> if the encoding is just ascii.

What do you think the email module does?  Assuming conforming MIME
messages and receivers capable of handling UTF-8, the user of the
email module does not need to know anything about any encodings at
all.  With a little more smarts, the email module could even make a
good choice of output encoding based on the _language_ of the text,
removing the restriction to UTF-8 on the output side, too.  With the
aid of file(1), it can make excellent guesses about attachments.

Sure, the email module programmer needs to know, but the email module
programmer needs to know an awful lot about codecs anyway, since mail
at that level is a binary channel, while users will be throwing a
mixed bag of binary and textual objects at it.
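
As an illustration of that division of labour, here is a sketch using the EmailMessage API from the email package in current Python 3 (which postdates this thread; the subject text and file name are made up). The caller hands over a str and a bytes object and never names a charset or a transfer encoding:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

body_text = "Grüße aus Tsukuba"
blob = bytes(range(256))

msg = EmailMessage()
msg["Subject"] = "Mixed bag"
# Text goes in as str; the module picks the charset and transfer encoding.
msg.set_content(body_text)
# Binary goes in as bytes; the module armors it for the wire.
msg.add_attachment(blob, maintype="application",
                   subtype="octet-stream", filename="blob.bin")

# At this level mail is a binary channel.
wire = msg.as_bytes()

# The receiving side gets str and bytes back, again without naming a codec.
received = message_from_bytes(wire, policy=policy.default)
text_back = received.get_body(preferencelist=("plain",)).get_content()
blob_back = next(received.iter_attachments()).get_content()

print(text_back.rstrip("\n") == body_text, blob_back == blob)
```

All of the codec knowledge lives inside the module, which is where the email module programmer put it.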

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.