[Python-Dev] bytes.from_hex()

Stephen J. Turnbull stephen at xemacs.org
Mon Feb 27 06:59:44 CET 2006


>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:

    Greg> Stephen J. Turnbull wrote:

    >> I gave you one, MIME processing in email

    Greg> If implementing a mime packer is really the only use case
    Greg> for base64, then it might as well be removed from the
    Greg> standard library, since 99.99999% of all programmers will
    Greg> never touch it.  I don't have any real-life use cases for
    Greg> base64 that a non-mime-implementer might come across, so all
    Greg> I can do is imagine what shape such a use case might have.

I guess we don't have much to talk about, then.

    >> Give me a use case where it matters practically that the output
    >> of the base64 codec be Python unicode characters rather than
    >> 8-bit ASCII characters.

    Greg> I'd be perfectly happy with ascii characters, but in Py3k,
    Greg> the most natural place to keep ascii characters will be in
    Greg> character strings, not byte arrays.

Natural != practical.

Anyway, I disagree, and I've lived with the problems that come with an
environment that mixes objects with various underlying semantics into
a single "text stream" for a decade and a half.

That doesn't make me authoritative, but as we agree to disagree, I
hope you'll keep in mind that someone with real-world experience that
is somewhat relevant[1] to the issue doesn't find that natural at all.

    Greg> Since the Unicode character set is a superset of the ASCII
    Greg> character set, it doesn't seem unreasonable that they could
    Greg> also be thought of as Unicode characters.

I agree.  However, as soon as I go past that intuition to thinking
about what that implies for _operations_ on the base64 string, it
begins to seem unreasonable, unnatural, and downright dangerous.  The
base64 string is a representation of an object that doesn't have text
semantics.  Nor do base64 strings have text semantics: they can't even
be concatenated as text (the pad character '=' is typically a syntax
error in a profile of base64, except as terminal padding).  So if you
wish to concatenate the underlying objects, the base64 strings must be
decoded, concatenated, and re-encoded in the general case.  IMO it's
not worth preserving the very superficial coincidence of "character
representation" in the face of such semantics.
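
To make the concatenation point concrete, here is a minimal sketch using
the standard library's base64 module as it exists in Python 3 (where
b64encode happens to return bytes).  The helper name concat_b64 is
hypothetical, introduced only for illustration.

```python
import base64


def concat_b64(first: bytes, second: bytes) -> bytes:
    """Concatenate the objects *underlying* two base64 strings.

    Hypothetical helper: decode both, join the raw bytes, re-encode.
    """
    raw = base64.b64decode(first) + base64.b64decode(second)
    return base64.b64encode(raw)


enc_a = base64.b64encode(b"a")  # b"YQ=="
enc_b = base64.b64encode(b"b")  # b"Yg=="

# Joining the encoded forms as if they were text leaves '=' padding
# in the middle, and is not the encoding of b"ab".
naive = enc_a + enc_b  # b"YQ==Yg=="

# Decoding, concatenating, and re-encoding gives the real answer.
correct = concat_b64(enc_a, enc_b)  # b"YWI="
```

Here naive and correct differ, and only correct decodes back to b"ab"
under every profile of base64.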

I think the fact that favoring the coincidence of representation
leads you to also deprecate the very natural use of the codec API to
implement and understand base64 is indicative of a deep problem with
the idea of implementing base64 as bytes->unicode.


Footnotes: 
[1]  That "somewhat" is intended literally; my specialty is working
with codecs for humans in Emacs, but I've also worked with more
abstract codecs such as base64 in contexts like email, in both LISP
and Python.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

