[Python-Dev] Why can't I encode/decode base64 without importing a module?

Stephen J. Turnbull stephen at xemacs.org
Tue Apr 23 18:49:39 CEST 2013


R. David Murray writes:
 > On Tue, 23 Apr 2013 22:29:33 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
 > > R. David Murray writes:
 > > 
 > >  > You transform *into* the encoding, and untransform *out* of the
 > >  > encoding.  Do you have an example where that would be ambiguous?
 > > 
 > > In the bytes-to-bytes case, any pair of character encodings (eg, UTF-8
 > > and ISO-8859-15) would do.  Or how about in text, ReST to HTML?
 > 
 > If I write:
 > 
 >   bytestring.transform('ISO-8859-15')
 > 
 > that would indeed be ambiguous, but only because I haven't named the
 > source encoding of the bytestring.  So the above is obviously
 > nonsense, and the easiest "fix" is to have the things that are currently
 > bytes-to-text or text-to-bytes character set transformations *only*
 > work with encode/decode, and not transform/untransform.

I think you're completely missing my point here.  The problem is that
in the cases I mention, what is encoded data and what is decoded data
can only be decided by asking the user.
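
To make that concrete with nothing but today's stdlib (a toy
illustration, not a proposal): the same pair of charsets can be
transcoded in either direction, and only the user knows which
direction deserves to be called "encoding".

    # The bytes themselves don't say which charset they are "in";
    # only the user can supply the directionality.
    data = "déjà vu".encode("utf-8")

    # One user calls this a transform from UTF-8 to ISO-8859-15 ...
    latin9 = data.decode("utf-8").encode("iso-8859-15")

    # ... another, starting from latin9, wants the exact inverse.
    roundtrip = latin9.decode("iso-8859-15").encode("utf-8")

    assert roundtrip == data
    # Neither direction is intrinsically "transform" vs. "untransform".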

 > I believe that after much discussion we have settled on these
 > transformations (in their respective modules) accepting either bytes
 > or strings as input for decoding, only bytes as input for encoding,
 > and *always* producing bytes as output.

Which, of course, is quite broken from the point of view of the RFC!
But the RFC be damned[1]: for the purposes of the Python stdlib, the
specific codecs used for Content-Transfer-Encoding have a clear
intuitive directionality, and their encoding methods should turn bytes
into bytes (and their decoding methods str or bytes into bytes).

Nevertheless, it's not TOOWTDI; it's a careful compromise.
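
(For concreteness, this is what the module-level functions already do
today: encoding demands bytes and returns bytes, while decoding accepts
bytes or an ASCII-only str and still returns bytes.)

    import base64

    encoded = base64.b64encode(b"\x00\xff")          # b'AP8='
    assert base64.b64decode(encoded) == b"\x00\xff"  # bytes in, bytes out
    assert base64.b64decode("AP8=") == b"\x00\xff"   # str also accepted
    # base64.b64encode("AP8=") raises TypeError: str is refused on the
    # encoding side.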

 > Given this, the possible valid transformations would be:
 > 
 >   bytestring.transform('base64')
 >   bytestring.untransform('base64')
 >   string.untransform('base64')

Which is an obnoxious API, since (1) you've now made it impossible to
use "transform" for

    bytestring.transform(from='utf-8', to='iso-8859-1')
    bytestring.transform(from='ulaw', to='mp3')
    textstring.transform(from='rest', to='html')

without confusion, and (2) the whole world is going to wonder why you
don't use .encode and .decode instead of .transform and .untransform.
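
What I mean by that first signature is something like the following
self-contained sketch (entirely made up; note that "from" is a keyword,
so a real spelling would need source/target or similar, and the
if-chain merely stands in for a real registry):

    import base64

    def transform(data, *, source, target):
        # Hypothetical two-ended converter; the bodies just reuse
        # today's stdlib.
        if (source, target) == ("utf-8", "iso-8859-1"):
            return data.decode("utf-8").encode("iso-8859-1")
        if (source, target) == ("bytes", "base64"):
            return base64.b64encode(data)
        if (source, target) == ("base64", "bytes"):
            return base64.b64decode(data)
        raise LookupError("no converter for %s -> %s" % (source, target))

    assert transform(b"caf\xc3\xa9",
                     source="utf-8", target="iso-8859-1") == b"caf\xe9"
    assert transform(b"caf\xe9",
                     source="bytes", target="base64") == b"Y2Fm6Q=="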

The idea in the examples is that we could generalize the codec
registry to look up codecs by pairs of media-types.  I'm not sure this
makes sense ... much of the codec API presumes a stream, especially
the incremental methods.  But many MIME media types are streams only
because they're serializations; for those, incremental en/decoding is
nonsense.
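
(All I mean by "registry" is something as crude as a table keyed by
(source, target) pairs; the names below are made up, and the sketch
deliberately deals only in whole buffers, with no stream or incremental
interface.)

    import base64

    _converters = {}  # (source_type, target_type) -> callable

    def register_converter(source, target, func):
        _converters[(source, target)] = func

    def lookup_converter(source, target):
        try:
            return _converters[(source, target)]
        except KeyError:
            raise LookupError("unknown conversion: %s -> %s"
                              % (source, target))

    register_converter("bytes", "BASE64", base64.b64encode)
    register_converter("BASE64", "bytes", base64.b64decode)

    assert lookup_converter("bytes", "BASE64")(b"\x89PNG") == b"iVBORw=="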

So I suppose I would want to write

    bytestring.transform(from='octet-stream', to='BASE64')

for this hypothetical API.  (I suspect that in practice the
'application/octet-stream' media type would be spelled 'bytes', of
course.)  This kind of API could be used to improve the security of
composition of transforms.  In the case of BASE64, it would make sense
to match anything at all as the other type (as long as it's
represented in Python by a bytes object).  So it would be possible to
do

    object = bytestring.transform(from='BASE64', to='PNG')

giving object a media_type attribute such that

    object.decode('iso-8859-1')

would fail.  (This would require changes to the charset codecs, to pay
heed to the media_type attribute, so it's not immediately feasible.)
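
To be explicit about the kind of thing I'm imagining (purely a sketch;
nothing like this exists in the stdlib, and the spelling is surely
wrong):

    class TypedBytes(bytes):
        """Hypothetical bytes-with-a-media_type, as sketched above."""

        def __new__(cls, data, media_type="application/octet-stream"):
            self = super().__new__(cls, data)
            self.media_type = media_type
            return self

        def decode(self, encoding="utf-8", errors="strict"):
            # The real charset codecs know nothing about media_type;
            # this check is the change the parenthetical above says
            # would be needed.
            if not self.media_type.startswith("text/"):
                raise TypeError("refusing to charset-decode %s data"
                                % self.media_type)
            return super().decode(encoding, errors)

    png = TypedBytes(b"\x89PNG\r\n\x1a\n", media_type="image/png")
    # png.decode('iso-8859-1') now raises TypeError instead of silently
    # producing mojibake.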

 > and all would produce a byte string.  That byte string would be in
 > base64 for the first one, and a decoded binary string for the second two.
 > 
 > Given our existing API, I don't think we want
 > 
 >   string.encode('base64')
 > 
 > to work (taking an ascii-only unicode string and returning bytes),

No, we don't, but for reasons that have little to do with "ASCII-only".
The problem with supporting that idiom is that *people can't read
strs* [in the Python 3 internal representation] -- they can only read
a str that has been encoded, either implicitly via PYTHONIOENCODING or
explicitly to a requested encoding.  So the usage above is
clearly ambiguous.  Even if it is ASCII-only, in theory the user could
want EBCDIC.
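
The current API at least forces that choice into the open: the str has
to become bytes via some explicitly named charset before base64 can
touch it, and different (equally legitimate) choices give different
results.

    import base64

    text = "MAIL"
    base64.b64encode(text.encode("ascii"))   # b'TUFJTA=='
    base64.b64encode(text.encode("cp500"))   # EBCDIC input: b'1MHJ0w=='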

 > If you do transform('base64') on a bytestring already encoded as
 > base64 you get a double encoding, yes.  I don't see that it is our
 > responsibility to try to protect you from this mistake.  The module
 > functions certainly don't.
 > 
 > Given that, is there anything ambiguous about the proposed API?

Not for BASE64.  But what's so special about BASE64 that it deserves a
new method name for the same old idiom, using a word that's an obvious
candidate for naming a more general idiom?
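
(And agreed on the double-encoding point; the module functions will
cheerfully do it today, and nothing short of type-tagging could stop
them.)

    import base64

    raw = b"\x00\x01\x02"
    once = base64.b64encode(raw)    # b'AAEC'
    twice = base64.b64encode(once)  # b'QUFFQw==' -- base64 of the base64
    assert base64.b64decode(base64.b64decode(twice)) == raw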

 > (Note: if you would like to argue that, eg, base64.b64encode or
 > binascii.b2a_base64 should return a string, it is too late for that
 > argument for backward compatibility reasons.)

Even if it weren't too late, the byte-shoveling lobby is way too
strong; that's not a winnable argument.
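
(And, for the record, both current spellings are firmly bytes-out;
b2a_base64 even appends a newline.)

    import base64, binascii

    base64.b64encode(b"x")      # b'eA=='
    binascii.b2a_base64(b"x")   # b'eA==\n'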

 > When I asked about ambiguous cases, I was asking for cases where
 > the meaning of "transform('somecodec')" was ambiguous.

If "transform('somecodec')" isn't ambiguous, you really really want to
spell it "encode" instead of "transform" IMO.  Even though I don't see
how to do that without generating more confusion than it's worth at
the moment, I still harbor the hope that somebody will come up with a
way to do it so everything still fits together.


Footnotes: 
[1]  I am *not* one to damn RFCs lightly!


