[Python-Dev] Why can't I encode/decode base64 without importing a module?

Tue Apr 23 16:16:01 CEST 2013

On Tue, 23 Apr 2013 22:29:33 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> R. David Murray writes:
> 
>  > You transform *into* the encoding, and untransform *out* of the
>  > encoding.  Do you have an example where that would be ambiguous?
> 
> In the bytes-to-bytes case, any pair of character encodings (eg, UTF-8
> and ISO-8859-15) would do.  Or how about in text, ReST to HTML?

If I write:

  bytestring.transform('ISO-8859-15')

that would indeed be ambiguous, but only because I haven't named the
source encoding of the bytestring.  So the above is obviously
nonsense, and the easiest "fix" is to have the things that are currently
bytes-to-text or text-to-bytes character set transformations *only*
work with encode/decode, and not transform/untransform.
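
As a rough illustration of why a character set transformation has to
name both encodings, here is how the conversion is spelled with the
existing encode/decode methods (a sketch only; the sample text is an
arbitrary choice):

```python
# Transcoding between two character encodings with the existing API.
# Both encodings are named explicitly, so nothing is ambiguous.
text = "\u20ac"                      # EURO SIGN, present in both charsets
utf8_bytes = text.encode("utf-8")    # text -> bytes

# bytes -> text (source named) -> bytes (target named)
latin9_bytes = utf8_bytes.decode("utf-8").encode("iso-8859-15")

print(utf8_bytes, latin9_bytes)
```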

> BASE64 itself is ambiguous.  By RFC specification, BASE64 is a
> *textual* representation of arbitrary binary data.  (Cf. URIs.)  The
> natural interpretation of .encode('base64') in that context would be
> as a bytes-to-text encoder.  However, this has several problems.  In
> practice, we invariably use an ASCII octet stream to carry BASE64-
> encoded data.  So web developers would almost certainly expect a
> bytes-to-bytes encoder.  Such a bytes-to-bytes encoder can't be
> duck-typed.  Double-encoding bugs wouldn't be detected until the
> stream arrives at the user.  And the RFC-based signature of
> .encode('base64') as bytes-to-text is precisely opposite to that of
> .encode('utf-8') (text-to-bytes).

I believe that after much discussion we have settled on these
transformations (in their respective modules) accepting either bytes
or strings as input for decoding, only bytes as input for encoding,
and *always* producing bytes as output.  (Note that the base64 docs need
some clarification about this.)
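
For reference, a minimal sketch of that behaviour using the existing
base64 module functions, assuming Python 3.3 or later (where
b64decode accepts ASCII-only strings):

```python
import base64

encoded = base64.b64encode(b"hello")      # bytes in, bytes out
from_bytes = base64.b64decode(encoded)    # bytes in, bytes out

# An ASCII-only str is accepted for decoding; the result is still bytes.
from_str = base64.b64decode(encoded.decode("ascii"))

# A str is not accepted for encoding.
try:
    base64.b64encode("hello")
except TypeError:
    str_rejected = True
else:
    str_rejected = False

print(encoded, from_bytes, from_str, str_rejected)
```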

Given this, the possible valid transformations would be:

  bytestring.transform('base64')
  bytestring.untransform('base64')
  string.untransform('base64')

and all would produce a byte string.  That byte string would be in
base64 for the first one, and decoded binary data for the other two.

Given our existing API, I don't think we want

  string.encode('base64')

to work (taking an ascii-only unicode string and returning bytes), and
we've already agreed that adding a 'decode' method to string is not
going to happen.

We could, however, and quite possibly should, disallow

  string.untransform('base64')

even though the underlying module supports it.  Thus we would only have
bytes-to-bytes transformations for 'base64' and its siblings, and you
would write the unicode-ascii-to-bytes transformation as:

  string.encode('ascii').untransform('base64')

which has some pedagogical value :).
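
The bytes-only variant can be sketched with plain functions.  Note that
transform() and untransform() below are hypothetical stand-ins for the
proposed methods (no such methods exist on bytes), and the sketch
assumes only the 'base64' codec is handled:

```python
import base64

# Hypothetical helpers mimicking the proposed bytes-only API.
def transform(data, codec):
    if not isinstance(data, bytes):
        raise TypeError("transform() requires bytes")
    if codec != "base64":
        raise LookupError("unknown transformation: %s" % codec)
    return base64.b64encode(data)

def untransform(data, codec):
    # Strings are rejected here even though base64.b64decode accepts them.
    if not isinstance(data, bytes):
        raise TypeError("untransform() requires bytes")
    if codec != "base64":
        raise LookupError("unknown transformation: %s" % codec)
    return base64.b64decode(data)

string = "aGVsbG8="
# The text-to-bytes step has to be spelled out before untransforming.
decoded = untransform(string.encode("ascii"), "base64")
print(decoded)
```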

If you do transform('base64') on a bytestring already encoded as base64
you get a double encoding, yes.  I don't see that it is our responsibility
to try to protect you from this mistake.  The module functions certainly
don't.
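
A short sketch of the double encoding case with the module functions
(the payload is arbitrary).  The second call succeeds silently, because
base64 output is itself just bytes:

```python
import base64

once = base64.b64encode(b"hello")
twice = base64.b64encode(once)   # no error: the input is valid bytes

# Undoing the mistake takes two decoding steps.
decoded_once = base64.b64decode(twice)
decoded_twice = base64.b64decode(decoded_once)
print(once, twice, decoded_twice)
```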

Given that, is there anything ambiguous about the proposed API?

(Note: if you would like to argue that, eg, base64.b64encode or
binascii.b2a_base64 should return a string, it is too late for that
argument for backward compatibility reasons.)

> It is certainly true that there are many unambiguous cases.  In the
> case of a true text processing facility (eg, Emacs buffers or Python 3
> str) where there is an unambiguous text type with a constant and
> opaque internal representation, it makes a lot of sense to treat the
> text type as special/central, and use the terminology "encode [from
> text]" and "decode [to text]".  It's easy to remember, which one is
> special is obvious, and the difference in input and output types means
> that mistaken use of the API will be detected by duck-typing.
> 
> However, in the case of bytes-bytes or text-text transformations, it's
> not the presence of unambiguous cases that should drive API design
> IMO.  It's the presence of the ambiguous cases that we should cater
> to.  I don't see easy solutions to this issue.

When I asked about ambiguous cases, I was asking for cases where the
meaning of "transform('somecodec')" was ambiguous.  Sure, it is possible
to feed the wrong input into that transformation, but I consider that a
programming error, not an ambiguity in the API.  After all, you have
exactly the same problem if you use the module functions directly,
which is currently the only option.

--David