[Python-Dev] Why does base64 return bytes?

Stephen J. Turnbull stephen at xemacs.org
Tue Jun 14 22:58:04 EDT 2016

Greg Ewing writes:

 > The RFC does *not* do that. It describes the output in terms of
 > characters, and does not specify any bit patterns for the
 > output.

The RFC is unclear on this point, but I read it as specifying the
ASCII coded character set, not the ASCII repertoire of (abstract)
characters.  Therefore, it specifies an invertible mapping from a
particular set of integers to characters.

 > The intention is clearly to represent binary data as *text*.

It's more subtle than that.  *RFCs do not deal with text.*  Text is
an internal concept of (some) programming environments.  RFCs may
deal with *encoded text*, and RFC 4648 indeed specifically mentions
"encoded characters" as the output of the BASE64 algorithm.[1]

The intention then is to represent binary data with *binary data that
may be conveniently interpreted as text* (ie, without reencoding), eg,
by a terminal or a printer.[2]  It is also desirable that it be likely
to pass unscathed through channels that are not necessarily even 7-bit
clean (file system directories and JIS X 0201, for example) which
*inadvertantly* treat it as text.  Both requirements are conveniently
fulfilled by using appropriate ASCII subsets, and encoding on the wire
using the usual bit patterns.  However, I suppose you could also use
EBCDIC or UTF-16, as long as you have agreed with the receiver to do

So I would say that Python can do what it wants with the type that
base64.b64encode returns as far as the RFC is concerned; that's an
internal aspect of Python.  It's purely a matter of our convenience
(as programmer *in* Python) whether we return str or bytes.

My own experience is biased toward email and web (not to be confused
with SMTP and HTTP), and so my experience is that most composers
(1) automatically handle text encodings for the users, and then the
content transfer encoding as necessary for the underlying protocol,
and (2) handle attachments by placing a reference in the composed
content, which is replaced by the object just before transmission (and
any desired content transfer encoding is applied at that time, at the
option of the composing agent, which rarely needs to bother the user
with such trivia).  Bytes seem more convenient to me, and give an on-
the-wire representation consistent with that of Python 2 str.

[1]  Admittedly, RFC 3986 (URIs) does stretch the notion of "encoded
text" to the breaking point by including marks on paper.

[2]  Thus, BASE64-encoding resources provides a more efficient,
alternative datagram protocol for the physical links used by RFC 1149

More information about the Python-Dev mailing list