[Python-Dev] Why does base64 return bytes?

Tue Jun 14 14:44:47 EDT 2016

Steven D'Aprano writes:

 > base64.b64encode takes bytes as input and returns bytes. Some people are 
 > arguing that this is wrong behaviour, as RFC 3548

That RFC is obsolete: the replacement is RFC 4648.  However, the text
is essentially unchanged.

 > specifies that Base64 should transform bytes to characters:

Without defining "character" except as a "subset" of ASCII.  That
omission is evidently deliberate.  Unfortunately, the RFC does not
say whether it means a subset of the ASCII repertoire of (abstract)
characters or a subset of the ASCII codes.  I believe the latter is
meant, but either way, it does refer to *encoded* characters as the
output of the encoding process:

 >     The encoding process represents 24-bit groups of input bits 
 >     as output strings of 4 encoded characters. 

and I see no reason to deny that the bytes output by base64.b64encode
are the octets representing the ASCII codes for the characters of the
BASE64 alphabet.
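
A quick check of that reading, as a minimal sketch: the input b"Man"
is just a sample value, and B64_ALPHABET is a name made up for this
illustration, not something the base64 module exports.

```python
import base64
import string

# The 64 characters of the standard Base64 alphabet in RFC 4648,
# plus the "=" pad character.  (Illustrative name, not a stdlib one.)
B64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/="
)

encoded = base64.b64encode(b"Man")   # bytes in, bytes out
as_text = encoded.decode("ascii")    # the same octets read as ASCII

# Every output octet is the ASCII code of one character of the alphabet.
assert isinstance(encoded, bytes)
assert all(ch in B64_ALPHABET for ch in as_text)
assert list(encoded) == [ord(ch) for ch in as_text]
```

Nothing is lost or added by the decode("ascii") step; it only changes
which Python type holds the same code values.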

 > Are they misinterpreting the standard?

I think they are.  As I understand it, the intention of the standard
in using "character" to denote the code unit is similar to that of RFC
3986: BASE encodings are intended to be printable and recognizable to
humans.  If you're using an encoding that is not a superset of ASCII,
such as EBCDIC, for text I/O, then you should translate from ASCII to
that encoding for display.  In the (unlikely) case that a human types
BASE-encoded text at the terminal, the reverse translation is
necessary.
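
To make the translation concrete, here is a sketch that assumes
Python's cp037 codec as a stand-in for "EBCDIC" (a real system might
use a different code page) and b"Man" as sample data.

```python
import base64

wire = base64.b64encode(b"Man")      # ASCII-coded octets, as sent on the wire

# For display on an EBCDIC terminal: same characters, different octets.
# cp037 is an assumption; it is one of several EBCDIC code pages.
ebcdic = wire.decode("ascii").encode("cp037")

# A human typing the text on that terminal needs the reverse translation
# before the octets can be BASE64-decoded.
back = ebcdic.decode("cp037").encode("ascii")

assert ebcdic != wire                # the octet values really do differ
assert back == wire
assert base64.b64decode(back) == b"Man"
```

The characters are the same at both ends; only the wire form is pinned
to the ASCII code values.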

 > Has Python got it wrong?

I can't see anything in the RFC that suggests that.  And, in the end,
an RFC is not concerned with Python's internal fiddling, but rather
with what goes out over the wire.  All of the implementations you
mention will eventually send to the wire octets that are interpreted
as ASCII-encoded characters according to their integer values.

 > Is there a good reason for returning bytes?

I suppose practicality over purity: BASE encodings are normally used
on the wire, so programs need to encode text to octets in an
appropriate encoding *before* BASE encoding, and then normally put
the BASE-encoded content on the wire immediately.  Why round-trip
from UTF-8 bytes to a str in BASE64 representation, and then do the
(trivial) conversion back to bytes?  OK, it's not that expensive, but
still...
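
For concreteness, a sketch of the usual pipeline next to the round
trip that a str-returning API would force.  The sample text and the
variable names are made up for the example.

```python
import base64

text = "na\u00efve caf\u00e9"        # sample text with non-ASCII characters

# Usual pipeline: text -> UTF-8 octets -> BASE64 octets, ready for the wire.
payload = text.encode("utf-8")
wire = base64.b64encode(payload)

# What a str-returning b64encode would require before writing to a socket:
# a hop through str and straight back to the same bytes.
as_str = wire.decode("ascii")
wire_again = as_str.encode("ascii")

assert wire_again == wire            # the detour changes nothing
assert base64.b64decode(wire).decode("utf-8") == text
```

The detour is cheap, but it produces exactly the bytes b64encode
already returned.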