[Python-Dev] Why does base64 return bytes?

R. David Murray rdmurray at bitdance.com
Tue Jun 14 14:05:55 EDT 2016

On Tue, 14 Jun 2016 14:05:19 -0300, "Joao S. O. Bueno" <jsbueno at python.org.br> wrote:
> On 14 June 2016 at 13:32, Toshio Kuratomi <a.badger at gmail.com> wrote:
> >
> > On Jun 14, 2016 8:32 AM, "Joao S. O. Bueno" <jsbueno at python.org.br> wrote:
> >>
> >> On 14 June 2016 at 12:19, Steven D'Aprano <steve at pearwood.info> wrote:
> >> > Is there
> >> > a good reason for returning bytes?
> >>
> >> What about: it returns 0-255 numeric values for each position in  a
> >> stream, with
> >> no clue whatsoever to how those values map to text characters beyond
> >> the 32-128 range?
> >>
> >> Maybe base64.decode could take a "encoding" optional parameter - or
> >> there could  be
> >> a separate 'decote_to_text" method that would explicitly take a text codec
> >> name.
> >> Otherwise, no, you simply can't take a bunch of bytes and say they
> >> represent text.
> >>
> > Although it's not explicit, the question seems to be about the output of
> > encoding (and for symmetry, the input of decoding).  In both of those cases,
> > valid output will consist only of ascii characters.
> >
> > The input to encoding would have to remain bytes (that's the main purpose of
> > base64... to turn bytes into an ascii string).
> >
> Sorry, it is 2016, and I don't think at this point anyone can consider
> an ASCII string
> as a representative pattern of textual data in any field of application.
> Bytes are not text. Bytes with an associated, meaningful, encoding are text.
>   I thought this had been through when Python 3 was out.
> Unless you are working with COBOL generated data (and intending to keep
> the file format) , it does not make sense in any real-world field.
> (supposing your
> Cobol data is ASCII and nort EBCDIC).

The fundamental purpose of the base64 encoding is to take a series
of arbitrary bytes and reversibly turn them into another series of
bytes in which the eighth bit is not significant.  Its utility is for
transmitting eight bit bytes over a channel that is not eight bit clean.
Before unicode, that meant bytes.  Now that we have unicode in use in
lots of places, you can think of unicode as a communications channel
that is not eight bit clean.  So, we might want to use base64 encoding to
transmit arbitrary bytes over a unicode channel.  This gives a legitimate
reason to want unicode output from a base64 encoder.   However, it is
equally legitimate in the Python context to say you should be explicit
about your intentions by decoding the bytes output of the base64 encoder
using the ASCII codec.

This was indeed discussed at length.  For a while we didn't even allow
unicode input on either side, but we relaxed that.  My understanding of
Python's current stance on functions that handle both bytes and string
is that *either* the function accepts both types and outputs the *same*
type as the input, *or* it accepts both types but always outputs *one*
type or the other.

You can't have unicode output if you give unicode input to the base64
decoder in the general case.  So decode, at least, has to always give
bytes output.  Likewise, there is small to zero utility for using unicode
input to the base64 encoder, since the unicode would have to be ASCII
only and there'd be no point in doing the encoding.  So, the only thing
that makes sense is to follow the "one output type" rule here.

Now, you can argue whether or not it would make sense for the encoder
to always produce unicode.  However, you then immediately run into the
backward compatibility issue:  the primary use case of the base64 encoding
is to produce *wire ready* bytes.  This is what the email package uses
it for, for example.  So for backward compatibility reasons, which
are consonant with its primary use case, it makes more sense for the
encoder to produce bytes than string.  If you need to transmit bytes
over a unicode channel, you can decode it from ASCII.  That is,
unicode is the *exceptional* use case here, not the rule.  That might
in fact be changing, but for backward compatibility reasons, Python
won't change.

And that should answer Steve's original question :)


More information about the Python-Dev mailing list