[Python-Dev] Why can't I encode/decode base64 without importing a module?

MRAB python at mrabarnett.plus.com
Thu Apr 25 16:22:00 CEST 2013


On 25/04/2013 14:34, Lennart Regebro wrote:
> On Thu, Apr 25, 2013 at 2:57 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> I can think of many use cases where I want to *embed* base64-encoded
>> data in a larger text *before* encoding that text and transmitting
>> it over an 8-bit channel.
>
> That still doesn't mean that this should be the default behavior. Just
> because you *can* represent base64 as Unicode text doesn't mean that
> it should be.
>
>> (GPG signatures, binary data embedded in JSON objects, etc.)
>
> Is the GPG signature calculated on the *Unicode* data? How is that
> done? Isn't it done on the encoded message? As I understand it a GPG
> signature is done on any sort of document. Either you or I have
> completely misunderstood how GPG works, I think. :-)
>
> In the case of JSON objects, they are intended for data exchange, and
> hence in the end need to be byte strings. So if you have a byte string
> you want to base64 encode before transmitting it with json, you would
> just end up transforming it to a unicode string and then back. That
> doesn't seem useful.
>
The JSON specification says that it's text. Its string literals can
contain Unicode codepoints. It needs to be encoded to bytes for
transmission and storage, but JSON itself is not a bytestring format.
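
A rough sketch of that flow, for illustration only (the payload, the
"note" and "blob" key names, and the choice of UTF-8 are made up for this
example and are not something JSON or the base64 module mandates):

```python
import base64
import json

# Arbitrary binary data (made up for this example).
payload = b"\x00\xff\x10binary"

# In Python 3, b64encode() returns bytes; decode as ASCII to get text
# that can be placed in a JSON string literal.
b64_text = base64.b64encode(payload).decode("ascii")

# The JSON document is text (str) and may contain non-ASCII codepoints.
document = json.dumps({"note": "caf\u00e9", "blob": b64_text},
                      ensure_ascii=False)

# Only for transmission or storage is the text encoded to bytes.
wire_bytes = document.encode("utf-8")

# Receiving side: bytes -> text -> JSON -> base64 text -> binary.
received = json.loads(wire_bytes.decode("utf-8"))
restored = base64.b64decode(received["blob"])
```

Note that passing the bytes from b64encode() straight to json.dumps()
would raise a TypeError, since the json module only serialises str for
string values.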

> One use case where you clearly *do* want the base64 encoded data to be
> unicode strings is when you want to embed it in a text discussing
> base64 strings, for a blog or a book or something. That doesn't seem
> to be a very common use case.
>
> For the most part you base64 encode things because it's going to be
> transmitted, and hence the natural result of a base64 encoding should
> be data that is ready to be transmitted, hence byte strings, and not
> Unicode strings.
>
>> Python 3 doesn't *view* text as unicode, it *represents* it as unicode.
>
> I don't agree that there is a significant difference between those
> wordings in this context. The end result is the same: Things intended
> to be handled/seen as textual should be unicode strings, things
> intended for data exchange should be byte strings. Something that is
> base64 encoded is primarily intended for data exchange. A base64
> encoding should therefore return byte strings, especially since most
> APIs that perform this transmission will take byte strings as input.
> If you want to include this in textual data, for whatever reason, like
> printing it in a book, then the conversion is trivial, but that is
> clearly the less common use case, and should therefore not be the
> default behavior.
>
base64 is a way of encoding binary data as text. The problem is that
traditionally text has been encoded with one byte per character, except
in those locales where there were too many characters in the character
set for that to be possible.

In Python 3 we're trying to stop mixing binary data (bytestrings) with
text (Unicode strings).
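
A small sketch of what that separation looks like in practice (the
variable names and the "data: " prefix are only illustrative):

```python
import base64

# b64encode() takes bytes and returns bytes in Python 3.
encoded = base64.b64encode(b"hello")

# Mixing the result with text is refused rather than silently coerced.
try:
    "data: " + encoded
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

# An explicit decode is needed to treat the base64 output as text.
as_text = "data: " + encoded.decode("ascii")
```

The explicit .decode("ascii") is the step this thread is arguing about:
whether the caller should have to write it, or whether the encoder
should return text in the first place.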

