[Python-3000] base64 - bytes and strings

Talin talin at acm.org
Mon Jul 30 03:21:13 CEST 2007


Guido van Rossum wrote:
> On 7/29/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>> Martin v. Löwis wrote:
>>> The point that proponents of "base64 encoding should
>>> yield strings" miss is that US-ASCII is *both* a character set,
>>> and an encoding.
>> Last time we discussed this, I went and looked at the
>> RFC where base64 is defined. According to my reading of
>> it, nowhere does it say that base64 output must be
>> encoded as US-ASCII, nor any other particular encoding.
>>
>> It *does* say that the characters used were chosen because
>> they are present in a number of different character sets
in use at the time, and explicitly mentions EBCDIC as one
>> of those character sets.
>>
>> To me this quite clearly says that base64 is defined at
>> the level of characters, not encodings.
> 
> I think it's all besides the point. We should look at the use cases. I
> recall finding out once that a Java base64 implementation was much
> slower than Python's -- turns out that the Java version was converting
> everything to Strings; then we needed to convert back to bytes in
> order to output them. My suspicion is that in the end using bytes is
> more efficient *and* more convenient; it might take some looking
> through the email package to confirm or refute this. (The email
> package hasn't been converted to work in the struni branch; that
> should happen first. Whoever does that might well be the one who tells
> us how they want their base64 APIs.)
> 
> An alternative might be to provide both string- and bytes-based APIs,
> although that doesn't help with deciding what the default one (the one
> that uses the same names as 2.x) should do.

One has to be careful when comparing performance with Java, because you 
need to specify whether you are using the "old" API or the "new" one. 
(It seems that almost everything in Java has an old and new API.)

I just recently did some work in Java with base64 encoding, or more 
specifically, URL-safe encoding. The library I was working with both 
consumed and produced arrays of bytes. I think that this is the correct 
way to do it.
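Python's own base64 module already works this way for the URL-safe variant: bytes in, bytes out. A minimal sketch (the input bytes are chosen so that plain base64 would need '+' and '/', which the URL-safe alphabet replaces with '-' and '_'):

```python
import base64

# Bytes-in, bytes-out round trip through the URL-safe alphabet,
# like the Java library described above.
raw = bytes([0xfb, 0xef, 0xff])          # would encode as b'++//' in plain base64
encoded = base64.urlsafe_b64encode(raw)  # b'--__': '-' and '_' replace '+' and '/'
decoded = base64.urlsafe_b64decode(encoded)
assert decoded == raw
```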

In my specific use case, I was dealing with encrypted bytes, where the 
encrypter also produced and consumed bytes, so it made sense that the 
character encoder did the same. But even in the case where no encryption 
is involved, I think dealing with bytes is right.

I believe that converting a Unicode string to a base64 encoded form is 
necessarily a 2-step process. Step 1 is to convert from unicode 
characters to bytes, using an appropriate character encoding (UTF-8, 
UTF-16, and so on), and step 2 is to encode the bytes in base64. The 
resulting encoded byte array is actually an ASCII-encoded string, 
although it's more convenient in most cases to represent it as a byte 
array than as a string object, since it's likely in most cases that you 
are about to send it over the wire. So in other words, while it makes 
sense to think about the conversion as (string -> bytes -> string), the 
actual objects being generated are (string -> bytes -> bytes).
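In Python terms, the two-step pipeline looks like this (a sketch using the standard base64 module):

```python
import base64

# The two-step pipeline: string -> bytes -> bytes.
text = "hello"                    # a unicode string
step1 = text.encode("utf-8")      # step 1: characters -> bytes, via a character encoding
step2 = base64.b64encode(step1)   # step 2: bytes -> base64-encoded bytes
# step2 is b'aGVsbG8=' -- every byte is in the ASCII range, but it is
# still a bytes object, ready to go over the wire with no further conversion.
```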

That two steps are needed is evident from the fact that two encodings 
are actually involved, and that these two encodings are mostly 
independent. For example, one could just as easily base64-encode a 
UTF-16 encoded string as a UTF-8 encoded string. The fact that you can 
vary one encoding without changing the other argues that they are 
distinct and independent.
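A quick demonstration of that independence: the same characters, run through a different step-1 encoding, produce different base64 output (UTF-16-LE is used here rather than plain UTF-16 just to avoid the byte-order mark):

```python
import base64

# Same text, same base64 step, different character encodings -> different output.
text = "hi"
utf8_b64 = base64.b64encode(text.encode("utf-8"))       # b'aGk='
utf16_b64 = base64.b64encode(text.encode("utf-16-le"))  # b'aABpAA=='
assert utf8_b64 != utf16_b64
```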

Nor can you collapse this into a single encoding step - you can't go 
directly from an internal unicode string to base64, since a unicode 
string is an array of code units ranging from 0 to 0xFFFF, and base64 
operates on bytes, which can't hold a value larger than 255.

Now, you *could* do both steps in a single function. However, you still 
have to choose what the intermediate encoding form is, even if you never 
actually see it. Usually this will be UTF-8.
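Such a combined function might look like the following sketch (the helper name and its default parameter are hypothetical, not an existing API) - note that the intermediate encoding is still there, just hidden behind a default:

```python
import base64

def b64_of_text(text, encoding="utf-8"):
    """Hypothetical one-call helper: unicode string straight to base64 bytes.

    The intermediate character encoding still has to be chosen, even
    though the caller never sees the bytes it produces.
    """
    return base64.b64encode(text.encode(encoding))

b64_of_text("hello")              # uses UTF-8 as the intermediate form
b64_of_text("hello", "utf-16-le") # same text, different intermediate form
```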

-- Talin
