[Python-3000] base64 - bytes and strings
talin at acm.org
Mon Jul 30 03:21:13 CEST 2007
Guido van Rossum wrote:
> On 7/29/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>> Martin v. Löwis wrote:
>>> The point that proponents of "base64 encoding should
>>> yield strings" miss is that US-ASCII is *both* a character set,
>>> and an encoding.
>> Last time we discussed this, I went and looked at the
>> RFC where base64 is defined. According to my reading of
>> it, nowhere does it say that base64 output must be
>> encoded as US-ASCII, nor any other particular encoding.
>> It *does* say that the characters used were chosen because
>> they are present in a number of different character sets
>> in use at the time, and explicitly mentions EBCDIC as one
>> of those character sets.
>> To me this quite clearly says that base64 is defined at
>> the level of characters, not encodings.
> I think it's all beside the point. We should look at the use cases. I
> recall finding out once that a Java base64 implementation was much
> slower than Python's -- turns out that the Java version was converting
> everything to Strings; then we needed to convert back to bytes in
> order to output them. My suspicion is that in the end using bytes is
> more efficient *and* more convenient; it might take some looking
> through the email package to confirm or refute this. (The email
> package hasn't been converted to work in the struni branch; that
> should happen first. Whoever does that might well be the one who tells
> us how they want their base64 APIs.)
> An alternative might be to provide both string- and bytes-based APIs,
> although that doesn't help with deciding what the default one (the one
> that uses the same names as 2.x) should do.
One has to be careful when comparing performance with Java, because you
need to specify whether you are using the "old" API or the "new" one.
(It seems that almost everything in Java has an old and new API.)
I just recently did some work in Java with base64 encoding, or more
specifically, URL-safe encoding. The library I was working with both
consumed and produced arrays of bytes. I think that this is the correct
way to do it.
In my specific use case, I was dealing with encrypted bytes, where the
encrypter also produced and consumed bytes, so it made sense that the
character encoder did the same. But even in the case where no encryption
is involved, I think dealing with bytes is right.
I believe that converting a Unicode string to a base64 encoded form is
necessarily a 2-step process. Step 1 is to convert from unicode
characters to bytes, using an appropriate character encoding (UTF-8,
UTF-16, and so on), and step 2 is to encode the bytes in base64. The
resulting encoded byte array is actually an ASCII-encoded string,
although it's more convenient in most cases to represent it as a byte
array than as a string object, since it's likely in most cases that you
are about to send it over the wire. In other words, it makes sense to
think about the conversion as (string -> bytes -> string), while the
actual objects being generated are (string -> bytes -> bytes).
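The two-step pipeline above can be sketched in a few lines of Python (a minimal illustration, not code from the original thread):

```python
import base64

text = "héllo"                    # a Unicode string
raw = text.encode("utf-8")        # step 1: string -> bytes (pick a charset)
encoded = base64.b64encode(raw)   # step 2: bytes -> base64 bytes
print(encoded)                    # b'aMOpbGxv'
```

Note that `encoded` is a bytes object whose contents happen to be ASCII characters, which is exactly the (string -> bytes -> bytes) shape described above.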
That two steps are needed is evident from the fact that two distinct
encodings are involved, and they are largely independent of each other.
For example, one could just as easily base64-encode a UTF-16 encoded
string as a UTF-8 encoded string. Being able to vary one encoding
without changing the other argues that they are separate, independent
steps.
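The independence of the two encodings is easy to demonstrate: the same Unicode string yields different base64 output depending on the intermediate byte encoding (again just a sketch for illustration):

```python
import base64

text = "hi"
# Same string, two different intermediate byte encodings:
print(base64.b64encode(text.encode("utf-8")))      # b'aGk='
print(base64.b64encode(text.encode("utf-16-le")))  # b'aABpAA=='
```

The base64 step is unchanged in both cases; only the string-to-bytes step varies.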
Nor can you collapse this into a single encoding step - you can't go
directly from an internal unicode string to base64, since a unicode
string is an array of code units ranging from 0 to 0xFFFF, and base64
can't encode a number larger than 255.
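This is visible in the API itself: in Python 3, base64.b64encode refuses a str and demands bytes, so the intermediate encoding step cannot be skipped.

```python
import base64

# Passing a str directly fails; bytes are required.
try:
    base64.b64encode("hello")
except TypeError:
    print("b64encode requires a bytes-like object, not str")

# The caller must encode first:
print(base64.b64encode("hello".encode("utf-8")))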
Now, you *could* do both steps in a single function. However, you still
have to choose what the intermediate encoding form is, even if you never
actually see it. Usually this will be UTF-8.
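Such a combined convenience function might look like the following sketch; the name b64_of_text and the UTF-8 default are my own choices for illustration, not an API anyone proposed:

```python
import base64

def b64_of_text(text, encoding="utf-8"):
    """Base64-encode a Unicode string via an intermediate byte encoding.

    The intermediate encoding is still a real parameter, even though the
    caller never sees the intermediate bytes.
    """
    return base64.b64encode(text.encode(encoding))

print(b64_of_text("hi"))                 # b'aGk='
print(b64_of_text("hi", "utf-16-le"))    # b'aABpAA=='
```

The intermediate form is hidden but not eliminated: changing the encoding argument changes the result.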