Stephen J. Turnbull
stephen at xemacs.org
Thu Feb 23 07:05:43 CET 2006
>>>>> "Greg" == Greg Ewing <greg.ewing at canterbury.ac.nz> writes:
Greg> Stephen J. Turnbull wrote:
>> Base64 is a (family of) wire protocol(s). It's not clear to me
>> that it makes sense to say that the alphabets used by "baseNN"
>> encodings are composed of characters,
Greg> Take a look at [this that the other]
Those references use "character" in an ambiguous and ill-defined way.
Trying to impose Python unicode object semantics on "vague characters"
is a bad idea IMO.
Greg> Which seems to make it perfectly clear that the result of
Greg> the encoding is to be considered as characters, which are
Greg> not necessarily going to be encoded using ascii.
Please define "character," and explain how its semantics map to
Python's unicode objects.
Greg> So base64 on its own is *not* a wire protocol. Only after
Greg> encoding the characters do you have a wire protocol.
No, base64 isn't a wire protocol. Rather, it's a schema for a family
of wire protocols, whose alphabets are heuristically chosen on the
assumption that code units which happen to correspond to alpha-numeric
code points in a commonly-used coded character set are more likely to
pass through a communication channel without corruption.
Note that I have _precisely_ defined what I mean. You still have the
problem that you haven't defined character, and that is a real
problem; see below.
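To make the "family" point concrete, here is a sketch in Python 3
terms (using the stdlib base64 module): two members of the family
differ only in the code units chosen for values 62 and 63:

    import base64

    data = b"\xfb\xff\xfe"

    # Two members of the "baseNN" family, differing only in alphabet:
    base64.b64encode(data)          # b'+//+'  -- standard alphabet ends '+', '/'
    base64.urlsafe_b64encode(data)  # b'-__-'  -- URL-safe member uses '-', '_'

Note that both functions return bytes, not str: the output is a
sequence of code units chosen for channel safety, not characters in
the Python-string sense.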
>> I don't see any case for "correctness" here, only for convenience.
Greg> I'm thinking of convenience, too. Keep in mind that in Py3k,
Greg> 'unicode' will be called 'str' (or something equally neutral
Greg> like 'text') and you will rarely have to deal explicitly
Greg> with unicode codings, this being done mostly for you by the
Greg> I/O objects. So most of the time, using base64 will be just
Greg> as convenient as it is today: base64_encode(my_bytes) and
Greg> write the result out somewhere.
Convenient, yes, but incorrect. Once you mix those bytes into the
Python string type, they become subject to all the usual operations on
characters, and there's no way for Python to tell you that you didn't
want to do that. I.e.:
Greg> Whereas if the result is text, the right thing happens
Greg> automatically whatever the ultimate encoding turns out to
Greg> be. You can take the text from your base64 encoding, combine
Greg> it with other text from any other source to form a complete
Greg> mail message or xml document or whatever, and write it out
Greg> through a file object that's using any unicode encoding at
Greg> all, and the result will be correct.
Only if you do no transformations that will harm the base64-encoding.
This is why I say base64 is _not_ based on characters, at least not in
the way they are used in Python strings. It doesn't allow any of the
usual transformations on characters that might be applied globally to
a mail composition buffer, for example.
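To make the hazard concrete, a sketch in Python 3 terms (the header
name is made up): lowercasing a composition buffer is harmless for
prose, but silently corrupts an embedded base64 payload, because the
base64 alphabet is case-sensitive:

    import base64

    payload = base64.b64encode(b"\x00\xffhi").decode("ascii")  # 'AP9oaQ=='
    buffer = "X-Demo: 1\n\n" + payload

    mangled = buffer.lower()           # a routine operation on characters
    recovered = base64.b64decode(mangled.split("\n\n")[1])
    assert recovered != b"\x00\xffhi"  # lowercased input is still *valid*
                                       # base64, so it decodes, silently,
                                       # to the wrong bytes

Nothing in the string type can flag this, because as far as characters
are concerned, nothing went wrong.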
In other words, you don't escape the need for the programmer to know
what he's doing. EIBTI, and the setup I advocate forces the
programmer to decide explicitly where to convert base64 objects to a
textual representation. This reminds him that he'd better not touch
the result after that point.
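A minimal sketch of that discipline in Python 3 terms (the function
name is hypothetical): the base64 result stays bytes until the single,
explicit point where the programmer decides it becomes text:

    import base64

    def attach_payload(body: str, blob: bytes) -> str:
        # The explicit conversion point: here, and only here, the
        # programmer declares the base64 bytes to be ASCII text.
        payload = base64.b64encode(blob).decode("ascii")
        return body + "\n\n" + payload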
Greg> The reason I say it's *correct* is that if you go straight
Greg> from bytes to bytes, you're *assuming* the eventual encoding
Greg> is going to be an ascii superset. The programmer is going
Greg> to have to know about this assumption and understand all its
Greg> consequences and decide whether it's right, and if not, do
Greg> something to change it.
I'm not assuming any such thing, except in the context of analyzing
implementation efficiency. And the programmer needs to know the
semantics of text that is actually a base64-encoded object, and that
they differ from string semantics.
This is something that programmers are used to dealing with in the
case of Python 2.x str and C char; the whole point of the unicode
type is to allow the programmer to abstract from that when dealing
with human-readable text. Why confuse the issue?
>> And in the classroom, you're just going to confuse students by
>> telling them that UTF-8 --[Unicode codec]--> Python string is
>> decoding but UTF-8 --[base64 codec]--> Python string is
>> encoding, when MAL is telling them that --> Python string is
>> always decoding.
Greg> Which is why I think that only *unicode* codings should be
Greg> available through the .encode and .decode interface. Or
Greg> alternatively there should be something more explicit like
Greg> .unicode_encode and .unicode_decode that is thus restricted.
Greg> Also, if most unicode coding is done in the I/O objects,
Greg> there will be far less need for programmers to do explicit
Greg> unicode coding in the first place, so likely it will become
Greg> more of an advanced topic, rather than something you need to
Greg> come to grips with on day one of using unicode, like it is now.
So then you bring it right back in with base64. Now they need to know
about bytes<->unicode codecs.
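To spell out the terminology clash, a minimal sketch in Python 3
terms (MAL's rule is the one quoted above):

    import base64

    raw = "naïve".encode("utf-8")  # text -> bytes: "encoding"
    text = raw.decode("utf-8")     # bytes -> text: "decoding" -- so far
                                   # "-> str is always decoding" holds

    # But if base64 output is "characters", this bytes -> str step has
    # to be called *encoding*, and the simple rule is broken:
    b64 = base64.b64encode(raw).decode("ascii")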
Of course it all comes down to a matter of judgment. I do find your
position attractive, but I just don't think it will work for naive
users the way you think it will. It's also possible to make a precise
statement of the rationale for my approach, which I have not been able
to achieve for the "base64 uses characters" approach, and nobody else
has demonstrated one, yet.
On the other hand, I don't think either approach imposes substantially
more burden on the advanced programmer, nor does either proposal
involve a specific restriction on usage (aka "dumbing down the
language").
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.