bokr at oz.net
Sat Feb 18 08:24:31 CET 2006
On Fri, 17 Feb 2006 20:33:16 -0800, Josiah Carlson <jcarlson at uci.edu> wrote:
>Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>> Stephen J. Turnbull wrote:
>> >>>>>>"Guido" == Guido van Rossum <guido at python.org> writes:
>> > Guido> - b = bytes(t, enc); t = text(b, enc)
>> > +1 The coding conversion operation has always felt like a constructor
>> > to me, and in this particular usage that's exactly what it is. I
>> > prefer the nomenclature to reflect that.
>> This also has the advantage that it completely
>> avoids using the verbs "encode" and "decode"
>> and the attendant confusion about which direction
>> they go in.
>> s = text(b, "base64")
>> makes it obvious that you're going from the
>> binary side to the text side of the base64
>But you aren't always getting *unicode* text from the decoding of bytes,
>and you may be encoding bytes *to* bytes:
> b2 = bytes(b, "base64")
> b3 = bytes(b2, "base64")
>Which direction are we going again?
Well, base64 is probably not your best example, because it necessarily involves characters ;-)
If you are using "base64" you are looking at characters in your input to
produce your bytes output. The only way you can see characters in bytes input
is to decode them. So you are hiding your assumption about b's encoding.
You can make useful rules of inference from type(b), but with bytes you really
don't know. "base64" has to interpret b bytes as characters, because that's what
it needs to recognize base64 characters, to produce the output bytes.
The characters in b could be encoded in plain ascii or in utf16le; you have to know which.
So for utf16le it should be
b2 = bytes(text(b, 'utf16le'), "base64")
just because you assume an implicit
b2 = bytes(text(b, 'ascii'), "base64")
doesn't make it so in general. Even if you build that assumption in,
it's not really true that you are going "bytes *to* bytes" without characters
involved when you do bytes(b, "base64"). You have just left undocumented an API restriction
(assert <bytes input is an ascii encoding of base64 characters>) and an implementation
This is the trouble with str.encode and unicode.decode. They both hide implicit
decodes and encodes respectively. They should be banned IMO. Let people spell it out
and maybe understand what they are doing.
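To make the base64 point above concrete, here is a minimal sketch. It assumes present-day Python 3 and its standard base64 module, neither of which is the bytes()/text() API being discussed in this thread. The helper from_base64 is a hypothetical stand-in for bytes(text(b, enc), "base64"). It only illustrates that the same base64 characters can arrive as different bytes, and that a decoder which silently assumes ascii rejects the utf16le form.

```python
import base64
import binascii

payload = b"hello"

# The base64 *characters* for the payload, as text.
b64_chars = base64.b64encode(payload).decode("ascii")

# The same characters stored as bytes in two different encodings.
as_ascii = b64_chars.encode("ascii")
as_utf16 = b64_chars.encode("utf-16-le")


def from_base64(b, enc):
    """Hypothetical helper spelling out bytes(text(b, enc), "base64")."""
    chars = b.decode(enc)  # the explicit decode step
    return base64.b64decode(chars, validate=True)


# Naming the encoding works for either representation.
assert from_base64(as_ascii, "ascii") == payload
assert from_base64(as_utf16, "utf-16-le") == payload

# Handing the utf16le bytes to a decoder that assumes ascii fails,
# because the interleaved NUL bytes are not base64 characters.
try:
    base64.b64decode(as_utf16, validate=True)
    implicit_ascii_ok = True
except binascii.Error:
    implicit_ascii_ok = False
```

Note that validate=True matters here: without it, b64decode discards bytes outside the base64 alphabet, which would hide the encoding assumption rather than expose it.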
OTOH, a bytes-to-bytes codec might be decompressing tgz into tar. For conceptual consistency,
one might define a 'bytes' encoding that conceptually turns bytes into unicode byte characters and
vice versa. Then "gunzip" can decode bytes, producing unicode characters which are then
encoded back to bytes from the unicode ;-) The 'bytes' encoding would numerically be just like
latin-1 except on the unicode side it would have wrapped-bytes internal representation.
b_tar = bytes(text(b_tgz, 'gunzip'), 'bytes')
Of course, text(b_tgz, 'gunzip') would produce unicode text with a special internal representation that
just wraps bytes, though the characters are true unicode. The 'bytes' codec's encode would of course just unwrap the
internal bytes representation, but it would conceptually be an encoding into bytes. bytes(t, 'latin-1')
would produce the same output from the wrapped-bytes unicode.
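A rough sketch of the 'bytes' codec idea, again assuming present-day Python 3. There is no 'bytes' or 'gunzip' codec in the standard library, so latin-1 stands in for the hypothetical 'bytes' encoding (it is numerically the same mapping, without any wrapped-bytes internal representation), and the two helper functions are hypothetical stand-ins for text(b, 'gunzip') and bytes(t, 'bytes').

```python
import gzip

# latin-1 maps byte value n to code point n, for all 256 byte values,
# which is the numeric behaviour described for the 'bytes' encoding.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
assert [ord(c) for c in as_text] == list(range(256))
assert as_text.encode("latin-1") == all_bytes


def gunzip_to_text(b):
    """Hypothetical stand-in for text(b, 'gunzip')."""
    return gzip.decompress(b).decode("latin-1")


def text_to_bytes(t):
    """Hypothetical stand-in for bytes(t, 'bytes')."""
    return t.encode("latin-1")


original = bytes(range(256)) * 4       # arbitrary binary payload
b_tgz = gzip.compress(original)

# b_tar = bytes(text(b_tgz, 'gunzip'), 'bytes')
b_tar = text_to_bytes(gunzip_to_text(b_tgz))
assert b_tar == original

# The detour through text gives the same result as going
# bytes-to-bytes directly.
assert b_tar == gzip.decompress(b_tgz)
```

The text detour adds nothing computationally; it only keeps the rule that decoding always yields characters and encoding always yields bytes.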
Sometimes conceptual purity can clarify things and sometimes it's just another confusing description.