jcarlson at uci.edu
Sat Feb 18 08:05:48 CET 2006
Bob Ippolito <bob at redivi.com> wrote:
> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:
> > Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> >> Stephen J. Turnbull wrote:
> >>>>>>>> "Guido" == Guido van Rossum <guido at python.org> writes:
> >>> Guido> - b = bytes(t, enc); t = text(b, enc)
> >>> +1  The coding conversion operation has always felt like a
> >>> constructor to me, and in this particular usage that's exactly what
> >>> it is.  I prefer the nomenclature to reflect that.
> >> This also has the advantage that it completely
> >> avoids using the verbs "encode" and "decode"
> >> and the attendant confusion about which direction
> >> they go in.
> >> e.g.
> >> s = text(b, "base64")
> >> makes it obvious that you're going from the
> >> binary side to the text side of the base64
> >> conversion.
> > But you aren't always getting *unicode* text from the decoding of
> > bytes, and you may be encoding bytes *to* bytes:
> > b2 = bytes(b, "base64")
> > b3 = bytes(b2, "base64")
> > Which direction are we going again?
> This is *exactly* why the current set of codecs are INSANE.
> unicode.encode and str.decode should be used *only* for unicode
> codecs. Byte transforms are entirely different semantically and
> should be some other method pair.
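The direction confusion quoted above is easy to reproduce with the codecs
machinery. A minimal sketch (using the bytes->bytes base64 codec, which in
later Pythons is reachable only through the codecs module):

```python
import codecs

raw = b"python-dev"
# Both calls take bytes and return bytes; nothing in the types tells you
# which way through the base64 transform you are going:
b2 = codecs.encode(raw, "base64")   # bytes -> bytes ("encoding")
b3 = codecs.decode(b2, "base64")    # bytes -> bytes ("decoding")
assert b3 == raw
```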
The problem is that we are overloading data types. Strings (and bytes)
can contain encoded text as well as raw data, or even encoded data.
Unless the plan is to make bytes contain _only_ encoded unicode, or
_only_ data, or _only_ encoded data, the confusion for users may continue.
Me, I'm a fan of education. Educating your users is simple, and with good
exceptions and documentation, it gets easier. Raise an exception when a
user tries to use a codec that doesn't support a particular source type
('...'.decode('utf-8') should raise an error like "Cannot use text as a
source for 'utf-8' decoding" once unicode/text becomes the default format
for string literals).
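A hypothetical helper sketching the kind of descriptive failure proposed
here (decode_bytes is an invented name for illustration, not a real API):

```python
def decode_bytes(data, codec):
    """Decode bytes to text, refusing text input with a clear message."""
    if isinstance(data, str):
        # The proposed behavior: fail loudly, and say *why* it failed.
        raise TypeError(
            "Cannot use text as a source for %r decoding; "
            "pass bytes instead" % codec)
    return data.decode(codec)
```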
Tossing out bytes.encode(), along with bytes->bytes decoding, also raises
the question of text.decode() for pure text transformations. Are we going
to push all of those transformations somewhere else?
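For example, rot13 is a pure text->text transform; in later Pythons it
ended up reachable only through the codecs module rather than through
str methods, illustrating the "somewhere else" in question:

```python
import codecs

# A str -> str transform with no bytes involved on either side:
scrambled = codecs.encode("Hello", "rot_13")
assert scrambled == "Uryyb"
assert codecs.decode(scrambled, "rot_13") == "Hello"
```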
Look at what we've currently got going for data transformations in the
standard library to see what these removals will do: base64 module,
binascii module, binhex module, uu module, ... Do we want or need to
add another top-level module for every future encoding/codec that comes
out (or does everyone think that we're done seeing codecs)? Do we want
to keep monkey-patching binascii with names like 'a2b_hqx'? While there
is currently one text->text transform (rot13), do we add another module
for text->text transforms? Would it start having names like t2e_rot13()?
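The situation described above, one top-level module per data transform,
each with its own naming scheme, looks like this in practice:

```python
import base64
import binascii

raw = b"python-dev"
# Three spellings of "bytes->bytes transform", three unrelated APIs:
assert base64.b64decode(base64.b64encode(raw)) == raw
assert binascii.unhexlify(binascii.hexlify(raw)) == raw
assert binascii.a2b_uu(binascii.b2a_uu(raw)) == raw
```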
Educate the users. Raise better exceptions telling people why their
encoding or decoding failed, as Ian Bicking already pointed out. If
bytes.encode() and the equivalent of text.decode() are going to disappear,
Bengt Richter had a good idea with bytes.recode() for strictly
bytes->bytes transformations (and the equivalent for text), though it is
ambiguous as to direction: are we encoding or decoding with
bytes.recode()? In my opinion, this is why .encode() and .decode() make
sense to keep on both bytes and text: the direction is unambiguous, and
anyone with even a remote idea of what the codec is knows what their
result will be.
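A sketch of that ambiguity (recode is Bengt Richter's proposed name, not
a real method; the free function below is an invented stand-in):

```python
import codecs

def recode(data, codec):
    # Hypothetical single-method API: the call site cannot say which
    # direction it goes. Does recode(b"...", "base64") produce base64
    # output, or consume base64 input? One direction must be picked:
    return codecs.encode(data, codec)

# With a paired API, the direction is explicit at every call site:
wire = codecs.encode(b"data", "base64")          # clearly producing base64
assert codecs.decode(wire, "base64") == b"data"  # clearly consuming it
```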