[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Wed Jun 1 18:34:03 CEST 2011

Nick Coghlan <ncoghlan at gmail.com> wrote:

> On Wed, Jun 1, 2011 at 2:16 AM, Bill Janssen <janssen at parc.com> wrote:
> > I like the deprecations you suggest, but I'd prefer to see a more
> > general solution: the 'str' type extended so that it had two possible
> > representations for strings, the current format and an "encoded" format,
> > which would be kept as an array of bytes plus an encoding.  It would
> > transcode only as necessary -- for example, the 're' module might
> > require the current Unicode encoding.  An explicit method would be added
> > to allow the user to force transcoding.
> >
> > This would complicate life at the C level, to be sure.  Though, perhaps
> > not so much, given the proper macrology.
> 
> See PEP 393 - it is basically this idea

Should have realized Martin would have thought of this :-).  I'm not
sure how I missed it back in January -- high drama at work distracted
me, I guess.

I might do it a bit differently, with just one pointer, say, "data", and
a field which carries the encoding (possibly as a pointer to the
appropriate codec).  "data" would point to a buffer of the correct type.
New strings would by default still be created as UCS-2 or UCS-4 Unicode,
just as per today.

I'd also allow any encoding which we have a codec for, so that if you
are reading from a file containing encoded text, you can carry the exact
bytes around unless you need to do something which isn't supported for
that encoding -- in which case things get Unicodified behind the scenes.
We'd smarten the various string methods over time so that most of them
would work without transcoding so long as the operands' encodings
matched.  str.index, for instance, wouldn't require decoding unless the
two strings were of different encodings.  Yes, there'd be some "magic"
going on, but it wouldn't be
worse than the automatic coercions Python does now -- that's just what a
HLL does for you.
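
To make the idea concrete, here is a rough Python sketch of a string
that keeps its original bytes plus an encoding and decodes only when
it has to.  The class name EncodedStr, its attributes, and the choice
to limit the fast path to single-byte encodings are all assumptions
made for illustration; none of this is part of PEP 393 or CPython.

```python
import codecs


class EncodedStr:
    """Hypothetical string: raw bytes plus the name of their encoding."""

    # Encodings with one byte per character, so a byte offset found in
    # the raw data is also the character offset (assumed fast-path set).
    _SINGLE_BYTE = frozenset(
        codecs.lookup(name).name for name in ("ascii", "latin_1")
    )

    def __init__(self, data, encoding):
        self.data = bytes(data)
        # Normalise aliases such as "latin_1" and "L1" to one codec name.
        self.encoding = codecs.lookup(encoding).name

    def decode(self):
        """Explicitly force transcoding to an ordinary str."""
        return self.data.decode(self.encoding)

    def index(self, sub):
        same = self.encoding == sub.encoding
        if same and self.encoding in self._SINGLE_BYTE:
            # Fast path: the operands match, so search the exact bytes.
            return self.data.index(sub.data)
        # Slow path: "Unicodify" both operands behind the scenes.
        return self.decode().index(sub.decode())


haystack = EncodedStr("café au lait".encode("latin_1"), "latin_1")
same_enc = EncodedStr("au".encode("latin_1"), "L1")
other_enc = EncodedStr("é".encode("utf-8"), "utf-8")

fast = haystack.index(same_enc)    # no decoding performed
slow = haystack.index(other_enc)   # both sides decoded first
```

A multi-byte encoding would need more care on the fast path, since a
byte offset is no longer a character offset; the sketch simply falls
back to decoding in that case.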

> (although the encodings are
> fixed for the various sizes rather than allowing arbitrary encodings
> in the 8-bit internal format).

IMO, the thing that bit us on the fundament with the 2.x str/unicode
divide, and continues to bite us with the 3.x str/bytes divide, is that
we don't carry the encoding as part of the 2.x 'str' value (or as part
of the 3.x 'bytes' value).  The key here is to store the encoding
internally in the string object, so that it's available to do automatic
coercion when necessary, rather than *requiring* all coercions to be
done manually by some program code.
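
A small Python illustration of the point, under the assumption that a
value is modelled as a plain (data, encoding) pair: the helper concat
is hypothetical, and it compares encoding names as plain strings and
falls back to UTF-8 purely to keep the sketch short.

```python
raw = "é".encode("utf-8")

# A bytes object does not remember how it was produced.  The caller has
# to supply the encoding, and a wrong guess silently gives other text.
right = raw.decode("utf-8")
wrong = raw.decode("latin_1")


def concat(a, b):
    """Join two (data, encoding) pairs, transcoding only if they differ."""
    (a_data, a_enc), (b_data, b_enc) = a, b
    if a_enc == b_enc:
        # Same encoding on both sides: keep the exact bytes.
        return a_data + b_data, a_enc
    # Different encodings: coerce automatically instead of asking the
    # program to do it by hand.
    text = a_data.decode(a_enc) + b_data.decode(b_enc)
    return text.encode("utf-8"), "utf-8"


same = concat((b"caf", "latin_1"), ("é".encode("latin_1"), "latin_1"))
mixed = concat((b"caf", "latin_1"), ("é".encode("utf-8"), "utf-8"))
```

Because each value carries its encoding, the mixed case can be handled
without the caller ever naming a codec.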

Bill