Re: [Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

June 1, 2011

Nick Coghlan <ncoghlan@gmail.com> wrote:
...
> On Wed, Jun 1, 2011 at 2:16 AM, Bill Janssen <janssen@parc.com> wrote:
...
>> I like the deprecations you suggest, but I'd prefer to see a more
>> general solution: the 'str' type extended so that it had two possible
>> representations for strings, the current format and an "encoded" format,
>> which would be kept as an array of bytes plus an encoding.  It would
>> transcode only as necessary -- for example, the 're' module might
>> require the current Unicode encoding.  An explicit method would be added
>> to allow the user to force transcoding.
>>
>> This would complicate life at the C level, to be sure.  Though, perhaps
>> not so much, given the proper macrology.
>
> See PEP 393 - it is basically this idea

Should have realized Martin would have thought of this :-).  I'm not
sure how I missed it back in January -- high drama at work distracted
me, I guess.

I might do it a bit differently, with just one pointer, say, "data", and
a field which carries the encoding (possibly as a pointer to the
appropriate codec).  "data" would point to a buffer of the correct type.
New strings would by default still be created as UCS-2 or UCS-4 Unicode,
just as per today.
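
A minimal sketch of that layout, in Python rather than C for brevity.
The class name `EncodedStr`, the `codec` attribute, and the `transcode`
method are hypothetical names chosen for illustration; nothing like
this exists in CPython.

```python
import codecs


class EncodedStr:
    """Hypothetical sketch: one buffer plus the codec that describes it."""

    def __init__(self, data, encoding=None):
        if isinstance(data, bytes):
            if encoding is None:
                raise TypeError("bytes data needs an encoding")
            # Keep the looked-up codec, i.e. "a pointer to the codec".
            self.codec = codecs.lookup(encoding)
        else:
            # Plain str: the default, native Unicode representation.
            self.codec = None
        self.data = data

    @property
    def encoding(self):
        # None means "already native Unicode".
        return self.codec.name if self.codec else None

    def transcode(self):
        """Explicitly force the native Unicode representation."""
        if self.codec is not None:
            self.data, _ = self.codec.decode(self.data)
            self.codec = None
        return self
```

Until `transcode()` is called, `data` holds the original bytes
untouched; afterwards it holds an ordinary `str`.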

I'd also allow any encoding which we have a codec for, so that if you
are reading from a file containing encoded text, you can carry the exact
bytes around unless you need to do something which isn't supported for
that encoding -- in which case things get Unicodified behind the scenes.
We'd smarten the various string methods over time so that most of them
would work so long as the operands matched.  str.index, for instance,
wouldn't require decoding unless the two strings were of different
encodings.  Yes, there'd be some "magic" going on, but it wouldn't be
worse than the automatic coercions Python does now -- that's just what a
HLL does for you.
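
A rough sketch of how such an `index` might behave.  `LazyStr`, the
`SINGLE_BYTE` set, and the `decodes` counter are all hypothetical.  One
narrowing that is this sketch's assumption, not part of the proposal
above: the no-decode path is taken only for single-byte encodings,
where a byte offset is also a character offset.  A variable-width
encoding would need an offset conversion, so here it simply falls back
to decoding.

```python
# Assumption: an illustrative subset of single-byte encodings.
SINGLE_BYTE = {"latin-1", "ascii", "cp1252"}


class LazyStr:
    """Hypothetical string that carries its bytes and their encoding."""

    def __init__(self, data, encoding):
        self.data = data          # raw bytes, exactly as read
        self.encoding = encoding  # name of the codec describing `data`
        self.decodes = 0          # how often this value was Unicodified

    def text(self):
        # Decode on demand; count it so the fast path is observable.
        self.decodes += 1
        return self.data.decode(self.encoding)

    def index(self, sub):
        if self.encoding == sub.encoding and self.encoding in SINGLE_BYTE:
            # Same single-byte encoding: search the raw bytes directly.
            return self.data.index(sub.data)
        # Encodings differ (or are variable-width): decode both sides.
        return self.text().index(sub.text())
```

With two Latin-1 operands the search never decodes; mixing Latin-1 and
UTF-8 operands decodes each side once and returns the same character
offset.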
...
> (although the encodings are
> fixed for the various sizes rather than allowing arbitrary encodings
> in the 8-bit internal format).

IMO, the thing that bit us on the fundament with the 2.x str/unicode
divide, and continues to bite us with the 3.x str/bytes divide, is that
we don't carry the encoding as part of the 2.x 'str' value (or as part
of the 3.x 'bytes' value).  The key here is to store the encoding
internally in the string object, so that it's available to do automatic
coercion when necessary, rather than *requiring* all coercions to be
done manually by some program code.
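
To illustrate the difference: in 3.x today, `b"caf\xe9" + "!"` raises
TypeError, and the program has to know the encoding and call
`.decode()` itself.  The sketch below shows what automatic coercion
could look like if the value carried its own encoding.  `TaggedBytes`
is a hypothetical name, and the same-encoding concatenation assumes a
stateless encoding (no BOM or shift state).

```python
class TaggedBytes:
    """Hypothetical bytes value that remembers its encoding."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding

    def __str__(self):
        # The stored encoding makes decoding possible without outside help.
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if isinstance(other, TaggedBytes):
            if other.encoding == self.encoding:
                # Same encoding: join the bytes, no decoding needed.
                return TaggedBytes(self.data + other.data, self.encoding)
            # Different encodings: coerce both sides to Unicode.
            return str(self) + str(other)
        if isinstance(other, str):
            return str(self) + other
        return NotImplemented

    def __radd__(self, other):
        # Handles `"text" + tagged`.
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented
```

Here the coercion rule lives in the value itself rather than in every
caller.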

Bill