
Nick Coghlan <ncoghlan@gmail.com> wrote:
On Wed, Jun 1, 2011 at 2:16 AM, Bill Janssen <janssen@parc.com> wrote:
I like the deprecations you suggest, but I'd prefer to see a more general solution: the 'str' type extended so that it had two possible representations for strings, the current format and an "encoded" format, which would be kept as an array of bytes plus an encoding. It would transcode only as necessary -- for example, the 're' module might require the current Unicode encoding. An explicit method would be added to allow the user to force transcoding.
This would complicate life at the C level, to be sure. Though, perhaps not so much, given the proper macrology.
See PEP 393 - it is basically this idea
Should have realized Martin would have thought of this :-). I'm not sure how I missed it back in January -- high drama at work distracted me, I guess.

I might do it a bit differently, with just one pointer, say, "data", and a field which carries the encoding (possibly as a pointer to the appropriate codec). "data" would point to a buffer of the correct type. New strings would by default still be created as UCS-2 or UCS-4 Unicode, just as per today. I'd also allow any encoding which we have a codec for, so that if you are reading from a file containing encoded text, you can carry the exact bytes around unless you need to do something which isn't supported for that encoding -- in which case things get Unicodified behind the scenes.

We'd smarten the various string methods over time so that most of them would work so long as the operands matched. str.index, for instance, wouldn't require decoding unless the two strings were of different encodings. Yes, there'd be some "magic" going on, but it wouldn't be worse than the automatic coercions Python does now -- that's just what a HLL does for you.
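To make the "one data pointer plus an encoding field" idea concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical: the EncodedStr class, its data and encoding attributes, and the _BYTE_SEARCHABLE set are illustration only, not anything from PEP 393 or CPython. It assumes that byte-level search is only safe for single-byte encodings and UTF-8, where a byte match between two valid strings always falls on a character boundary.

```python
import codecs

# Assumption: encodings where a byte-level substring match is always
# a character-level match (single-byte encodings and UTF-8).
_BYTE_SEARCHABLE = {"ascii", "iso8859-1", "utf-8"}


class EncodedStr:
    """Hypothetical string that carries its raw bytes plus their encoding."""

    def __init__(self, data, encoding):
        self.data = bytes(data)
        # Normalize aliases such as "utf8" or "latin-1" to the codec's name.
        self.encoding = codecs.lookup(encoding).name
        self._text = None  # decoded form, filled in only when needed

    def decode(self):
        """Transcode to a regular str on demand and cache the result."""
        if self._text is None:
            self._text = self.data.decode(self.encoding)
        return self._text

    def index(self, sub):
        """Return the character offset of sub, like str.index."""
        if sub.encoding == self.encoding and self.encoding in _BYTE_SEARCHABLE:
            # Operands match: search the raw bytes directly.
            pos = self.data.index(sub.data)
            # Convert the byte offset to a character offset by decoding
            # only the prefix, not the whole string.
            return len(self.data[:pos].decode(self.encoding))
        # Encodings differ: "Unicodify" both operands behind the scenes.
        return self.decode().index(sub.decode())


text = EncodedStr("naïve café".encode("utf-8"), "utf8")
word = EncodedStr("café".encode("utf-8"), "utf-8")
print(text.index(word))  # fast path: text is never fully decoded
```

Note that even the fast path has to translate a byte offset into a character offset for variable-width encodings, which is one place where the "magic" has a real cost.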
(although the encodings are fixed for the various sizes rather than allowing arbitrary encodings in the 8-bit internal format).
IMO, the thing that bit us on the fundament with the 2.x str/unicode divide, and continues to bite us with the 3.x str/bytes divide, is that we don't carry the encoding as part of the 2.x 'str' value (or as part of the 3.x 'bytes' value). The key here is to store the encoding internally in the string object, so that it's available to do automatic coercion when necessary, rather than *requiring* all coercions to be done manually by some program code.

Bill
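A small sketch of what that automatic coercion could look like, assuming a hypothetical TaggedBytes wrapper (not an existing type) that remembers its encoding. Plain 3.x bytes raise TypeError when mixed with str, because the bytes value has no idea how to decode itself; a value that carries its encoding can do the conversion on its own.

```python
class TaggedBytes:
    """Hypothetical bytes value that remembers its encoding."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding

    def __str__(self):
        # The stored encoding is what makes coercion possible at all.
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if isinstance(other, TaggedBytes):
            if other.encoding == self.encoding:
                # Same encoding: keep the exact bytes, no decoding.
                # (Assumes an encoding without a BOM or shift state.)
                return TaggedBytes(self.data + other.data, self.encoding)
            # Different encodings: coerce both sides to str.
            return str(self) + str(other)
        if isinstance(other, str):
            return str(self) + other
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented


latin = TaggedBytes("é".encode("latin-1"), "latin-1")
utf8 = TaggedBytes("é".encode("utf-8"), "utf-8")
mixed = latin + utf8       # decoded automatically; result is a str
same = latin + latin       # stays as tagged bytes
greeting = "caf" + latin   # str + TaggedBytes also works via __radd__
```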