differences between 22 and python 23

Sun Dec 7 19:49:59 EST 2003

On Sun, 07 Dec 2003 21:39:03 +0100, "Martin v. Löwis" <martin at v.loewis.de> wrote:

>Bengt Richter wrote:
>
>> This sounds very similar to what I have been trying to say.
>
>I would really suggest that you implement your ideas. You
>*will* find that they are unimplementable. After adjusting
>the ideas to constraints of reality, you *will* find that
>you break backwards compatibility. After fixing the backwards
>compatibility problems, you *will* find that your implementation
>has very bad performance characteristics, compared to the
>existing string types.
>
>Unfortunately, it is very difficult to nail down the problems
>you will encounter, as you refuse to provide a complete
I may be failing to communicate, but I am not refusing ;-)

>specification of the interface and implementation that you
>propose. Originally, I thought you are proposing modifications
>to <type 'str'>, but now it appears that you are proposing
I was, and am (though I confess that discussion is making it
a somewhat moving target, as more ideas pop up ;-)
>a new data type, which has large similarities with <type
>'unicode'>. If so, I fail to understand why you don't want
>to use the existing Unicode type.
It's not that I don't want to use it. I think it's great. I am just
trying to sort out where the existing str is really being used as a character
sequence type and where it is really a byte buffer type that has no character
significance until a decoding interpretation is imposed.

As I've said, ISTM most string literals in program sources are really character
strings, and could well become unicode right in the tokenizer (that part is a new
statement ;-) But there are some string literals that really do represent bytes,
not characters, and there is a legacy of byte-producing object interfaces that
claim to be producing the same type thing (str) as string literals. This is problematic IMO.

ISTM there really are two different types that need to be disentangled (charstring vs bytestring,
or chars vs bytes for short).

I was trying to imagine tweaks to str that could make it play both roles more explicitly,
but thinking more, it's probably a wrong approach. Bite the bullet and separate them ;-)

OTTOMH I don't think charstring should be unicode, because that excludes custom character sets that
may not exist in unicode, and/or will be a pain to create a private unicode map for. That's
why I think charstring must effectively be something like a (bytestring, codec) pair.
Undoubtedly it will have an intimate optimized relationship to unicode. But a charstring type
would have the option of delegating efficiently in single-encoding environments, IWT, where
things would work much as they do now.
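
A minimal sketch of the (bytestring, codec) pair idea, written in present-day Python 3, where bytes and str are already separate types. The class name CharString, its methods, and the choice of UTF-8 as the fallback encoding are all hypothetical illustrations, not anything specified in this thread. The same-codec branch is meant to show the "delegating efficiently in single-encoding environments" case; it assumes a stateless codec, so it would not be safe for encodings that emit a byte-order mark.

```python
import codecs


class CharString:
    """Hypothetical charstring: raw bytes plus the codec that gives them meaning."""

    def __init__(self, data: bytes, codec: str):
        self.data = data
        # Normalize the codec name so "latin-1" and "latin_1" compare equal.
        self.codec = codecs.lookup(codec).name

    def to_unicode(self) -> str:
        # Impose the decoding interpretation on the underlying bytes.
        return self.data.decode(self.codec)

    def __len__(self) -> int:
        # Length in characters, not in bytes.
        return len(self.to_unicode())

    def __eq__(self, other):
        if not isinstance(other, CharString):
            return NotImplemented
        # Two charstrings are equal when they spell the same characters,
        # regardless of how each one is encoded.
        return self.to_unicode() == other.to_unicode()

    def __add__(self, other):
        if not isinstance(other, CharString):
            return NotImplemented
        if self.codec == other.codec:
            # Single-encoding fast path: no decoding needed
            # (assumes a stateless codec without a byte-order mark).
            return CharString(self.data + other.data, self.codec)
        # Mixed encodings: go through unicode, then re-encode.
        # UTF-8 as the result encoding is an arbitrary choice for this sketch.
        joined = self.to_unicode() + other.to_unicode()
        return CharString(joined.encode("utf-8"), "utf-8")


latin = CharString("caf\u00e9".encode("latin-1"), "latin-1")
utf8 = CharString("caf\u00e9".encode("utf-8"), "utf-8")
same = latin + latin   # stays in the original encoding
mixed = latin + utf8   # falls back to the unicode route
```

Here latin and utf8 hold different bytes but compare equal as character sequences, which is the distinction between the two roles that str is currently asked to play.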

How to bring in the bytestring type is "interesting". I guess one approach would be to let it
be str minus its current charstring uses. But there are so many charstring uses that maybe it makes
more sense to let str be the charstring. But there are so many legacy interfaces that produce
bytestrings as str ;-/ Round and round. So I am pulled back to the idea of .coding-attributed str's,
even though it's not too clean. Maybe charstrings as a str subclass? Ugh. I think
conceptual correctness (or lack thereof) has bitten ;-/
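
A rough sketch of what a .coding-attributed string might look like, again in present-day Python 3. Since Python 3's str is already unicode, the sketch subclasses bytes as a stand-in for the byte-oriented str discussed here. The name CodedStr and the chars() method are hypothetical. It also shows one possible reason the approach is "not too clean": the inherited operations return plain bytes, so the coding is silently lost unless every method is overridden.

```python
class CodedStr(bytes):
    """Hypothetical byte string carrying an optional .coding attribute."""

    def __new__(cls, data: bytes, coding=None):
        self = super().__new__(cls, data)
        # coding=None marks a pure bytestring with no character meaning.
        self.coding = coding
        return self

    def chars(self) -> str:
        if self.coding is None:
            raise TypeError("plain bytestring: no character interpretation")
        return self.decode(self.coding)


text = CodedStr(b"abc", "ascii")
raw = CodedStr(b"\x00\xff")

# Inherited concatenation returns a plain bytes object,
# so the result no longer knows its coding.
joined = text + text
```

In this sketch text.chars() gives the character view, raw.chars() raises TypeError, and joined is an ordinary bytes object without a coding attribute.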

>
>Notice that /F has something completely different in mind:
>He is still talking about the Python Unicode type, and just
>suggesting that a different internal representation should
>be used. Speculating about the motivation, I would think he
>has efficiency in the face of round-trip conversions in mind,
>but not a change in visible behaviour.
That's not completely different ;-) I was trying to tweak str to play
its dual roles of charstring and bytestring explicitly, and I think the
charstring side is almost exactly what /F was talking about. A reasonable
implementation would be to have charstring be /F's unicode exactly, though
I'm always wanting to be "conceptually correct" so the concept of charstring
would have to include totally arbitrary character sets, not just those found
in unicode ;-)

Regards,
Bengt Richter