[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Antoine Pitrou solipsis at pitrou.net
Mon Oct 24 23:22:23 CEST 2005


> There are many design alternatives: one option would be to support
> *three* internal representations in a single type, generating the
> others from the one existing as needed. The default, initial
> representation might be UTF-8, with UCS-4 only being generated when
> indexing occurs, and UCS-2 only being generated when the API requires
> it. On concatenation, always concatenate just one representation: either
> one that is already present in both operands, else UTF-8.

Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

Then combining several strings means using the larger representation as
a result (*). In practice, most use cases will not involve the four-byte
representation.
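
As a rough illustration of the rule above, here is a minimal Python sketch. The helper names min_width and concat_width are hypothetical, made up for this example; a real implementation would live in C inside the string object and would store the width rather than recompute it.

```python
def min_width(text: str) -> int:
    """Return the narrowest per-character width (in bytes) able to hold text.

    Hypothetical helper: scans every code point, which is O(len(text)).
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def concat_width(a: str, b: str) -> int:
    """Width of a + b: simply the larger of the two operand widths."""
    return max(min_width(a), min_width(b))


print(min_width("abc"))              # 1
print(min_width("\u20ac"))           # 2 (EURO SIGN, U+20AC)
print(concat_width("abc", "\u20ac")) # 2
```

The point is that concatenation never needs to rescan the result: the wider operand already determines the representation.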

(*) a heuristic can be invented so that, when producing a smaller string
(by stripping/slicing/etc.), it will "sometimes" check whether a
narrower representation is possible.
For example: store the length of the string when the last check
occurred, and do a new check when the length falls below half that
value.
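
A possible shape for that heuristic, again only as a sketch: the FlexStr class, its attribute names and its slice method are all assumptions made for illustration, not an existing API.

```python
def min_width(text: str) -> int:
    """Narrowest width in bytes (1, 2 or 4) able to hold every character."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


class FlexStr:
    """Hypothetical string wrapper that remembers its storage width."""

    def __init__(self, text, width=None, checked_len=None):
        self.text = text
        if width is None:
            # Full scan: record the width and the length at check time.
            self.width = min_width(text)
            self.checked_len = len(text)
        else:
            # Inherit the parent's width without scanning.
            self.width = width
            self.checked_len = checked_len

    def slice(self, start, stop):
        piece = self.text[start:stop]
        if len(piece) < self.checked_len / 2:
            # Shrunk below half the length seen at the last check:
            # rescan, since a narrower representation may now fit.
            return FlexStr(piece)
        # Otherwise keep the current (possibly too wide) representation.
        return FlexStr(piece, self.width, self.checked_len)


s = FlexStr("\u20ac" + "a" * 9)   # width 2, checked at length 10
t = s.slice(1, 10)                # 9 chars: no rescan, stays at width 2
u = t.slice(0, 4)                 # 4 chars < 10 / 2: rescan, width 1
print(s.width, t.width, u.width)  # 2 2 1
```

Because a rescan only happens after the length has halved, its cost stays proportional to the characters already dropped, so small repeated slices do not each pay for a full scan.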

Regards

Antoine.



