[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Mon Oct 24 22:44:38 CEST 2005

Neil Hodgson wrote:
>    For Windows, the code will get a little uglier, needing to perform
> an allocation/encoding and deallocation more often than at present but
> I don't think there will be a speed degradation as Windows is
> currently performing a conversion from 8 bit to UTF-16 inside many
> system calls.
[...]
> 
>    For indexing UTF-16, a flag could be set to show if the string is
> all in the base plane and if not, an index could be constructed when
> and if needed.

There are many design alternatives: one option would be to support
*three* internal representations in a single type, generating the
others from the one already existing as needed. The default, initial
representation might be UTF-8, with UCS-4 only being generated when
indexing occurs, and UCS-2 only being generated when the API requires
it. On concatenation, always concatenate just one representation: either
one that is already present in both operands, or else UTF-8.
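
As a rough illustration of that scheme, here is a minimal Python sketch.
It is not a proposed implementation: the class name LazyText, its
methods, and the "utf8"/"ucs2"/"ucs4" labels are all hypothetical, and
UTF-16-LE stands in for UCS-2, so the sketch ignores the surrogate
question entirely.

```python
# Hypothetical sketch: LazyText and all names below are illustrative only.
# Assumption: "ucs2" is approximated by UTF-16-LE, "ucs4" by UTF-32-LE.
_CODECS = {"utf8": "utf-8", "ucs2": "utf-16-le", "ucs4": "utf-32-le"}


class LazyText:
    def __init__(self, data, kind="utf8"):
        # Start with exactly one representation; the others appear on demand.
        self._reps = {kind: bytes(data)}

    def _get(self, kind):
        if kind not in self._reps:
            # Derive the missing form from any representation already held.
            src_kind, src = next(iter(self._reps.items()))
            text = src.decode(_CODECS[src_kind])
            self._reps[kind] = text.encode(_CODECS[kind])
        return self._reps[kind]

    def kinds(self):
        # Which representations currently exist (for inspection only).
        return set(self._reps)

    def __len__(self):
        return len(self._get("ucs4")) // 4

    def __getitem__(self, i):
        # Indexing forces the fixed-width form: 4 bytes per code point.
        buf = self._get("ucs4")
        n = len(buf) // 4
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("LazyText index out of range")
        return buf[4 * i:4 * i + 4].decode("utf-32-le")

    def as_api_buffer(self):
        # Generated only when a 16-bit API asks for it.
        return self._get("ucs2")

    def __add__(self, other):
        # Concatenate just one representation: one that both operands
        # already hold, else fall back to UTF-8.
        for kind in ("utf8", "ucs4", "ucs2"):
            if kind in self._reps and kind in other._reps:
                return LazyText(self._reps[kind] + other._reps[kind], kind)
        return LazyText(self._get("utf8") + other._get("utf8"), "utf8")
```

A string built from UTF-8 holds only that form until it is indexed, at
which point the UCS-4 form is generated and kept alongside it. The
result of a concatenation starts again with a single representation.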

> It'd be good to get some feel for what proportion of
> string operations performed require indexing. Many, such as
> startswith, split, and concatenation don't require indexing. The
> proportion of operations that use indexing to scan strings would also
> be interesting as adding a (currentIndex, currentOffset) cursor to
> string objects would be another approach.

Indeed. My guess is that indexing is more common than you think,
especially when iterating over the string. Of course, iteration
could also operate on UTF-8, if you introduced string iterator
objects.
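
To make the iterator idea concrete, here is a small Python sketch of
walking UTF-8 data one code point at a time without ever computing a
character index. The function name iter_code_points is made up for the
example, and it assumes well-formed UTF-8: it checks lead bytes but
does no validation of continuation bytes or truncated input.

```python
def iter_code_points(utf8):
    """Yield code points from UTF-8 bytes in a single forward pass.

    Hypothetical helper; assumes the input is well-formed UTF-8.
    """
    pos = 0
    n = len(utf8)
    while pos < n:
        lead = utf8[pos]
        # The lead byte alone tells us how wide this character is.
        if lead < 0x80:
            width, cp = 1, lead
        elif lead >> 5 == 0b110:
            width, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            width, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            width, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % pos)
        # Each continuation byte contributes its low six bits.
        for b in utf8[pos + 1:pos + width]:
            cp = (cp << 6) | (b & 0x3F)
        pos += width
        yield cp
```

The iterator only needs to remember a byte offset, so a loop over the
characters costs time linear in the length of the data, with no UCS-4
copy being generated.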

Regards,
Martin