[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Neil Hodgson nyamatongwe at gmail.com
Mon Oct 24 09:24:23 CEST 2005


Martin v. Löwis:

> That's very tricky. If you have multiple implementations, you make
> usage at the C API difficult. If you make it either UTF-8 or UTF-32,
> you make PythonWin difficult. If you make it UTF-16, you make indexing
> difficult.

   For Windows, the code will get a little uglier, needing to perform
an allocation/encoding and deallocation more often then at present but
I don't think there will be a speed degradation as Windows is
currently performing a conversion from 8 bit to UTF-16 inside many
system calls. To minimize the cost of allocation, Python could copy
Windows in keeping a small number of commonly sized preallocated
buffers handy.

   For indexing UTF-16, a flag could be set to show if the string is
all in the base plane and if not, an index could be constructed when
and if needed. It'd be good to get some feel for what proportion of
string operations performed require indexing. Many, such as
startswith, split, and concatenation don't require indexing. The
proportion of operations that use indexing to scan strings would also
be interesting as adding a (currentIndex, currentOffset) cursor to
string objects would be another approach.

   Neil


More information about the Python-Dev mailing list