[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Tue Oct 25 00:47:22 CEST 2005

On 10/24/05, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Guido van Rossum wrote:
> > Changing the APIs would be much work, although perhaps not impossible
> > for Python 3000. For example, Raymond Hettinger's partition() API
> > doesn't refer to indices at all, and can replace many uses of find()
> > or index().
>
> I think Neil's proposal is not to make them go away, but to implement
> them less efficiently. For example, if the internal representation
> is UTF-8, indexing requires linear time, as opposed to constant time.
> If the internal representation is UTF-16, and you have a flag to
> indicate whether there are any surrogates on the string, indexing
> is constant if the flag is false, else linear.

I understand all that. My point is that it's a bad idea to offer an
indexing operation that isn't O(1).
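
To make the cost difference concrete, here is a rough sketch of the two
representations Martin describes. It is not how CPython stores strings;
the names utf8_index and Utf16String are hypothetical, and the surrogate
flag is simply computed once at construction time.

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return code point i of UTF-8 data by scanning from the start: O(n)."""
    if i < 0:
        raise IndexError(i)
    count = -1      # index of the most recent code point seen
    start = None    # byte offset where code point i begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:         # next code point begins here
                return data[start:pos].decode("utf-8")
    if start is not None:                # code point i was the last one
        return data[start:].decode("utf-8")
    raise IndexError(i)


class Utf16String:
    """Hypothetical UTF-16 string carrying a 'has surrogates' flag."""

    def __init__(self, text: str):
        self.units = text.encode("utf-16-le")
        # True if any character needs a surrogate pair (two 16-bit units).
        self.has_surrogates = any(ord(c) > 0xFFFF for c in text)

    def __getitem__(self, i: int) -> str:
        n = len(self.units)
        if i < 0:
            raise IndexError(i)
        if not self.has_surrogates:
            # Fast path: every character is exactly one 16-bit unit, O(1).
            if 2 * i >= n:
                raise IndexError(i)
            return self.units[2 * i:2 * i + 2].decode("utf-16-le")
        # Slow path: walk the units, skipping 4 bytes per surrogate pair, O(n).
        pos = 0
        for _ in range(i):
            if pos >= n:
                raise IndexError(i)
            pos += self._width(pos)
        if pos >= n:
            raise IndexError(i)
        return self.units[pos:pos + self._width(pos)].decode("utf-16-le")

    def _width(self, pos: int) -> int:
        unit = int.from_bytes(self.units[pos:pos + 2], "little")
        return 4 if 0xD800 <= unit <= 0xDBFF else 2   # high surrogate?
```

Both slow paths return the right character, but only by walking the
string from the beginning, which is the non-O(1) indexing at issue here.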

> > Perhaps we could provide a different kind of API to support the
> > latter, perhaps based on a mutable character buffer data type without
> > direct indexing?
>
> There are different design goals conflicting here:
> - some think: "all my data is ASCII, so I want to only use one
>    byte per character".
> - others think: "all my data goes to the Windows API, so I want
>    to use 2 bytes per character".
> - yet others think: "I want all of Unicode, with proper, efficient
>    indexing, so I want four bytes per char".

I doubt the last one though. Probably they really don't want efficient
indexing, they want to perform higher-level operations that currently
are only possible using efficient indexing or slicing. With the right
API, perhaps they could work just as efficiently with an internal
representation of UTF-8.
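
As a small illustration of a higher-level operation that needs no
indices, the sketch below splits UTF-8 data around a separator in two
ways. It assumes a Python in which partition() is available on the
bytes type; the two helper names are hypothetical.

```python
def split_with_find(line: bytes, sep: bytes):
    """Index-based style: the caller handles a byte offset directly."""
    i = line.find(sep)
    if i < 0:
        return line, b"", b""
    return line[:i], sep, line[i + len(sep):]


def split_with_partition(line: bytes, sep: bytes):
    """Index-free style: no offset is ever exposed to the caller."""
    return line.partition(sep)


# UTF-8 data containing a two-byte character before the separator.
line = "cl\u00e9: valeur".encode("utf-8")
head, sep, tail = split_with_partition(line, b": ")
```

Both helpers return the same three pieces, but the second never exposes
an offset, so the caller does not depend on indexing being cheap.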

> It's not so much a matter of API as a matter of internal
> representation. The API doesn't have to change (except for the
> very low-level C API that directly exposes Py_UNICODE*, perhaps).

I think the API should reflect the representation *to some extent*,
namely it shouldn't claim to have operations that are typically
thought of as O(1) that can only be implemented as O(n). An internal
representation of UTF-8 might make everyone happy except heavy Windows
users; but it requires changes to the API so people won't be writing
Python 2.x-style string slinging code.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)