[Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" martin at v.loewis.de
Tue Oct 25 00:21:06 CEST 2005


Guido van Rossum wrote:
> Changing the APIs would be much work, although perhaps not impossible
> for Python 3000. For example, Raymond Hettinger's partition() API
> doesn't refer to indices at all, and can replace many uses of find()
> or index().

I think Neil's proposal is not to make them go away, but to implement
them less efficiently. For example, if the internal representation
is UTF-8, indexing requires linear time, as opposed to constant time.
If the internal representation is UTF-16, and you have a flag to
indicate whether there are any surrogates in the string, indexing
takes constant time if the flag is false, and linear time otherwise.
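
To make the cost difference concrete, here is a rough sketch. It is
only an illustration under my own assumptions, not Neil's proposal and
not how the interpreter stores strings; the names utf8_index and
Utf16String are made up, and the input is assumed to be well-formed.

```python
import struct


def utf8_index(data, i):
    """Return code point number i of UTF-8 bytes by scanning from the start."""
    if i < 0:
        raise IndexError(i)
    pos = 0
    count = 0
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        else:
            width = 4
        if count == i:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        count += 1
    raise IndexError(i)


class Utf16String:
    """Hypothetical UTF-16 string that remembers whether it has surrogates."""

    def __init__(self, text):
        raw = text.encode("utf-16-le")
        self.units = struct.unpack("<%dH" % (len(raw) // 2), raw)
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def index(self, i):
        if i < 0:
            raise IndexError(i)
        if not self.has_surrogates:
            # Fast path: one code unit per code point, so O(1).
            return chr(self.units[i])
        # Slow path: walk the code units, treating a surrogate pair as one.
        pos = 0
        count = 0
        while pos < len(self.units):
            u = self.units[pos]
            width = 2 if 0xD800 <= u <= 0xDBFF else 1
            if count == i:
                if width == 2:
                    low = self.units[pos + 1]
                    return chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                return chr(u)
            pos += width
            count += 1
        raise IndexError(i)
```

In both slow paths the work grows with the index, because the position
of code point i is not known until everything before it has been
decoded. The flag only helps strings that stay within the BMP.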

> Perhaps we could provide a different kind of API to support the
> latter, perhaps based on a mutable character buffer data type without
> direct indexing?

There are different design goals conflicting here:
- some think: "all my data is ASCII, so I want to only use one
   byte per character".
- others think: "all my data goes to the Windows API, so I want
   to use two bytes per character".
- yet others think: "I want all of Unicode, with proper, efficient
   indexing, so I want four bytes per char".
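
The trade-off between these three goals can be seen by measuring the
same text under each encoding. This is just a sketch using the standard
codecs as stand-ins for the internal representations; the helper name
storage_sizes is made up for illustration.

```python
def storage_sizes(text):
    """Return the number of bytes text occupies under each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le avoids counting a BOM
        "utf-32": len(text.encode("utf-32-le")),
    }


for sample in ["hello", "h\u00e9llo", "h\U0001F600llo"]:
    print(repr(sample), storage_sizes(sample))
```

Pure ASCII text is smallest in UTF-8, text outside the BMP needs
surrogate pairs in UTF-16, and only the four-byte form keeps every
character the same width.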

It's not so much a matter of API as a matter of internal
representation. The API doesn't have to change (except for the
very low-level C API that directly exposes Py_UNICODE*, perhaps).

Regards,
Martin

