[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Mon Oct 24 23:31:18 CEST 2005

On 10/24/05, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Indeed. My guess is that indexing is more common than you think,
> especially when iterating over the string. Of course, iteration
> could also operate on UTF-8, if you introduced string iterator
> objects.

Python's slice-and-dice model pretty much ensures that indexing is
common. Almost everything is ultimately represented as indices: regex
search results have the index in the API, find()/index() return
indices, many operations take a start and/or end index. As long as
that's the case, indexing better be fast.

Changing the APIs would be much work, although perhaps not impossible
of Python 3000. For example, Raymond Hettinger's partition() API
doesn't refer to indices at all, and can replace many uses of find()
or index().

Still, the mere existence of __getitem__ and __getslice__ on strings
makes it necessary to implement them efficiently. How realistic would
it be to drop them? What should replace them? Some kind of abstract
pointers-into-strings perhaps, but that seems much more complex.

The trick seems to be to support both simple programs manipulating
short strings (where indexing is probably the easiest API to
understand, and the additional copying is unlikely to cause
performance problems) , as well as  programs manipulating very large
buffers containing text and doing sophisticated string processing on
them. Perhaps we could provide a different kind of API to support the
latter, perhaps based on a mutable character buffer data type without
direct indexing?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)