Speeding up: s += "string"
Oren Tirosh
oren-py-l at hishome.net
Tue Apr 15 09:59:50 EDT 2003
On Tue, Apr 15, 2003 at 02:38:47PM +0300, Beni Cherniavsky wrote:
> > It seems to me that the behavior CAN be improved. If strings were to
> > use a preallocation strategy--similar to what lists do--MANY string
> > appends could avoid a memory copy on the original string. The initial
> > value could stay where it was, and the extra characters could live in
> > the already-available contiguous space. Obviously, the utilized length
> > would need to be stored somewhere. Once in a while, of course, the
> > preallocation would fill up; in that case you'd need to do a costly
> > strcpy()... but this could be avoided MOST of the time, ISTM.
> >
> Don't forget that strings are immutable and you need the old object to
> still represent the old string. I see two approaches:
>
> 1. Each string object points to a string buffer and has an end index
> and multiple strings can share the same buffer. Then it makes
> perfect sense to also have a start index. The real gain is from
> allowing substrings (slicing) to be O(1)! The hard part is memory
> management, so that when you retain only a small part of a big
> string, the unneeded memory could be reclaimed. Ironically, this
> works well for substrings but not for concatenation because of::
>
> s = "some long string"
> print s + "a"
> print s + "b"
>
> (you can only grow the initial string for one of them). This might
> not be a problem because this usage is rare.
>
> 2. Lazy concatenation: string + string returns a proxy object that
> simply points to both; it's materialized to a real string only when
> needed (initially on any other string operation but it's easy to
> implement things like iteration without requiring it). Similar
> approaches could work for lists and dictionaries, BTW. For the
> latter I mean the non-commutative union of two dicts where one has
> priority and the other is the fallback; this would be handy for
> cases where consing to alists is used in LISP...
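Approach 2 can be sketched in a few lines. This is an illustrative sketch in present-day Python (the original discussion predates this syntax), and the class name and details are my own, not an existing API:

```python
class LazyConcat:
    """Proxy for string + string: holds the parts, flattens on demand."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        self._real = None  # cache for the materialized string

    def __add__(self, other):
        # Concatenation stays O(1): just build another proxy node.
        return LazyConcat(self, other)

    def __len__(self):
        # Some operations never need to materialize at all.
        return len(self.left) + len(self.right)

    def __iter__(self):
        # Iteration can walk the tree without flattening it.
        for part in (self.left, self.right):
            yield from part

    def _chunks(self):
        # Yield the plain-string leaves in order.
        for part in (self.left, self.right):
            if isinstance(part, LazyConcat):
                yield from part._chunks()
            else:
                yield part

    def __str__(self):
        # Any other string operation forces materialization (cached),
        # in a single O(n) pass rather than repeated copies.
        if self._real is None:
            self._real = "".join(self._chunks())
        return self._real
```

A chain of appends then costs one final join instead of a copy per `+`.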
Guido is reluctant to add major new code to the core unless it gets some
serious real-world testing as an extension module. But in the case of
strings there is a problem: you can't create an object that will be a
drop-in replacement for a string in Python unless it's actually derived
from PyStringObject and has the same internal structure. It can be an
object that behaves much like a string and it can have an __str__ method
to convert it into a string - but it will not be compatible with a
string in many situations. For example, it will not be accepted as an
argument to string methods. It's not enough to make an object that can
be converted into a string (almost any Python object can). It needs to
*be* a string without sharing the same internal structure.
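The incompatibility is easy to demonstrate. In this small sketch (class names are my own), an object offering only `__str__` converts fine but is rejected by string methods, while a genuine subclass of str is accepted everywhere:

```python
class FakeString:
    """String-like only by virtue of __str__; not a real str."""
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value

fake = FakeString("world")

# Explicit conversion works:
converted = str(fake)

# But string methods reject it -- they require a genuine str:
rejected = False
try:
    " ".join(["hello", fake])
except TypeError:
    rejected = True

# A subclass of str, by contrast, is accepted everywhere --
# precisely because it shares str's internal structure:
class SubStr(str):
    pass

joined = " ".join(["hello", SubStr("world")])
```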
Such objects could be derived from the basestring class without being a
subclass of either str or unicode. The only method they must implement
is __str__. All other string methods and operations will be emulated by
calling __str__ and passing the result to the original string method.
The object may optionally override other methods to provide a more
efficient implementation. With this kind of infrastructure in place it
would be possible to experiment with mutable strings, lazy concatenation,
shared-buffer strings, etc. as extension modules. The best solution could
be considered for eventual integration into the core.
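A rough sketch of that fallback mechanism, in modern Python since `basestring` only existed in Python 2 (the class names and the `__getattr__` trick are my own illustration, not a proposed API):

```python
class VirtualString:
    """Base class: subclasses need only supply __str__."""

    def __str__(self):
        raise NotImplementedError  # the one required method

    def __getattr__(self, name):
        # Emulate any str method by materializing first and
        # delegating to the real method on the result.
        return getattr(str(self), name)

    # Special methods bypass __getattr__, so the common ones
    # need explicit (but still trivial) forwarding:
    def __len__(self):
        return len(str(self))

    def __add__(self, other):
        return str(self) + str(other)


class Shouted(VirtualString):
    """Example subclass: materializes its value lazily, upper-cased."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text.upper()
```

A subclass can then override individual methods (say, `__len__`) with a more efficient version, exactly as the proposal describes.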
It would require quite a lot of changes (PyArg_ParseTuple, almost all the
places that call PyString_Check, etc) but I think the Python code might
actually benefit from it and become smaller and cleaner. Currently
there are a lot of places with hard-wired checks for str/unicode and
objects exposing the buffer interface. These could be handled in a more
generic way.
Oren