Speeding up: s += "string"

Oren Tirosh oren-py-l at hishome.net
Tue Apr 15 09:59:50 EDT 2003


On Tue, Apr 15, 2003 at 02:38:47PM +0300, Beni Cherniavsky wrote:
> > It seems to me that the behavior CAN be improved.  If strings were to
> > use a preallocation strategy--similar to what lists do--MANY string
> > appends could avoid a memory copy on the original string.  The initial
> > value could stay where it was, and the extra characters could live in
> > the already-available contiguous space.  Obviously, the utilized length
> > would need to be stored somewhere.  Once in a while, of course, the
> > preallocation would fill up; in that case you'd need to do a costly
> > strcopy()... but this could be avoided MOST of the time, ISTM.
> >
> Don't forget that strings are immutable and you need the old object to
> still represent the old string.  I see two approaches:
> 
> 1. Each string object points to a string buffer and has an end index
>    and multiple strings can share the same buffer.  Then it makes
>    perfect sense to also have a start index.  The real gain is from
>    allowing substrings (slicing) to be O(1)!  The hard part is memory
>    management, so that when you retain only a small part of a big
>    string, the unneeded memory could be reclaimed.  Ironically, this
>    works well for substrings but not for concatenation because of::
> 
>      s = "some long string"
>      print s + "a"
>      print s + "b"
> 
>    (you can only grow the initial string for one of them).  This might
>    not be a problem because this usage is rare.
> 
> 2. Lazy concatenation: string + string returns a proxy object that
>    simply points to both; it's materialized to a real string only when
>    needed (initially on any other string operation but it's easy to
>    implement things like iteration without requiring it).  Similar
>    approaches could work for lists and dictionaries, BTW.  For the
>    latter I mean the non-commutative union of two dicts where one has
>    priority and the other is the fallback; this would be handy for
>    cases where consing to alists is used in LISP...
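Approach 1 can be sketched in a few lines of present-day Python. The class
and attribute names here are invented for illustration, and a real
implementation would of course live in C inside the interpreter:

```python
class SharedBuf:
    """A read-only view (buffer, start, end) into a shared character buffer.

    Many SharedBuf objects can point into the same buffer, so taking a
    substring never copies anything: it just adjusts the indices.
    """

    def __init__(self, buf, start=0, end=None):
        self._buf = buf
        self._start = start
        self._end = len(buf) if end is None else end

    def __len__(self):
        return self._end - self._start

    def __getitem__(self, key):
        if isinstance(key, slice):
            i, j, _ = key.indices(len(self))
            # O(1) slicing: same buffer, new start/end indices, no copy.
            return SharedBuf(self._buf, self._start + i, self._start + j)
        return self._buf[self._start + key]

    def __str__(self):
        # Materializing the view is where a copy finally happens.
        return self._buf[self._start:self._end]


s = SharedBuf("some long string")
word = s[5:9]        # no copy, just new indices into the same buffer
print(str(word))     # "long"
```

The hard part noted above - reclaiming the big buffer when only a small
slice of it survives - is exactly what this toy version ignores.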

Guido is reluctant to add major new code to the core unless it gets some
serious real-world testing as an extension module. But in the case of
strings there is a problem: you can't create an object that will be a
drop-in replacement for a string in Python unless it's actually derived
from PyStringObject and has the same internal structure. It can be an
object that behaves much like a string, and it can have an __str__
method to convert it into a string - but it will not be compatible with
a string in many situations. For example, it will not be accepted as an
argument to string methods. It's not enough to make an object that can
be converted into a string (almost any Python object can). It needs to
*be* a string without sharing the same internal structure.
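The incompatibility is easy to demonstrate with a small made-up class
(written against modern Python, with str standing in for the 2.x string
types):

```python
class StrLike:
    """Looks like a string only through __str__; not derived from str."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return self._value


prefix = StrLike("py")

# String methods insist on a real string argument and reject the proxy:
try:
    "python".startswith(prefix)
except TypeError as exc:
    print("rejected:", exc)

# Explicit conversion works, but every caller must remember to do it:
print("python".startswith(str(prefix)))   # True
```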

Such objects could be derived from the basestring class without being a
subclass of either str or unicode. The only method they must implement
is __str__. All other string methods and operations will be emulated by
calling __str__ and passing the result to the original string method.
The object may optionally override other methods to provide a more
efficient implementation. With this kind of infrastructure in place it
would be possible to experiment with mutable strings, lazy concatenation, 
shared-buffer strings, etc. as extension modules. The best solution could
be considered for eventual integration into the core.
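As a rough pure-Python illustration of that fallback (names invented; the
real thing would be interpreter-level machinery on basestring), lazy
concatenation could look like:

```python
class VirtualStr:
    """Toy stand-in for the proposed basestring-level fallback."""

    def __str__(self):
        raise NotImplementedError

    def __getattr__(self, name):
        # Any method the subclass does not override is emulated by
        # materializing the object with __str__ and delegating to the
        # genuine string method of the result.
        return getattr(str(self), name)


class LazyConcat(VirtualStr):
    """string + string as a proxy: the join happens only when needed."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __str__(self):
        return str(self.left) + str(self.right)

    def __len__(self):
        # An optional override: cheaper than materializing the result.
        return len(self.left) + len(self.right)


s = LazyConcat("spam", "eggs")
print(len(s))        # 8, computed without joining the halves
print(s.upper())     # "SPAMEGGS", via the __getattr__ fallback
```

Note that __getattr__ only catches ordinary attribute lookups; making such
a proxy acceptable everywhere a string is (operators, C-level APIs) is
precisely what needs interpreter support.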

It would require quite a lot of changes (PyArg_ParseTuple, almost all
the places that call PyString_Check, etc.) but I think the Python
codebase might actually benefit from it and become smaller and cleaner.
Currently there are a lot of places with hard-wired checks for
string/unicode and objects exposing the buffer interface. These could
be handled in a more generic way.

   Oren





More information about the Python-list mailing list