[Python-Dev] Building strings with +=

Raymond Hettinger raymond.hettinger at verizon.net
Mon Jun 21 12:43:45 EDT 2004


What is wrong with the following code?

    def join(*iterable):
        s = ""
        for e in iterable:
            s = s + e
        return s


What if that code could run as fast as ''.join(iterable)?
How much existing code would start to perform better?
In particular, would XML code laden with += loops run faster?
Would Python be easier to teach and write?
Would the resulting code look more natural and obvious?

Without expanding the size of the string structure, I think it is
possible to implement something equivalent to the following proposal
(which for clarity, does add two words to the structure).  Roughly, the
idea is:

Add two PyObject pointers to the structure, component0 and component1,
and initialize them to NULL.

Alter string_concat(a, b) to return a new string object with:
    ob_sval = '\0'
    ob_size = len(a) + len(b)
    component0 = a         // be sure to INCREF(a)
    component1 = b         // be sure to INCREF(b)

To the start of every string method except string_len() and
string_concat(), add a macro that expands to:

    if (self->component0 != NULL) 
        _autojoin(self);

Write an _autojoin(self) method who's goal in life is to recursively
fill-out the string from the components:

* resize self to be big enough for an ob_sized string
* treat component0 and component1 as a binary tree (with only the leaves
holding real strings) and recursively (in-order) find the leaves and
copy their string data into self->ob_sval.  DECREF each component after
traversing it.
* set self->component0 to NULL to mark self as being fully joined.

The recursion depth should not be worrisome.  In the typical case, one
side of the tree will be a leaf and will not recurse further.  It gets
more interesting if the user writes things like b=a+a or has a nested
series like c=a+b; f=d+e; g=c+f.  The recursion and reference counting
trivializes the handling of those cases.

An actual implementation striving to save the two additional structure
fields could use the ob_sstate field to indicate a deferred string (my
name for a string that has not yet had ob_sval filled out).  The ob_sval
array itself would be given room to hold two object pointers for the
components and access to those pointers would be cast to (PyObject *).

If this works out, it would be an innovation.  The join([a,b,c,d])
versus a+b+c+d problem also exists in several other languages.  AFAICT,
Python would have been the first to solve it.

As a nice side benefit, __builtin__.sum() would no longer need to reject
the case where the inputs were strings.

There is a proof-of-concept patch for this at:

    www.python.org/sf/976162

Using UserString, the patch demonstrates a simplified implementation
that converts the O(n**2) string addition process into an O(n) process
like str.join().  Since the patch passes the test suite, it seems likely
that the above could be implemented in a way that is transparent to the
user.



Raymond Hettinger


P.S.  This is a repeat post.  I don't think the first one made it
through mailman this week.




More information about the Python-Dev mailing list