[Python-Dev] Building strings with +=
raymond.hettinger at verizon.net
Mon Jun 21 12:43:45 EDT 2004
What is wrong with the following code?
s = ""
for e in iterable:
s = s + e
What if that code could run as fast as ''.join(iterable)?
How much existing code would start to perform better?
In particular, would XML code laden with += loops run faster?
Would Python be easier to teach and write?
Would the resulting code look more natural and obvious?
Without expanding the size of the string structure, I think it is
possible to implement something equivalent to the following proposal
(which for clarity, does add two words to the structure). Roughly, the
Add two PyObject pointers to the structure, component0 and component1,
and initialize them to NULL.
Alter string_concat(a, b) to return a new string object with:
ob_sval = '\0'
ob_size = len(a) + len(b)
component0 = a // be sure to INCREF(a)
component1 = b // be sure to INCREF(b)
To the start of every string method except string_len() and
string_concat(), add a macro that expands to:
if (self->component0 != NULL)
Write an _autojoin(self) method who's goal in life is to recursively
fill-out the string from the components:
* resize self to be big enough for an ob_sized string
* treat component0 and component1 as a binary tree (with only the leaves
holding real strings) and recursively (in-order) find the leaves and
copy their string data into self->ob_sval. DECREF each component after
* set self->component0 to NULL to mark self as being fully joined.
The recursion depth should not be worrisome. In the typical case, one
side of the tree will be a leaf and will not recurse further. It gets
more interesting if the user writes things like b=a+a or has a nested
series like c=a+b; f=d+e; g=c+f. The recursion and reference counting
trivializes the handling of those cases.
An actual implementation striving to save the two additional structure
fields could use the ob_sstate field to indicate a deferred string (my
name for a string that has not yet had ob_sval filled out). The ob_sval
array itself would be given room to hold two object pointers for the
components and access to those pointers would be cast to (PyObject *).
If this works out, it would be an innovation. The join([a,b,c,d])
versus a+b+c+d problem also exists in several other languages. AFAICT,
Python would have been the first to solve it.
As a nice side benefit, __builtin__.sum() would no longer need to reject
the case where the inputs were strings.
There is a proof-of-concept patch for this at:
Using UserString, the patch demonstrates a simplified implementation
that converts the O(n**2) string addition process into an O(n) process
like str.join(). Since the patch passes the test suite, it seems likely
that the above could be implemented in a way that is transparent to the
P.S. This is a repeat post. I don't think the first one made it
through mailman this week.
More information about the Python-Dev