[Python-Dev] Building strings with +=
Raymond Hettinger
raymond.hettinger at verizon.net
Mon Jun 21 12:43:45 EDT 2004
What is wrong with the following code?
def join(*iterable):
s = ""
for e in iterable:
s = s + e
return s
What if that code could run as fast as ''.join(iterable)?
How much existing code would start to perform better?
In particular, would XML code laden with += loops run faster?
Would Python be easier to teach and write?
Would the resulting code look more natural and obvious?
Without expanding the size of the string structure, I think it is
possible to implement something equivalent to the following proposal
(which for clarity, does add two words to the structure). Roughly, the
idea is:
Add two PyObject pointers to the structure, component0 and component1,
and initialize them to NULL.
Alter string_concat(a, b) to return a new string object with:
ob_sval = '\0'
ob_size = len(a) + len(b)
component0 = a // be sure to INCREF(a)
component1 = b // be sure to INCREF(b)
To the start of every string method except string_len() and
string_concat(), add a macro that expands to:
if (self->component0 != NULL)
_autojoin(self);
Write an _autojoin(self) method who's goal in life is to recursively
fill-out the string from the components:
* resize self to be big enough for an ob_sized string
* treat component0 and component1 as a binary tree (with only the leaves
holding real strings) and recursively (in-order) find the leaves and
copy their string data into self->ob_sval. DECREF each component after
traversing it.
* set self->component0 to NULL to mark self as being fully joined.
The recursion depth should not be worrisome. In the typical case, one
side of the tree will be a leaf and will not recurse further. It gets
more interesting if the user writes things like b=a+a or has a nested
series like c=a+b; f=d+e; g=c+f. The recursion and reference counting
trivializes the handling of those cases.
An actual implementation striving to save the two additional structure
fields could use the ob_sstate field to indicate a deferred string (my
name for a string that has not yet had ob_sval filled out). The ob_sval
array itself would be given room to hold two object pointers for the
components and access to those pointers would be cast to (PyObject *).
If this works out, it would be an innovation. The join([a,b,c,d])
versus a+b+c+d problem also exists in several other languages. AFAICT,
Python would have been the first to solve it.
As a nice side benefit, __builtin__.sum() would no longer need to reject
the case where the inputs were strings.
There is a proof-of-concept patch for this at:
www.python.org/sf/976162
Using UserString, the patch demonstrates a simplified implementation
that converts the O(n**2) string addition process into an O(n) process
like str.join(). Since the patch passes the test suite, it seems likely
that the above could be implemented in a way that is transparent to the
user.
Raymond Hettinger
P.S. This is a repeat post. I don't think the first one made it
through mailman this week.
More information about the Python-Dev
mailing list