Building strings with +=
What is wrong with the following code? def join(*iterable): s = "" for e in iterable: s = s + e return s What if that code could run as fast as ''.join(iterable)? How much existing code would start to perform better? In particular, would XML code laden with += loops run faster? Would Python be easier to teach and write? Would the resulting code look more natural and obvious? Without expanding the size of the string structure, I think it is possible to implement something equivalent to the following proposal (which for clarity, does add two words to the structure). Roughly, the idea is: Add two PyObject pointers to the structure, component0 and component1, and initialize them to NULL. Alter string_concat(a, b) to return a new string object with: ob_sval = '\0' ob_size = len(a) + len(b) component0 = a // be sure to INCREF(a) component1 = b // be sure to INCREF(b) To the start of every string method except string_len() and string_concat(), add a macro that expands to: if (self->component0 != NULL) _autojoin(self); Write an _autojoin(self) method who's goal in life is to recursively fill-out the string from the components: * resize self to be big enough for an ob_sized string * treat component0 and component1 as a binary tree (with only the leaves holding real strings) and recursively (in-order) find the leaves and copy their string data into self->ob_sval. DECREF each component after traversing it. * set self->component0 to NULL to mark self as being fully joined. The recursion depth should not be worrisome. In the typical case, one side of the tree will be a leaf and will not recurse further. It gets more interesting if the user writes things like b=a+a or has a nested series like c=a+b; f=d+e; g=c+f. The recursion and reference counting trivializes the handling of those cases. An actual implementation striving to save the two additional structure fields could use the ob_sstate field to indicate a deferred string (my name for a string that has not yet had ob_sval filled out). The ob_sval array itself would be given room to hold two object pointers for the components and access to those pointers would be cast to (PyObject *). If this works out, it would be an innovation. The join([a,b,c,d]) versus a+b+c+d problem also exists in several other languages. AFAICT, Python would have been the first to solve it. As a nice side benefit, __builtin__.sum() would no longer need to reject the case where the inputs were strings. There is a proof-of-concept patch for this at: www.python.org/sf/976162 Using UserString, the patch demonstrates a simplified implementation that converts the O(n**2) string addition process into an O(n) process like str.join(). Since the patch passes the test suite, it seems likely that the above could be implemented in a way that is transparent to the user. Raymond Hettinger P.S. This is a repeat post. I don't think the first one made it through mailman this week.
On Mon, 2004-06-21 at 12:43, Raymond Hettinger wrote:
Without expanding the size of the string structure, I think it is possible to implement something equivalent to the following proposal (which for clarity, does add two words to the structure). Roughly, the idea is:
I have written many applications that store millions of strings. I don't know how keen I am on adding more eight more bytes of storage. An eight-byte string currently consumes 28 bytes of storage; the proposal would bump it up to 36 bytes.
Add two PyObject pointers to the structure, component0 and component1, and initialize them to NULL.
Alter string_concat(a, b) to return a new string object with: ob_sval = '\0' ob_size = len(a) + len(b) component0 = a // be sure to INCREF(a) component1 = b // be sure to INCREF(b)
It sounds like you're proposing a string implementation know as "ropes." See: http://www.sgi.com/tech/stl/ropeimpl.html http://citeseer.ist.psu.edu/boehm95ropes.html Are did I misunderstand? I think there's some merit to the idea. My initial reaction is that the data structure seems a bit complex. Strings are nice and simple. Another potential problem is that a lot of code purports to understand the internal representation of strings; that is, they use PyString_AS_STRING. It would be pretty easy to develop an alternative string implementation and do some performance tests without integrating it into the core. That would identify the gross characteristics. I assume most of the strings used internally, e.g. variable and attribute names, would almost always be simple strings and, thus, wouldn't be affected much by a different implementation. Jeremy
participants (2)
-
Jeremy Hylton -
Raymond Hettinger