[Python-Dev] Wild Idea for the Year

Raymond Hettinger raymond.hettinger at verizon.net
Sat Jun 19 15:40:07 EDT 2004


What is wrong with the following code?

    def join(*iterable):
        s = ""
        for e in iterable:
            s = s + e
        return s


What if that code could run as fast as ''.join(iterable)?
How much existing code would start to perform better?
Would Python be easier to teach and write?
Would the resulting code look more natural and obvious?

Without expanding the size of the string structure, I think it is
possible to implement something equivalent to the following proposal
(which for clarity, does add two words to the structure).  Roughly, the
idea is:

Add two PyObject pointers to the structure, component0 and component1,
and initialize them to NULL.

Alter string_concat(a, b) to return a new string object with:
    ob_sval = '\0'
    ob_size = len(a) + len(b)
    component0 = a         // be sure to INCREF(a)
    component1 = b         // be sure to INCREF(b)

To the start of every string method except string_len() and
string_concat(), add a macro that expands to:

    if (self->component0 != NULL) 
        _autojoin(self);

Write an _autojoin(self) method who's goal in life is to recursively
fill-out the string from the components:

* resize self to be big enough for an ob_sized string
* treat component0 and component1 as a binary tree (with only the leaves
holding real strings) and recursively (in-order) find the leaves and
copy their string data into self->ob_sval.  DECREF each component after
traversing it.
* set self->component0 to NULL to mark self as being fully joined.

The recursion depth should not be worrisome.  In the typical case, one
side of the tree will be a leaf and will not recurse further.  It gets
more interesting if the user writes things like b=a+a or has a nested
series like c=a+b; f=d+e; g=c+f.  The recursion and reference counting
trivializes the handling of those cases.

An actual implementation striving to save the two additional structure
fields could use the ob_sstate field to indicate a deferred string (my
name for one that has not yet had ob_sval filled out).  The ob_sval
array itself would be given room to hold two object pointers for the
components and access to those pointers would be cast to (PyObject *).

If this works out, it would be an innovation.  The join([a,b]) versus
a+=b problem also exists in several other languages.  AFAICT, Python
would have been the first to solve it.

As a nice side benefit, __builtin__.sum() would no longer need to reject
the case where the inputs were strings.



Raymond Hettinger




More information about the Python-Dev mailing list