[Python-Dev] Wild Idea for the Year

Guido van Rossum guido at python.org
Thu Jun 24 01:12:01 EDT 2004


> What is wrong with the following code?
> 
>     def join(*iterable):
>         s = ""
>         for e in iterable:
>             s = s + e
>         return s
> 
> 
> What if that code could run as fast as ''.join(iterable)?
> How much existing code would start to perform better?
> Would Python be easier to teach and write?
> Would the resulting code look more natural and obvious?
> 
> Without expanding the size of the string structure, I think it is
> possible to implement something equivalent to the following proposal
> (which for clarity, does add two words to the structure).  Roughly, the
> idea is:
> 
> Add two PyObject pointers to the structure, component0 and component1,
> and initialize them to NULL.
> 
> Alter string_concat(a, b) to return a new string object with:
>     ob_sval = '\0'
>     ob_size = len(a) + len(b)
>     component0 = a         // be sure to INCREF(a)
>     component1 = b         // be sure to INCREF(b)
> 
> To the start of every string method except string_len() and
> string_concat(), add a macro that expands to:
> 
>     if (self->component0 != NULL) 
>         _autojoin(self);
> 
> Write an _autojoin(self) method who's goal in life is to recursively
> fill-out the string from the components:
> 
> * resize self to be big enough for an ob_sized string
> * treat component0 and component1 as a binary tree (with only the leaves
> holding real strings) and recursively (in-order) find the leaves and
> copy their string data into self->ob_sval.  DECREF each component after
> traversing it.
> * set self->component0 to NULL to mark self as being fully joined.
> 
> The recursion depth should not be worrisome.  In the typical case, one
> side of the tree will be a leaf and will not recurse further.  It gets
> more interesting if the user writes things like b=a+a or has a nested
> series like c=a+b; f=d+e; g=c+f.  The recursion and reference counting
> trivializes the handling of those cases.
> 
> An actual implementation striving to save the two additional structure
> fields could use the ob_sstate field to indicate a deferred string (my
> name for one that has not yet had ob_sval filled out).  The ob_sval
> array itself would be given room to hold two object pointers for the
> components and access to those pointers would be cast to (PyObject *).
> 
> If this works out, it would be an innovation.  The join([a,b]) versus
> a+=b problem also exists in several other languages.  AFAICT, Python
> would have been the first to solve it.
> 
> As a nice side benefit, __builtin__.sum() would no longer need to reject
> the case where the inputs were strings.

It goes flat against one of my *original* design goals for strings,
which was to be really simple, obviously correct code that was also
very fast for the common case.  Your idea adds some extra (constant)
time to all string ops, and quite a bit of complexity.  There are lots
of places where knowledge of string internals is assumed, including
3rd party code using a few macros, all of which would have to be
fixed.

But wait, I think it won't fly at all unless you make ob_sval a
pointer to separately allocated memory.  Otherwise, how could
_autojoin() possibly "fix" the string without allocating the memory
for it?  I am sure that making string objects contain a pointer to the
actual string data instead of having it contiguously allocated with
the header will slow things down.

--Guido van Rossum (home page: http://www.python.org/~guido/)




More information about the Python-Dev mailing list