[pypy-dev] efficient string concatenation (yep, from 2004)

Christian Tismer tismer at stackless.com
Wed Feb 13 00:53:05 CET 2013


Hi friends,

_efficient string concatenation_ has been a topic in 2004.
Armin Rigo proposed a patch with the name of the subject,
more precisely:

[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.

Some people might remember the larger discussion about whether such a
patch should be accepted at all, because it changes the programming
style for many of us from "don't do that, stupid" to "well, you may do
it in CPython", which has quite an impact on other implementations
(is it fast on Jython now?).

For instance, it changed my own programming and teaching style a lot, of course!
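
To make that concrete, here is a tiny sketch (my own illustration, not
from the original discussion) of the two styles side by side:

> # The old, portable style: collect the pieces and join once at the end.
> parts = []
> for i in xrange(100000):
>      parts.append('X')
> s = ''.join(parts)
>
> # The style that Armin's patch made affordable on CPython >= 2.4:
> s = ''
> for i in xrange(100000):
>      s += 'X'    # resized in place when nothing else references s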

But I think nobody but people heavily involved in PyPy expected this:

Now, more than eight years after that patch appeared and made it into 2.4,
PyPy (!) still does _not_ have it!

Obviously I was misled by other optimizations, and by the fact that
this patch came from a/the major author of PyPy, who invented the
initial patch for CPython. That it would be in PyPy as well, sooner or
later, was beyond question for me. Wrong... ;-)

Yes, I agree that this is much harder to implement for PyPy without the
refcounting trick, and probably even more difficult in the case of the JIT.
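
For those who never looked at the trick itself: as far as I remember,
CPython special-cases the concatenation when the left-hand string has a
reference count of 1, i.e. only the variable being rebound holds it, and
then resizes the buffer in place instead of copying. A very rough
Python-level illustration of that decision, using a bytearray as the
resizable buffer (my own sketch, nothing like the real C code, and
CPython-specific, since PyPy has no reference counts - which is exactly
the point):

> import sys
>
> buf = bytearray(b'Hello')
> # getrefcount() holds one extra reference while it runs, so a result
> # of 2 means: only the name 'buf' refers to this object.
> if sys.getrefcount(buf) == 2:
>      buf.extend(b', world')             # grow the buffer in place
> else:
>      buf = bytearray(buf) + b', world'  # shared somewhere: copy instead
> print buf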

Nevertheless, I tried to find any reference to this missing crucial
optimization, with no success after an hour (*).

And I guess many other people are stepping in the same trap.

So I can imagine that PyPy loses some of its speed in many programs
because Armin's great hack did not make it into PyPy, and this is not
loudly declared anywhere. I believe the efficiency of string
concatenation is something people assume by default and add to the
vague CPython compatibility claim, if not explicitly told otherwise.

----

Some silly proof, using python 2.7.3 vs PyPy 1.9:

> $ cat strconc.py
> #!/usr/bin/env python
>
> from timeit import default_timer as timer
>
> tim = timer()
>
> s = ''
> for i in xrange(100000):
>      s += 'X'
>
> tim = timer() - tim
>
> print 'time for {} concats = {:0.3f}'.format(len(s), tim)

> $ python strconc.py
> time for 100000 concats = 0.028
> $ pypy strconc.py
> time for 100000 concats = 0.804
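
For comparison, here is the same loop written with the portable
''.join idiom; I did not paste timings here, but it should stay linear
on any implementation:

> from timeit import default_timer as timer
>
> tim = timer()
>
> parts = []
> for i in xrange(100000):
>      parts.append('X')
> s = ''.join(parts)
>
> tim = timer() - tim
>
> print 'time for {} joined chars = {:0.3f}'.format(len(s), tim)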

Something is needed - a patch for PyPy or for the documentation I guess.

This is not just some unoptimized function in some module; the pattern
is used all over the place and has become very common since it was
introduced.

How ironic that a foreseen problem occurs _now_, and _there_ :-)

cheers -- chris


(*)
http://pypy.readthedocs.org/en/latest/cpython_differences.html
http://pypy.org/compat.html
http://pypy.org/performance.html

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?   http://www.stackless.com/
