Hi friends,
_efficient string concatenation_ was a topic back in 2004. Armin Rigo proposed a patch with the same name as this subject line, more precisely:
[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.
Some people might remember the larger discussion about whether such a patch should be accepted at all, because it changes the programming style for many of us from "don't do that, stupid" to "well, you may do it in CPython", which has quite some impact on other implementations (is it fast on Jython now?).
It changed for instance my programming and teaching style a lot, of course!
But I think nobody but people heavily involved in PyPy expected this:
Now, more than eight years after that patch appeared and made it into 2.4, PyPy (!) still does _not_ have it!
Obviously I was misled by other optimizations, and by the fact that this patch was from a major author of PyPy, who invented the initial patch for CPython. That this would sooner or later be in PyPy as well was without question for me. Wrong... ;-)
Yes, I agree that for PyPy it is much harder to implement without the refcounting trick, and probably even more difficult in case of the JIT.
But nevertheless, I tried to find any reference to this missing crucial optimization, with no success after an hour (*).
And I guess many other people are stepping in the same trap.
So I can imagine that PyPy loses some of its speed in many programs, because Armin's great hack did not make it into PyPy, and this is not loudly declared anywhere. I believe the efficiency of string concatenation is something that people assume by default and add to the vague CPython compatibility claim, unless explicitly told otherwise.
----
Some silly proof, using python 2.7.3 vs PyPy 1.9:
$ cat strconc.py
#!env python
from timeit import default_timer as timer
tim = timer()
s = ''
for i in xrange(100000):
    s += 'X'
tim = timer() - tim
print 'time for {} concats = {:0.3f}'.format(len(s), tim)
$ python strconc.py
time for 100000 concats = 0.028
$ pypy strconc.py
time for 100000 concats = 0.804
Something is needed - a patch for PyPy or for the documentation I guess.
This is not just some unoptimized function in some module; it is used all over the place and has become a very common pattern since it was introduced.
How ironic that a foreseen problem occurs _now_, and _there_ :-)

cheers -- chris
(*) http://pypy.readthedocs.org/en/latest/cpython_differences.html
http://pypy.org/compat.html
http://pypy.org/performance.html
Hi Christian.
We have it, just not enabled by default: --objspace-with-strbuf, I think.
Something is needed - a patch for PyPy or for the documentation I guess.
Not arguing that it wouldn't be good, but I disagree that it is needed.
This is only an issue when you, as in your proof, have a loop that does concatenation. This is usually when looping over a list of strings that should be concatenated together. Doing so in a loop with concatenation may be the natural way for people new to Python, but the "natural" way to do it in Python is with a ''.join() call.
This:
s = ''.join(('X' for x in xrange(x)))
Is more than twice as fast in Python 2.7 as your example. It is in fact also slower in PyPy 1.9 than in Python 2.7, but only by a factor of two:
Python 2.7: time for 10000000 concats = 0.887
PyPy 1.9:   time for 10000000 concats = 1.600
(And of course s = 'X' * x takes only about a hundredth of the time, but that's cheating. ;-)
Lennart
On 13.02.13 08:42, Lennart Regebro wrote:
Something is needed - a patch for PyPy or for the documentation I guess.
Not arguing that it wouldn't be good, but I disagree that it is needed.
This is only an issue when you, as in your proof, have a loop that does concatenation. This is usually when looping over a list of strings that should be concatenated together. Doing so in a loop with concatenation may be the natural way for people new to Python, but the "natural" way to do it in Python is with a ''.join() call.
This:
s = ''.join(('X' for x in xrange(x)))
Is more than twice as fast in Python 2.7 as your example. It is in fact also slower in PyPy 1.9 than in Python 2.7, but only by a factor of two:
Python 2.7: time for 10000000 concats = 0.887
PyPy 1.9:   time for 10000000 concats = 1.600
(And of course s = 'X' * x takes only about a hundredth of the time, but that's cheating. ;-)
This is not about how to write efficient concatenation, and it is not about me personally. It is also not about a constant factor, which I don't really care about except in situations where speed matters.
This is about a possible algorithmic trap, where code written for CPython may behave well with some roughly O(n) behavior, and by switching to PyPy you get a surprise when the same code now shows O(n**2) behavior. Such runtime explosions can damage the trust in PyPy, especially when the code sits in some module that you did not even write yourself but just "pip install"-ed.
So this is important to know, especially for newcomers and for people who give advice to them. For the sake of algorithmic compatibility, there should be no feature with such a drastic side effect if it cannot be supported by all the other dialects.
To avoid such hidden traps in larger code bases, documentation is needed that clearly gives a warning saying "don't do that", like CS students learn for most other languages.
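[A small illustrative sketch, not part of the original mails: one quick way to spot this kind of trap is to time the same loop for N and for 2*N and look at the ratio; roughly 2x suggests linear behaviour, roughly 4x suggests quadratic. Python 2 style, to match the examples above; the numbers are noisy and JIT warm-up can blur them, so this is only a heuristic.]

from timeit import default_timer as timer

def concat(n):
    s = ''
    for i in xrange(n):
        s += 'X'
    return s

def measure(func, n):
    start = timer()
    func(n)
    return timer() - start

n = 50000
t1 = measure(concat, n)
t2 = measure(concat, 2 * n)
# a ratio near 2 suggests O(n), a ratio near 4 suggests O(n**2)
print 'T(2n)/T(n) = {:0.1f}'.format(t2 / t1)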
cheers - chris
On Wed, Feb 13, 2013 at 10:06 PM, Christian Tismer tismer@stackless.com wrote:
To avoid such hidden traps in larger code bases, documentation is needed that clearly gives a warning saying "don't do that", like CS students learn for most other languages.
How much more explicit do you want us to be?
"""6. CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations."""
from http://docs.python.org/2/library/stdtypes.html#typesseq
So please don't blame us for people not reading a warning that is already there.
Since my rewrite of the sequence docs, Python 3 doesn't even acknowledge the hack's existence and is quite explicit about what you need to do to get reliably linear behaviour:
"""6. Concatenating immutable sequences always results in a new object. This means that building up a sequence by repeated concatenation will have a quadratic runtime cost in the total sequence length. To get a linear runtime cost, you must switch to one of the alternatives below:
- if concatenating str objects, you can build a list and use str.join() at the end or else write to an io.StringIO instance and retrieve its value when complete
- if concatenating bytes objects, you can similarly use bytes.join() or io.BytesIO, or you can do in-place concatenation with a bytearray object. bytearray objects are mutable and have an efficient overallocation mechanism
- if concatenating tuple objects, extend a list instead
- for other types, investigate the relevant class documentation"""
from http://docs.python.org/3/library/stdtypes.html#common-sequence-operations
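[As a small illustration of the alternatives the quoted passage lists, not part of the original mail, written in the Python 2 style of the earlier examples (in Python 2, io.StringIO wants unicode, so the list-plus-join and bytearray variants are shown instead):]

# build via a list and a single join at the end: linear in the total length
parts = []
for i in xrange(100000):
    parts.append('X')
s = ''.join(parts)

# or build via a mutable bytearray, which over-allocates internally
buf = bytearray()
for i in xrange(100000):
    buf += 'X'
s2 = str(buf)

print len(s), len(s2)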
Deliberately *relying* on the += hack to avoid quadratic runtime is just plain wrong, and our documentation already says so.
If anyone really thinks it will help, I can add a CPython implementation note back into the Python 3 docs as well, pointing out that CPython performance measurements may hide broken algorithmic complexity related to string concatenation, but the corresponding note in Python 2 doesn't seem to have done much good :P
Regards, Nick.
Hey Nick,
On 13.02.13 15:44, Nick Coghlan wrote:
On Wed, Feb 13, 2013 at 10:06 PM, Christian Tismer tismer@stackless.com wrote:
To avoid such hidden traps in larger code bases, documentation is needed that clearly gives a warning saying "don't do that", like CS students learn for most other languages.
How much more explicit do you want us to be?
"""6. CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations."""
from http://docs.python.org/2/library/stdtypes.html#typesseq
So please don't blame us for people not reading a warning that is already there.
I don't, really not. This was a cross-posting effect. I was only using the PyPy documentation, where a lot of things are mentioned, but this behavioral difference is missing. Python-dev was not addressed at all.
... Deliberately *relying* on the += hack to avoid quadratic runtime is just plain wrong, and our documentation already says so.
If anyone really thinks it will help, I can add a CPython implementation note back in to the Python 3 docs as well, pointing out that CPython performance measurements may hide broken algorithmic complexity related to string concatenation, but the corresponding note in Python 2 doesn't seem to have done much good :P
Well, while we are at it: Yes, it says so as a note at the end of http://docs.python.org/2/library/stdtypes.html#typesseq
I doubt that many people read that far, and they do not search documentation about sequence types when they are adding some strings together. People seem to have a tendency to just try something out instead and see if it works. That even seems to get worse the better and bigger the Python documentation grows. ;-)
Maybe it would be a good idea to remove that concat optimization completely? Then people would wake up and read the docs to find out what's wrong ;-) No, that does not help, because their test cases will not cover the reality.
-----

Thinking a bit more about it.
If you think about docs improvement, I don't believe it helps to make the very complete reference documentation even more complete. Completeness is great, don't get me wrong! But what people read is what pops right into their face, and I think that is what could be added.
I think before getting people to work through long and complete documentation, it is probably easier to spark their interest with something like "Hey, are you doing things this way?" And then there is a short, concise list of bad and good things, maybe even dynamic as in WingWare's "Wing Tips", or any better approach.
From that easily reachable, only a few pages long, tabular collection of short hints and caveats there could be links to the existing, real documentation that explains things in more detail. Maybe that would be a way to get people to actually read.
Just an idea.
cheers - Chris
p.s.: Other nice attempts that don't seem to really work:
Some hints like http://docs.python.org/2/howto/doanddont.html are not bad, although that page is hidden in the HOWTO section, only addresses a few things, and its sub-title "in-depth documents on specific topics" is not what people are looking for in the first place while hacking on some code.
Looking things up in a quick ref like http://rgruet.free.fr/PQR27/PQR2.7.html is very concise but also does _not_ mention what to avoid. Others exist, like http://infohost.nmt.edu/tcc/help/pubs/python/web/
By the way, the first thing I find via google is: http://www.python.org/doc/QuickRef.html which is quite funny (v1.3)
On Wed, 13 Feb 2013 18:07:22 +0100, Christian Tismer tismer@stackless.com wrote:
I think before getting people to work through long and complete documentation, it is probably easier to spark their interest with something like "Hey, are you doing things this way?" And then there is a short, concise list of bad and good things, maybe even dynamic as in WingWare's "Wing Tips", or any better approach.
From that easily reachable, only a few pages long, tabular collection of short hints and caveats there could be links to the existing, real documentation that explains things in more detail. Maybe that would be a way to get people to actually read.
There used to be a HOWTO with this goal, but its opinions were considered outdated and/or contentious, and it was deleted:
http://docs.python.org/2.6/howto/doanddont.html
--David
On 14/02/13 01:44, Nick Coghlan wrote:
Deliberately *relying* on the += hack to avoid quadratic runtime is just plain wrong, and our documentation already says so.
+1
I'm not sure that there's any evidence that people in general are *relying* on the += hack. More likely they write the first code they think of, which is +=, and never consider the consequences or test it thoroughly. Even if they test it, they only test it on one version of one implementation on one platform, and likely only with small N.
Besides, if you know that N will always be small, then using += is not wrong.
I think we have a tendency to sometimes overreact in cases like this. I don't think we need to do any more than we are already doing: the tutor@ and python-list@ mailing lists already try to educate users to use join, the docs recommend using join, and the Internet is filled with code that correctly uses join. What more can we do? I see no evidence that the Python community is awash with coders who write code with atrocious performance characteristics, or at least no more so than in any other language.
Hi Lennart,
Sent from my Ei4Steve
On Feb 13, 2013, at 8:42, Lennart Regebro regebro@gmail.com wrote:
Something is needed - a patch for PyPy or for the documentation I guess.
Not arguing that it wouldn't be good, but I disagree that it is needed.
This is only an issue when you, as in your proof, have a loop that does concatenation. This is usually when looping over a list of strings that should be concatenated together. Doing so in a loop with concatenation may be the natural way for people new to Python, but the "natural" way to do it in Python is with a ''.join() call.
This:
s = ''.join(('X' for x in xrange(x)))
Is more than twice as fast in Python 2.7 as your example. It is in fact also slower in PyPy 1.9 than in Python 2.7, but only by a factor of two:
Python 2.7: time for 10000000 concats = 0.887
PyPy 1.9:   time for 10000000 concats = 1.600
(And of course s = 'X' * x takes only about a hundredth of the time, but that's cheating. ;-)
Lennart
This all does not really concern me, as long as it roughly has the same order of magnitude, or better, the same big O. I'm not concerned by a constant factor. I'm concerned by a freezing machine that suddenly gets 10000 times slower because the algorithms never explicitly state their algorithmic complexity. (I think I said this too often today?)
As a side note: Something similar happened to me when somebody used "range" in Python 3.3 style. He ran the same code on Python 2.7, with the crazy effect of having to re-boot: range() on 2.7 with arguments from some arbitrary input file. A newbie error that was hard to understand, because he was taught to think 'xrange' when writing 'range'. Hard for me to understand because I am no longer able to make these errors at all, or even expect them.
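[A minimal sketch of that difference, not from the original mail, with a value small enough to be harmless: in Python 2, range() materializes the whole list up front, while xrange() (and range() in Python 3) produces values lazily.]

import sys

n = 10 ** 6                  # imagine this value coming from an input file
lazy = xrange(n)             # constant-size object, values produced on demand
eager = range(n)             # Python 2: the full list is built immediately

print sys.getsizeof(lazy)    # a few dozen bytes
print sys.getsizeof(eager)   # several megabytes for the list object alone

# With n = 10 ** 10 the eager variant would try to allocate tens of
# gigabytes and can bring the machine to a crawl; the lazy one stays tiny.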
Cheers - Chris
On 13/02/13 10:53, Christian Tismer wrote:
Hi friends,
_efficient string concatenation_ was a topic back in 2004. Armin Rigo proposed a patch with the same name as this subject line, more precisely:
[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.
Some people might remember the larger discussion about whether such a patch should be accepted at all, because it changes the programming style for many of us from "don't do that, stupid" to "well, you may do it in CPython", which has quite some impact on other implementations (is it fast on Jython now?).
I disagree. If you look at the archives on the python-list@ and tutor@python.org mailing lists, you will see that whenever string concatenation comes up, the common advice given is to use join.
The documentation for strings is also clear that you should not rely on this optimization:
http://docs.python.org/2/library/stdtypes.html#typesseq
And quadratic performance for repeated concatenation is not unique to Python: it applies to pretty much any language with immutable strings, including Java, C++, Lua and Javascript.
It changed for instance my programming and teaching style a lot, of course!
Why do you say, "Of course"? It should not have changed anything.
Best practice remains the same:
- we should still use join for repeated concatenations;
- we should still avoid + except for small cases which are not performance critical;
- we should still teach beginners to use join;
- while this optimization is nice to have, we cannot rely on it being there when it matters.
It's not just Jython and IronPython that can't make use of this optimization. It can, and does, fail on CPython as well, as it is sensitive to memory allocation details. See for example:
http://utcc.utoronto.ca/~cks/space/blog/python/ExaminingStringConcatOpt
and here for a cautionary tale about what can happen when the optimization fails under CPython:
http://mail.python.org/pipermail/python-dev/2009-August/091125.html
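[To make that sensitivity concrete, a small sketch, not taken from the linked posts: CPython only attempts the in-place resize when it sees no other reference to the string, so merely keeping an alias alive pushes the same loop back to quadratic behaviour. Absolute times depend on the interpreter version and allocator; only the relative difference matters.]

from timeit import default_timer as timer

def build(n, keep_alias=False):
    s = ''
    alias = None
    for i in xrange(n):
        if keep_alias:
            alias = s        # a second reference defeats the in-place resize
        s += 'X'
    return s, alias

for keep in (False, True):
    start = timer()
    build(100000, keep)
    print 'alias kept: %-5s  time = %.3f s' % (keep, timer() - start)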
On 13.02.13 13:10, Steven D'Aprano wrote:
On 13/02/13 10:53, Christian Tismer wrote:
Hi friends,
_efficient string concatenation_ was a topic back in 2004. Armin Rigo proposed a patch with the same name as this subject line, more precisely:
[Patches] [ python-Patches-980695 ] efficient string concatenation
on sourceforge.net, on 2004-06-28.

This patch was finally added to Python 2.4 on 2004-11-30.
Some people might remember the larger discussion about whether such a patch should be accepted at all, because it changes the programming style for many of us from "don't do that, stupid" to "well, you may do it in CPython", which has quite some impact on other implementations (is it fast on Jython now?).
I disagree. If you look at the archives on the python-list@ and tutor@python.org mailing lists, you will see that whenever string concatenation comes up, the common advice given is to use join.
The documentation for strings is also clear that you should not rely on this optimization:
http://docs.python.org/2/library/stdtypes.html#typesseq
And quadratic performance for repeated concatenation is not unique to Python: it applies to pretty much any language with immutable strings, including Java, C++, Lua and Javascript.
It changed for instance my programming and teaching style a lot, of course!
Why do you say, "Of course"? It should not have changed anything.
You are right, I was actually over the top with my rant and never recommend string concatenation when working with real amounts of data. The surprise was just so big.
I tend to use whatever fits best for small initializations of some modules, where the fact that concat is cheap lets me stop thinking of big O. Although it probably does not matter much, it makes me feel uncomfortable to do something with potentially bad asymptotics.
Best practice remains the same:
- we should still use join for repeated concatenations;
- we should still avoid + except for small cases which are not performance critical;
- we should still teach beginners to use join;
- while this optimization is nice to have, we cannot rely on it being there when it matters.
I agree that the CPython documentation does say this clearly. Actually I was complaining about the PyPy documentation, which does not mention this, precisely because PyPy is otherwise so very compatible already.
2004, when this stuff came up, was a time when PyPy was already quite active, but the Psyco mindset was still around, too. Maybe my slightly shocked reaction originates from there, and my implicit assumption was never corrected ;-)
cheers - chris
Steven D'Aprano wrote:
The documentation for strings is also clear that you should not rely on this optimization:
...
It can, and does, fail on CPython as well, as it is sensitive to memory allocation details.
If it's that unreliable, why was it ever implemented in the first place?
On Wed, Feb 13, 2013 at 11:17 PM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
Steven D'Aprano wrote:
The documentation for strings is also clear that you should not rely on this optimization:
...
It can, and does, fail on CPython as well, as it is sensitive to memory allocation details.
If it's that unreliable, why was it ever implemented in the first place?
Probably because someone thought it was a good idea, and the other people asked for a review said +1 :)
On 13.02.13 22:52, Maciej Fijalkowski wrote:
On Wed, Feb 13, 2013 at 11:17 PM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
Steven D'Aprano wrote:
The documentation for strings is also clear that you should not rely on this optimization:
...
It can, and does, fail on CPython as well, as it is sensitive to memory allocation details.
If it's that unreliable, why was it ever implemented in the first place?
The _trick_ was very good; the idea was - uhm - arguable. I wish I had objected, but at that time I was only fascinated.
-- chris
(There are parallels on a larger scale, but I'm shutting up, intentionally.)
Hi Greg,
On Wed, Feb 13, 2013 at 10:17 PM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
If it's that unreliable, why was it ever implemented in the first place?
I was young and loved hacks and python-dev felt that it was a good idea at the time (http://mail.python.org/pipermail/python-dev/2004-August/046686.html).
A bientôt,
Armin.